Primer is a young company, but our growth has been nothing short of staggering. We made our first production payment in April 2021. Fast-forward to today, and we’re handling a huge volume of payments every week for some of the world’s biggest companies, more than doubling our volume in the past year alone.
But scaling at this pace has not been without its challenges. As our growth accelerated, we quickly recognized that our processing engine would need to evolve to handle the massive increase in payment volume.
Our processing engine is the system that handles payment processing, powering a core set of services responsible for executing key payment operations within Primer. These include:
- Providing a unified approach to handling different payment methods
- Managing payment states
- Enabling processor-agnostic tokenization
- Handling amounts and currency conversions
- Storing payments in our database layer
- Submitting events for related systems like webhooks and workflows
- Supporting 3D Secure authentication
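To make these responsibilities concrete, here is a minimal sketch of what such an interface could look like; the names and signatures are illustrative assumptions, not our production code.

```python
from typing import Protocol


class ProcessingEngine(Protocol):
    """Illustrative interface for the core operations described above."""

    def authorize(self, payment_method: dict, amount: int, currency: str) -> str:
        """Create a payment from a unified payment-method representation; returns a payment ID."""
        ...

    def transition(self, payment_id: str, new_state: str) -> None:
        """Manage the payment's state machine (e.g. AUTHORIZED -> CAPTURED)."""
        ...

    def tokenize(self, raw_payment_method: dict) -> str:
        """Produce a processor-agnostic token for a payment method."""
        ...

    def emit_event(self, payment_id: str, event_type: str) -> None:
        """Submit events consumed by related systems such as webhooks and workflows."""
        ...
```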
To stay ahead of our anticipated growth, in 2022, we began a project to refine our processing engine, ensuring it could support this scale while maintaining the reliability and performance merchants expect from their underlying payments infrastructure.
Our first step? Revisiting the engine’s very foundation: the way we store data.
Redesigning the database layer
There’s no way around it—our database layer needed a significant redesign. Like many fast-scaling teams, we made early decisions prioritizing speed over long-term scalability. At the time, our focus was on evolving quickly to meet merchant needs and get them live with Primer. But as we grew, it became clear that some choices weren’t sustainable.
We ended up with a complex database schema built on Amazon RDS, packed with numerous entities, relationships, and events that we had to handle. Not only was this difficult to manage, but it also wasn’t optimized for cost or our future use cases. As our scale increased, so did the inefficiencies, making it clear that it was time to rethink our approach.
The first step was to choose a database that could scale seamlessly across regions as our business becomes more global, ensuring better performance, high availability, and fault tolerance. In short, we needed a distributed solution that could grow with our needs while maintaining reliability, and after evaluating our options against these criteria, we chose CockroachDB.
Choosing the right technology was the easy part—integrating it into our systems was the real challenge. Many of our engineers had never worked with CockroachDB before. To bridge this gap, we focused on education first, running training sessions and workshops while collaborating closely with the CockroachDB team to ensure a smooth adoption.
Arguably, the most critical decision we made—beyond choosing a new database provider—was redesigning some of our data models. We decided to clearly separate Primer's two core concepts: payments and transactions.
In our new model, a transaction represents the unified state of a third-party payment request across different PSPs and payment methods. On the other hand, a payment serves as Primer’s abstraction layer above transactions, capable of combining multiple transactions into one. The status of those transactions determines the payment state, enabling key Primer functionalities such as Fallbacks, agnostic 3D Secure, network tokenization, and more.
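As a rough illustration of this split (the types and derivation rules below are simplified assumptions, not our actual schema), the relationship between the two concepts can be sketched as:

```python
from dataclasses import dataclass, field
from enum import Enum


class TransactionStatus(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    DECLINED = "declined"
    SETTLED = "settled"


@dataclass
class Transaction:
    """Unified state of a single third-party payment request (one PSP / payment method)."""
    id: str
    processor: str  # e.g. the PSP that handled this request
    status: TransactionStatus


@dataclass
class Payment:
    """Primer's abstraction above transactions; may combine several of them."""
    id: str
    transactions: list[Transaction] = field(default_factory=list)

    @property
    def status(self) -> str:
        # The payment state is derived from its transactions. This derivation is
        # deliberately simplified: the real rules (fallbacks, 3DS, etc.) are richer.
        if any(t.status == TransactionStatus.SETTLED for t in self.transactions):
            return "settled"
        if any(t.status == TransactionStatus.AUTHORIZED for t in self.transactions):
            return "authorized"
        if self.transactions and all(t.status == TransactionStatus.DECLINED for t in self.transactions):
            return "declined"
        return "pending"
```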
This redesign established a more scalable and efficient foundation for our growing platform, delivering key improvements such as:
- Clearer separation of domain entities
- A degree of denormalization, enabling faster and more efficient queries
- Elimination of database-level event subscriptions, reducing complexity and potential bottlenecks
- A well-documented schema, making it easier for new engineers to onboard and navigate the system
Rearchitecting the application layer
In Primer’s early days, our entire codebase lived in a single monorepo containing a few services. Among them was an extensive monolithic application handling the heavy lifting for the Processing Engine. As both the repository and service grew, working with them became increasingly challenging for several reasons:
- We began losing clear bounded contexts, and components became tightly coupled.
- Dependency management became challenging, making it time-consuming to roll out critical upgrades.
- Our CI grew overly complex, making it harder to maintain platform stability.
- The separation of ownership between teams became more complex.
As we transitioned to a new database, we saw an opportunity to address these challenges and dismantle our monolithic architecture. This led to the development of smaller, independent services that operate directly against CockroachDB and are more loosely coupled and independently deployable.
One of the biggest challenges in this transition was maintaining feature parity with our legacy processing engine while ensuring a seamless migration for merchants. Everything had to work exactly as before but on a fundamentally different architecture. This was particularly complex due to the sheer number of use cases across all the different payment methods and PSPs we support.
To ensure a smooth transition, we designed a controlled migration process using feature flags to seamlessly switch merchants from the old system to the new one. We implemented a gradual, percentage-based rollout, starting with a small fraction of traffic (e.g., 1%) and progressively increasing to 100%. This approach provided an additional layer of safety, allowing us to monitor performance and address any edge cases before full adoption.
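To illustrate the idea, here is a minimal sketch of hash-based percentage bucketing; the flag name and bucketing scheme are assumptions for illustration, not our actual feature-flag implementation.

```python
import hashlib


def rollout_bucket(merchant_id: str, flag_name: str) -> int:
    """Deterministically map a merchant to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{merchant_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def use_new_engine(merchant_id: str, rollout_percentage: int) -> bool:
    """Route a merchant to the new processing engine if their bucket falls
    within the current rollout percentage (e.g. 1, then 5, then 100)."""
    return rollout_bucket(merchant_id, "new-processing-engine") < rollout_percentage
```

Because the bucketing is deterministic, a given merchant stays on the same engine as the percentage increases, which keeps a rollout like this stable and easy to reason about.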
After months of work, the launch of the new system was a significant milestone that solved key scalability challenges and set the foundation for future innovations.
Data archival: balancing scalability and cost efficiency
The strategy documented above allowed us to shift all merchant traffic to the new application service and database. However, a significant amount of historical data in Amazon RDS needed to be migrated to CockroachDB.
The challenge was that new payment data was continuously generated at scale, increasing storage requirements and costs. Therefore, rather than simply migrating the data like-for-like, we approached the migration with a long-term data archival strategy.
We based our approach on the fact that payments typically require processing only for a limited period. There are occasional exceptions, such as refunds on months- or even years-old transactions, but these operations are relatively infrequent.
Given this access pattern, we designed a migration strategy that involved moving all legacy Amazon RDS data to Primer’s Data Archive, a dedicated system for storing historical payments. Instead of keeping all payments permanently in the primary database, our new processing engine can retrieve and rehydrate archived payments on demand from cold storage.
This means that when a merchant needs to perform an operation on an older payment—such as issuing a refund—the system dynamically retrieves the data from the Data Archive and moves it into CockroachDB for further processing.
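In outline, the rehydration path looks something like the sketch below; the `db` and `archive` clients are hypothetical stand-ins for the real systems.

```python
def load_payment(payment_id: str, db, archive) -> dict:
    """Fetch a payment from the primary database, rehydrating it from the
    Data Archive first if it has already been moved to cold storage."""
    payment = db.get_payment(payment_id)   # hot path: payment is still active
    if payment is not None:
        return payment

    archived = archive.fetch(payment_id)   # cold path: pull from the Data Archive
    if archived is None:
        raise KeyError(f"Unknown payment: {payment_id}")

    db.insert_payment(archived)            # rehydrate into CockroachDB so the refund
    return archived                        # (or other operation) can proceed as normal
```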
By separating active and archived data with this approach, we’ve migrated our historical data efficiently and built a scalable foundation for handling an ever-increasing volume of payments while keeping storage costs under control.
Scaling testing for a complex payments ecosystem
Our platform's stability and reliability are critical. A single bug can cause revenue loss for our merchants. Therefore, we must ensure that everything we ship is thoroughly tested under real-world conditions.
Between 2022 and 2024, we made significant advancements in our testing methodology and continue to refine our approach.
One of Primer's unique challenges is its extensive third-party ecosystem. We integrate with PSPs, alternative payment methods (APMs), fraud providers, card networks, and more, each with its own APIs, behaviors, and potential points of failure. While robust unit and integration tests cover our core services, we needed a more comprehensive strategy for end-to-end testing that accounted for Primer’s internal workflows and interactions with these external services.
Back in 2022, our end-to-end test coverage was relatively small. Since then, we’ve significantly expanded our test suites, covering Primer’s standalone functionality as well as the many ways merchants interact with Primer’s third-party integrations.
However, testing against external providers at scale presents several challenges:
- Primer supports over 100 third-party integrations carrying production volume, each with its own API behavior.
- Third-party services can experience downtime or incidents, causing instability in test results.
- Many integrations require unique credentials and sandbox (testing) accounts, making automated testing more complex.
To address these issues, we developed a smarter way of running end-to-end tests in our CI infrastructure. Rather than executing every possible third-party test on every code change, we’ve built our CI pipelines to understand the code changes made and only trigger relevant tests.
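In simplified terms, the pipeline maps changed paths to the end-to-end suites they affect and runs only those; the paths and mapping below are illustrative assumptions rather than our actual configuration.

```python
# Hypothetical mapping from source paths to the end-to-end suites they affect.
PATH_TO_SUITES = {
    "connections/stripe/": ["e2e/stripe"],
    "connections/adyen/": ["e2e/adyen"],
    "core/payments/": ["e2e/core", "e2e/stripe", "e2e/adyen"],  # core changes fan out wider
}


def select_suites(changed_files: list[str]) -> set[str]:
    """Return only the end-to-end suites relevant to the files changed in a PR."""
    suites: set[str] = set()
    for path in changed_files:
        for prefix, mapped in PATH_TO_SUITES.items():
            if path.startswith(prefix):
                suites.update(mapped)
    return suites


# Example: a change touching only the Stripe connection triggers only the Stripe e2e suite.
assert select_suites(["connections/stripe/client.py"]) == {"e2e/stripe"}
```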

By making our end-to-end testing more intelligent and efficient, we’ve minimized strain on our CI infrastructure and third-party providers and accelerated deployments. But most crucially, we’ve dramatically improved our reliability, recording a 93% drop in incidents per one million payments and ensuring we have a platform merchants can depend on.
Advanced monitoring: reducing incident volume and noise
The final key area of this project to scale Primer’s core processing engine is the continuous enhancement of our monitoring system. As traffic increased, our engineering on-call workload became overwhelming, especially while running two systems in parallel.
At one point, engineers were paged more than ten times a day. Some alerts led to actual incidents, while many were false positives. This level of noise made it challenging to prioritize real issues and created unnecessary operational strain.
To address this, we focused on systematically refining our alerts and thresholds. We conducted thorough post-mortems after every incident, ensuring that each issue resulted in clear action items to improve system stability. Over time, we:
- Built detailed dashboards to monitor system health in real time.
- Developed clear runbooks to streamline incident response.
- Iteratively optimized alerts to reduce false positives while maintaining strong coverage.
Through these efforts, we tripled the number of monitors and alerts while significantly reducing noise. As a result, escalations to the engineering team became less frequent, helping to improve the team's overall work-life balance and making it easier to identify real issues with the system.
Engineering for the future: scaling beyond today’s challenges
Everything described in this article has been crucial in scaling Primer’s processing engine to handle our massive payment growth, but this is just the beginning. Our payment volume increases every month, our use cases become more complex, and the number of integrations expands.
That’s why we approach everything we build with future scale in mind. Our focus remains on making Primer’s processing engine bulletproof, high-performance, and adaptable—ensuring it meets today’s demands and is ready for whatever we build next.