Building reliability as a product: how we kept Primer stable while doubling throughput

6 min read

In April of 2021, Primer processed its first production payment. By early 2025, we were processing a large number of payments per week, and had doubled our throughput from the previous year. At that time, we wrote about the challenges that come with scaling this type of system so quickly. 

Since then, our throughput has, once again, more than doubled in the past year.

All of the improvements we made in the past helped us support larger amounts of traffic, but we also had to solve new problems to protect the platform’s reliability, the merchant experience, and the overall stability of our system. 

In the past year alone, besides releasing several new products, we reduced the number of payment processing incidents per million payments by 50% compared to start of year. We also decreased latency by up to 50% for many of our critical flows. 

Achieving this came from a collection of initiatives designed to make our systems run like a well-oiled machine. Today, I want to take you through a few of those initiatives, and the way we think about the health of our system, in a deep dive of how we’re working to make Primer reliable for our merchants. 

The mentality: Fix things and move as fast as you can 

At Primer, we consider reliability a core product feature. 

It might not be making the largest headlines, but it is absolutely vital for merchants, and it’s a core part of why they choose Primer. Being reliable enough for others to trust us with their business is a responsibility we take seriously, and providing a high-availability service remains a top priority with everything we build at Primer. 

Software development in competitive spaces often comes with a mentality of “move fast, break things.” This is an approach to systems development born out of necessity. It’s a level of ruthlessness necessary when you’re looking for product-market fit, have a limited runway, or want to iterate very fast. 

But for a company of Primer’s maturity, that is not an option. We do want to still move as fast as possible, but we have merchants trusting us with their business, so we must act as custodians of those businesses, and not just as developers. That doesn’t mean becoming slow or indecisive, but it means that we have to flip that expression on its head to honor our commitments with our merchants. First, fix things and solve problems. Then, while doing that, move as fast as we can. And don’t break anything along the way. 

This mentality isn’t just aspirational, rather, it’s actively enforced through our engineering processes. We place a strong emphasis on incident reviews and follow through on action items, deliberately invest in making our test suites even more thorough, regularly review reliability and performance metrics, and continuously work to reduce blind spots in our monitoring and alerting. 

This commitment to reliability is at the heart of why we were able to make such meaningful improvements to our system. All of the other concrete points mentioned here would not be possible without it. 

Fault-tolerance requires constant investment 

When designing our systems, we have had to face the uncomfortable fact that third-party systems all have their own failure rates. 

We knew this already from interacting with multiple payment processors, and Primer already offers ways for merchants to deal with this risk by configuring Fallbacks in case their primary payment processor is down. But as we process more payments every day, we understand that this isn’t just a risk with payment processors. What happens when a third-party provider we rely on fails, or the merchant’s own system fails? At the limit, this is a risk that exists for every service that Primer, and by extension our merchants, depend on. 

Because of this, we have continuously invested in redundancy. Just like our merchants reduce the risk that comes with depending on a single payment processor by configuring a fallback, so too can Primer reduce the risk of impact with a similar strategy. Two excellent examples exist in our 3D Secure (3DS) and Webhook products. 

3DS 

For 3DS, we started out by relying on a single 3DS provider to process 3DS authentications. But we quickly realized that, no matter how reliable that 3DS provider’s reliability guarantees were, we were incurring risk. If that provider had an outage, this would impact merchant traffic. 

To mitigate this risk, we looked at the problem from a different angle by asking ourselves: “Let’s say the 3DS provider failed. How do we still protect our merchants?” 

Once we formulated it like this, it became obvious that redundancy was the way out. Primer then integrated support for a second 3DS provider with strong reliability guarantees, on top of our already existing one. Then, to ensure this protection was fast enough to protect merchant traffic as quickly as possible in the event of an outage, we implemented per request fallback between the two. This means that if a request to our primary 3DS provider fails for a technical reason on the provider’s side, Primer will automatically retry it against our secondary 3DS provider, recovering the flow in the event of an outage. This will happen in real time, with only the added latency of an extra HTTP request. 

Webhooks 

Webhooks are a key part of how merchants process payments, and allow many Primer’s merchants to react to things that happen within the payment’s lifecycle. 

As our traffic grew, so did the number of webhook failure modes. For a single payment, Primer may emit multiple webhook events, meaning webhook volume can scale rapidly with merchant traffic. When a merchant’s webhook receiver slows down or goes offline, delivery attempts back up, retries increase, and what would otherwise be normal inbound payment traffic can quickly translate into a much larger spike in outbound webhook load on our side. 

Much like in the case of 3DS, the way to solve this was to start from the disaster scenario, asking ourselves: “What happens when all our other defense mechanisms fail, and requests start to fail?” 

Once we framed the problem this way, a solution presented itself. We realized that besides maturing our retry strategies, optimizing how we do jitter, or anything in that vein, we also needed to ensure proper redundancy was in place. 

Because of this, we implemented multiple rails for webhooks to be processed through. Think of these like train tracks handling merchant traffic. We could spend a lot of time making sure that these tracks have the most advanced technology possible to make the trains run on time. And we have done many optimizations like this. But at the end of the day, accidents happen. We could not rely on a single track to handle all carriages, no matter how good that track was. 

Now, merchant webhooks dynamically get sent through the rails that are best suited for the characteristics of that specific request. If the merchant’s webhook receiver is struggling, our system automatically moves that traffic to different rails to protect the merchant’s traffic health and Primer’s platform reliability.

Different strategies for delivery, retries, etc. This makes webhooks more likely to be delivered.

This improvement alone has prevented several would-be incidents at Primer in the past few months. 

Both the 3DS and Webhook cases are examples that highlight a more general philosophy we’ve taken to ensure Primer’s reliability: while we are still optimizing our flows as much as possible, and trying to stop things from breaking, we are also actively considering how things will break. This kind of thinking ensures we fail gracefully and protect our merchants from even the least expected outages. 

Performance optimizations are contextual 

One of the things that we’ve also learned in the past year of making Primer more reliable is that performance optimizations are largely contextual. You could simply think “good performance” is good performance. A lower latency number is always better. 

But in practice, the tradeoffs are more subtle than they first appear. Consider a system where the most latency-efficient approach would be to write to several database tables in parallel. From a correctness standpoint, however, those updates must be performed within a single transaction: they need to succeed or fail together to avoid an inconsistent state. 

That requirement is non-negotiable in a payments system. Correctness always comes first. Correctness and reliability constraints shape the space in which performance optimizations are possible. The rule is simple: we can never break correctness for the sake of better latency. 

Problems like these get even more complicated when you consider the sheer volume of traffic that Primer is now handling. Queries we thought were well optimized started showing signs of degradation. Not because the query pattern itself was inefficient, but because the underlying data tables had grown too large in the past few years. Optimizing the queries further would have yielded diminishing returns. We needed to rearchitect parts of the system with the new throughput requirements in mind. 

Getting optimization right in a part of the system takes a carefully measured approach. In one case, one of our access patterns that was showing the most signs of degradation was optimized multiple times from a code-level, technical perspective, with meager improvements each time. Then, we introduced a Time-To-Live (TTL) on non-critical parts of that table. Doing so yielded almost half a second of improvement on average for flows using that table, far more than we’d gotten out of previous optimizations. 

In another example we’ve previously written about, we started moving old data that merchants are unlikely to need into cold storage. If merchants try to access that data, we will then rehydrate it on the fly, which will be a bit slower than before, for a very narrow set of use cases. But that tradeoff means that the vast majority of flows, accessing data that is not very old, became much faster. 

The examples above show that the best optimizations are not always low level, or about tweaking the code. To provide our merchants with the best possible experience in stability and reliability, we need to consider the full picture of their operational needs and optimize accordingly. Performance work is inherently contextual. The question is not just, “how do we make things faster?” but “how do we make the paths that matter most to merchants fast, without compromising correctness or reliability?” 

Keeping up with future demands 

Payments move faster than our team can grow. As Primer’s payment volume grows, we’ve deliberately worked to ensure that our incident rate does not grow alongside it, because if operational issues scaled proportionally with traffic, it would quickly become impossible for any team to keep up. 

That’s why reliability continues to be a first-order priority for Primer’s engineering teams. 

We build new systems with scale and failure in mind from the start, and we continuously invest in improving the stability of existing ones. Requirements evolve, traffic patterns change, and different scenarios call for different tradeoffs, but maintaining predictable, reliable behavior as the system grows is a constant goal. 

Businesses trust Primer with their payments, and that is a responsibility we take seriously. Reliability is not just a feature of our product. Reliability is the product.

Want to see how Primer handles reliability at scale? Book a call with our team.

Want to learn more KEY FACTS?

To download, please fill in your email

Stay up to date

Subscribe to get the freshest payment insights.