Building a Resilient Card Transaction System
Author Stephen Chang is an Engineering Manager at Robinhood working on Payments.
In March 2022, Robinhood launched the new Cash Card, a debit card that helps people invest as they spend! Debit cards are an incredibly popular and common method of payment, responsible for over half of all card payments made in the US. But what exactly does a debit card do?
Fundamentally, a debit card is a financial tool that allows you to easily access money from your account at any time. All you need to do is present your card to a merchant and within seconds you can walk away with your goods, services, or even cash back! Simple, right? Well, behind the scenes is a complex ecosystem of merchants, payment networks, card processors, issuing banks, card program managers, and more. All of these parties need to work together to determine whether the merchant should process the payment: is this a valid card, was it reported stolen, does the account have enough funds, and so on. Most importantly, this entire ecosystem must be available and running 24/7/365. Debit cards are often used to pay for critical everyday purchases like groceries and gas, so any system downtime could leave customers stranded without access to their money. How, then, can we build a resilient card transaction system?
What exactly is a debit card transaction?
First, let’s ask: what exactly is a debit card transaction? At its core, a debit card transaction consists of two parts:
- Authorization request — “can this customer spend $XX.XX”
- Authorization response — “yes they can” or “no they cannot”
The authorization (auth) request is a simple question generated by a merchant’s Point-of-Sale (POS) terminal. Using the card number as an identifier, the merchant sends the request across the appropriate payment network (Mastercard) to our debit card processor, Galileo, which forwards it to Robinhood. The request carries the amount the customer is attempting to spend, along with a variety of metadata about the merchant and the transaction itself (merchant information, magnetic card swipe vs chip vs contactless payment, etc.). Using this information, our systems decide whether the transaction should be authorized or declined. Once this decision is made, the response is sent back to our processor, across the network, and back to the merchant.
1. Card swipe at the Merchant Point-of-Sale (POS) system
2. Auth request transmitted across the Mastercard network
3. Auth request is processed by our payments processor Galileo
4. Auth request is forwarded to Robinhood
5. Auth decision is made
6. Auth response is sent back to Galileo, across the network, and back to the merchant
7. Transaction is approved or declined at the merchant
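To make the two halves of a transaction concrete, here is a minimal sketch of how the request and response might be modeled. Every class and field name here is an assumption made for illustration; the actual message format used by the processor is not described in this post.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class EntryMode(Enum):
    """How the card was presented at the terminal."""
    MAGNETIC_SWIPE = "magnetic_swipe"
    CHIP = "chip"
    CONTACTLESS = "contactless"


@dataclass(frozen=True)
class AuthRequest:
    """The question: can this customer spend this amount?"""
    card_number: str       # identifies the card, and through it the account
    amount: Decimal        # how much the customer is attempting to spend
    currency: str          # hypothetical field, e.g. "USD"
    merchant_name: str     # hypothetical stand-in for the merchant metadata
    entry_mode: EntryMode  # swipe vs chip vs contactless


@dataclass(frozen=True)
class AuthResponse:
    """The answer: yes they can, or no they cannot."""
    approved: bool
    reason: str = ""       # hypothetical: why a request was declined


request = AuthRequest(
    card_number="0000000000000000",  # placeholder, not a real card number
    amount=Decimal("12.34"),
    currency="USD",
    merchant_name="Example Grocery",
    entry_mode=EntryMode.CHIP,
)
response = AuthResponse(approved=True)
```

Using `Decimal` rather than `float` for the amount avoids rounding surprises when comparing against a balance.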
This entire flow must also be highly reliable and meet a tight service-level agreement (SLA) on response time. Specifically, Robinhood has 2 seconds to respond to our payments processor (steps 3–6 above). If the auth decision is not made in under 2 seconds, the transaction is automatically declined. This time requirement applies around the clock, meaning we cannot support even regularly scheduled downtime during maintenance windows. Any downtime, degraded system performance, or scheduled maintenance will result in customers being unable to use their cards to make transactions.
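The effect of the deadline can be sketched as a timeout wrapped around the decision. This is only an illustration of the behavior described above, not the processor's actual mechanism; `answer_within_sla` and the two sample decision functions are hypothetical names.

```python
import asyncio

SLA_SECONDS = 2.0  # from the text: the response is due within 2 seconds


async def answer_within_sla(decide, request, sla_seconds=SLA_SECONDS):
    """Return decide(request), or "DECLINE" if the deadline is missed."""
    try:
        return await asyncio.wait_for(decide(request), timeout=sla_seconds)
    except asyncio.TimeoutError:
        # No decision arrived in time, so the transaction is declined.
        return "DECLINE"


async def fast_decision(request):
    """Hypothetical decision that answers immediately."""
    return "APPROVE"


async def slow_decision(request):
    """Hypothetical decision that takes 0.2 seconds to answer."""
    await asyncio.sleep(0.2)
    return "APPROVE"


# A short deadline is used here so the example finishes quickly.
on_time = asyncio.run(answer_within_sla(fast_decision, {}, sla_seconds=0.05))
too_late = asyncio.run(answer_within_sla(slow_decision, {}, sla_seconds=0.05))
```

Note that the slow decision would have approved the purchase; it is declined only because the answer arrived late, which is why latency matters as much as correctness here.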
Diving into the authorization decision
Debit card authorization is complex and has many components. First, there is the base functionality: receiving the auth request through a webhook from the processor and parsing its metadata. Once we have the metadata, we can start the authorization process. Some of the components of authorization include:
- Identification of the customer and account
- Checking if the account is in good standing — account is active, unrestricted, etc.
- Checking if their debit card is available for use — card is valid, activated, not locked, not reported lost, stolen, or damaged, etc.
- Checking if the transaction is supported — some currencies, countries, merchants, or types of transactions may not be supported on Robinhood (see help center for more details)
- Various risk and fraud checks — Is this transaction suspicious or fraudulent?
- Balance checks — how much cash is available to spend?
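One way to picture the checks listed above is as an ordered chain in which the first failing check declines the transaction. This is a simplified sketch, not the production logic; the context fields, the reason strings, and the `SUPPORTED_CURRENCIES` set are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from decimal import Decimal
from typing import Callable, List, Optional, Tuple

SUPPORTED_CURRENCIES = {"USD"}  # hypothetical; the real list is not given here


@dataclass(frozen=True)
class AuthContext:
    """Everything the checks need to know (hypothetical fields)."""
    account_active: bool
    card_activated: bool
    card_locked: bool
    currency: str
    flagged_as_fraud: bool
    available_cash: Decimal
    amount: Decimal


# Each check returns None on success or a decline reason on failure.
Check = Callable[[AuthContext], Optional[str]]


def check_account(ctx):
    return None if ctx.account_active else "account_not_in_good_standing"


def check_card(ctx):
    usable = ctx.card_activated and not ctx.card_locked
    return None if usable else "card_unavailable"


def check_supported(ctx):
    return None if ctx.currency in SUPPORTED_CURRENCIES else "unsupported_transaction"


def check_fraud(ctx):
    return "suspected_fraud" if ctx.flagged_as_fraud else None


def check_balance(ctx):
    return None if ctx.available_cash >= ctx.amount else "insufficient_funds"


CHECKS: List[Check] = [check_account, check_card, check_supported,
                       check_fraud, check_balance]


def authorize(ctx: AuthContext, checks: List[Check] = CHECKS) -> Tuple[bool, Optional[str]]:
    """Run the checks in order and stop at the first failure."""
    for check in checks:
        reason = check(ctx)
        if reason is not None:
            return False, reason
    return True, None


good = AuthContext(True, True, False, "USD", False, Decimal("50.00"), Decimal("12.34"))
```

For example, `authorize(good)` approves, while `authorize(replace(good, card_locked=True))` declines with the reason `"card_unavailable"`. In a distributed system each of these checks may be a call to a different service, which is where the complexity comes from.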
On top of these evaluation criteria come the complexities introduced by our system architecture. Robinhood’s systems are distributed, and numerous services are involved in the authorization flow. This is especially true for Robinhood’s legacy debit card on Cash Management accounts: legacy debit card transaction authorizations need to interact with portfolio balances, margin requirements, equity and option trades, and much more. Every service added to the authorization flow adds complexity and increases the risk of unexpected errors, failures, and latency issues. Each one brings its own networking issues, infrastructure concerns around resource availability and database load, and maintenance windows.
With so many moving parts, technical issues can quickly compound. Any downtime or degraded performance in any of the services could potentially bring down debit card authorization.
Building a resilient system
So how, then, do we build a resilient system that is robust and maximizes our ability to serve auth traffic reliably? Our solution is to build a second, lightweight backup system that can serve as a stand-in service to handle all traffic when the primary system is degraded or down. To reduce the chances of both services going down at the same time, the backup system is built with a different architecture, tech stack, languages, infrastructure, and even separate deployment schedules.
The most significant difference between these two systems is in the architecture. The core service has a “pull” based architecture, where the most up-to-date information is queried on demand each time an authorization request comes in. We query the account status, restriction status, the latest account balance, any ongoing fraud trends, etc. from all the necessary downstream dependencies and databases. The backup service, on the other hand, has a “push” based architecture: it caches the latest state of each cardholder’s account by subscribing to asynchronous updates broadcast over Kafka streams. Because the state is already local when a request arrives, the auth lookup is much faster and more lightweight.
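The “push” side of this design can be illustrated with a small in-memory cache that is updated by events and consulted at auth time. This is a sketch under stated assumptions: the event types and field names are invented for illustration, and a plain Python list stands in for the Kafka stream so that the example does not depend on any particular client library.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict


@dataclass
class CachedAccount:
    """Latest known state of one cardholder's account (hypothetical fields)."""
    active: bool = True
    card_locked: bool = False
    balance: Decimal = Decimal("0")


class AccountStateCache:
    """Holds account state pushed by upstream services."""

    def __init__(self) -> None:
        self._accounts: Dict[str, CachedAccount] = {}

    def apply(self, event: dict) -> None:
        """Fold one asynchronous update into the cached state."""
        account = self._accounts.setdefault(event["account_id"], CachedAccount())
        kind = event["type"]
        if kind == "balance_updated":
            account.balance = Decimal(event["balance"])
        elif kind == "card_locked":
            account.card_locked = True
        elif kind == "card_unlocked":
            account.card_locked = False
        elif kind == "account_restricted":
            account.active = False

    def authorize(self, account_id: str, amount: Decimal) -> bool:
        """Decide from local state only, with no downstream calls."""
        account = self._accounts.get(account_id)
        if account is None:
            return False  # unknown account: decline
        return (account.active
                and not account.card_locked
                and account.balance >= amount)


# A list stands in for the stream of updates.
stream = [
    {"account_id": "acct-1", "type": "balance_updated", "balance": "40.00"},
    {"account_id": "acct-2", "type": "balance_updated", "balance": "500.00"},
    {"account_id": "acct-2", "type": "card_locked"},
]
cache = AccountStateCache()
for event in stream:
    cache.apply(event)
```

With this state, `cache.authorize("acct-1", Decimal("25.00"))` approves, while any request for `acct-2` is declined because its card is locked. The trade-off of this style is that a decision is only as fresh as the last update received.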
On an infrastructure level, these two services also have two completely separate databases. This further helps isolate potential problems that arise from bad migrations, slow queries, and generally any type of degradation that can affect database performance.
Putting it all together
Now we have two separate systems, each making its own decision on every auth request. How do we decide when to use the primary vs the secondary auth decision?
Here, we employ a circuit breaker design. The backup system sits in front of the primary service, serving as a pass-through server between our card processor and our primary auth service. Both systems make their own auth decision in parallel, and we set a timer that expires ahead of our strict 2-second SLA. At the end of the time allowance, we compare both results.
- Receive both primary and secondary decisions -> Use primary decision
- Receive primary but NOT secondary decision -> Use primary decision
- Receive secondary but NOT primary decision -> Use secondary decision
- Receive neither primary nor secondary decision -> True outage, no decision, automatic decline
As you can see, we give preference to the primary auth decision from the core service. The only time we use the secondary backup decision is when a valid decision is not received from the core auth service within the SLA. This could be due to a full system outage, degraded performance, a specific bug in our core logic, or anything else that causes an error or a timeout.
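The selection rules above can be sketched with `asyncio`: run both decisions concurrently, stop waiting when the timer fires, and prefer the primary result. This is a minimal illustration, not the production implementation. The function names and the 1.5-second default timer are assumptions, and unlike the description above, this sketch returns as soon as both decisions are in rather than always waiting for the timer.

```python
import asyncio
from typing import Optional


def select_decision(primary: Optional[str], secondary: Optional[str]) -> str:
    """Prefer primary, fall back to secondary, decline if neither arrived."""
    if primary is not None:
        return primary
    if secondary is not None:
        return secondary
    return "DECLINE"  # true outage: no decision, automatic decline


async def _safe(decide, request) -> Optional[str]:
    """Treat an error in a decision system the same as no decision."""
    try:
        return await decide(request)
    except Exception:
        return None


async def authorize(request, primary, secondary, timer_seconds=1.5):
    """Run both systems in parallel and pick a result when the timer ends."""
    tasks = [asyncio.create_task(_safe(primary, request)),
             asyncio.create_task(_safe(secondary, request))]
    done, pending = await asyncio.wait(tasks, timeout=timer_seconds)
    for task in pending:
        task.cancel()  # too late to be useful
    results = [t.result() if t in done else None for t in tasks]
    return select_decision(results[0], results[1])


# Hypothetical decision systems used to exercise each case.
async def approves(request):
    return "APPROVE"


async def declines(request):
    return "DECLINE"


async def hangs(request):
    await asyncio.sleep(1.0)
    return "APPROVE"


async def crashes(request):
    raise RuntimeError("bug in core logic")
```

For example, `asyncio.run(authorize({}, hangs, approves, timer_seconds=0.05))` falls back to the secondary approval, while `asyncio.run(authorize({}, declines, approves, timer_seconds=0.05))` returns the primary decline even though the backup would have approved.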
Break glass scenarios
We also have a few contingencies that allow us to short-circuit either the primary or the backup system completely on demand. The backup system is more lightweight than the primary auth service, but that does not make it immune to downtime. If the backup goes down, we can route all requests directly to the primary service until we can fix it. Alternatively, we can temporarily force the backup decision to stand in for every auth decision and prevent auth requests from being routed to the primary service at all. We have employed both of these options in the past when performing major database upgrades on the primary and secondary services. By temporarily routing all traffic to one service, we can perform maintenance on the other while still maintaining uptime for the system as a whole.
Maintaining two redundant services with different stacks does come with its challenges. New features and improvements may need to be implemented twice, and engineers need to be fluent in two different languages. Still, we find that the benefits outweigh the disadvantages. By isolating the architecture, infrastructure, and even the language in which each service is written, we minimize the risk of commingling code and introducing the same bug into both systems at the same time. The combined result is a system that is more resilient and reliable than either individual component can be on its own.
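These overrides can be pictured as a routing mode that controls which systems receive a request at all. The sketch below is illustrative only: `RoutingMode`, `targets_for`, and `route` are hypothetical names, how the real switch is operated is not described here, and the handlers are called one after another for simplicity rather than in parallel.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple

# A handler returns "APPROVE", "DECLINE", or None when it has no decision.
Handler = Callable[[dict], Optional[str]]


class RoutingMode(Enum):
    NORMAL = "normal"              # both systems decide, primary preferred
    PRIMARY_ONLY = "primary_only"  # backup is bypassed, e.g. while it is fixed
    BACKUP_ONLY = "backup_only"    # primary receives no traffic at all


def targets_for(mode: RoutingMode) -> Tuple[str, ...]:
    """Which systems should receive the request in this mode."""
    if mode is RoutingMode.PRIMARY_ONLY:
        return ("primary",)
    if mode is RoutingMode.BACKUP_ONLY:
        return ("backup",)
    return ("primary", "backup")


def route(request: dict, mode: RoutingMode, handlers: Dict[str, Handler]) -> str:
    """Call only the allowed systems, then prefer the primary decision."""
    decisions = {name: handlers[name](request) for name in targets_for(mode)}
    for name in ("primary", "backup"):
        decision = decisions.get(name)
        if decision is not None:
            return decision
    return "DECLINE"  # no decision from any allowed system
```

In `BACKUP_ONLY` mode the primary handler is never invoked, which is what makes it safe to take the primary service down for maintenance.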
The Robinhood Money spending account is offered through Robinhood Money, LLC (“RHY”) (NMLS ID: 1990968), a licensed money transmitter. The Robinhood Cash Card is a prepaid card issued by Sutton Bank, member FDIC, pursuant to license from Mastercard®. Cash Management is offered by Robinhood Financial LLC (Member SIPC), a registered broker dealer. Both are wholly-owned subsidiaries of Robinhood Markets, Inc. (“RHM”).
Robinhood Markets Inc. and Medium are separate and unique companies and are not responsible for one another’s views or services.
All trademarks, logos and brands are the property of their respective owners.
© 2022 Robinhood Markets, Inc.