Part I: Scaling Robinhood Clearing Accounting
Since 2018, Robinhood has been self-clearing under Robinhood Securities — a clearing firm that exists solely to support Robinhood customers. With self-clearing, we no longer rely on third parties to reconcile and settle our customers’ trades. Self-clearing allowed us to eliminate fees that our former clearing firm charged, achieve 24/7 application approval, and provide a better customer experience around statements and trade confirmations.
Why is having our own clearing operations good for our customers?
After a customer places a trade on Robinhood, it will be routed to and executed by one of our partnered market maker venues. After the execution, a clearing process is needed to reconcile Robinhood’s view of the trade against that of the trade’s counterparty, and eventually settle the assets after two days (or one day in the case of option trading). More details about clearing and settlement can be found in this support article. At the center of any clearing firm is its accounting system; the system that keeps accounting up-to-date has to be robust. If accounting ledgers are stale, downstream operations like trade reconciliation or settlement will be blocked. This may in turn prevent the clearing firm from being able to fulfill its obligation to report to authorities like the Depository Trust and Clearing Corporation (DTCC) in a timely manner, since we will not be able to match our records with our counterparties’ if our accounting book is not up to date.
The problem
The accounting system’s initial implementation was a set of tables in a single Postgres database, and homogeneous Kafka consumers that updated those tables synchronously. It was done as a gigantic transaction with heavy locking to prevent race conditions between multiple consumers. It worked when we first launched back in 2018, but as our customer base grew, we started to see the accounting system lagging behind the ever-increasing volume.
Throughout 2020, our growth accelerated, and as a result we focused on scaling our original accounting system to meet the needs of the business. The system continued to function without negative impacts to our customers, but it was clear to us that fixing reactively was not a permanent solution, and it was time to rebuild the system with scalability as a design principle. This won’t be an easy task as the clearing domain is intrinsically complex with dozens of responsibilities yet demanding in terms of correctness.
Finding the asynchronous solution
First off, the clearing accounting system isn’t on the critical path of order placement by customers, so its latency will not be perceived by the customers. It will, however, be perceived by our clearing operation teams.
Both the computation and the storage of our old system were monolithic, so in order for our new accounting system to be truly scalable, we needed to tackle both problems. Here is how we broke down the computational monolith.
The following graph illustrates the responsibilities of the Kafka consumers, the only computational component in the old accounting system.
The graph shows a parallelism of two but in reality the consumer group had a much higher parallelism, equaling the number of partitions of its upstream Kafka topic. There are only two ways we can further increase the capacity of this system:
- Increase the parallelism even more.
- Make the total time shorter to process a message.
We decided to first take a step back and review the requirements of the accounting system. Under the current implementation, all steps were stuffed in a synchronous transaction for simplicity, so that developers were able to move fast and not worry about more granular consistency/atomicity requirements. With some due diligence, we were able to identify that:
- Executions don’t have to be updated atomically with ledger entries and account balances.
- Ledger entries do not have to be updated atomically with account balances either. They only need to be eventually consistent with account balances at the end of the day.
- Customer balances updates don’t have to be atomic with firm balance updates.
These observations showed us that it’s possible to make processing faster by making what used to run synchronously asynchronous. Asynchrony provides a repartition opportunity to aggregate updates to the same firm account, and therefore allows us to write the aggregated updates to the database without locks.
We reimagined our accounting system as a pipeline of asynchronous processes, and the following diagram illustrates what it looks like.
The capacity of this new system is highly dependent on the speed each process can publish messages to its downstreams. Each consumer, both writes to a Postgres database and produces some Kafka topics. Even though absolute atomicity between the two actions is not required, we need each consumed message to have its database side effect written idempotently and its offspring messages produced at least once, if and only if, the DB side effects are visible. We can’t include message production inside the Postgres transaction, because even if we place it at the very end, there’s always a chance that the DB commit fails. In such cases, the messages would’ve been sent, but the DB side effects are not visible yet. Traditionally, such requirements are fulfilled with the generic paradigm below:
- Write the messages that are supposed to be produced into Postgres under the same transaction with other side effects, with a flag for each message indicating that it’s not sent yet and an ordering key.
- Have a separate process to scan the message table, which uses the metadata discussed above to decide which messages to send and in what order, send such messages to their corresponding Kafka topics, and update the metadata so that those messages won’t be sent again when the next scanning happens.
This paradigm worked well for applications whose rate of messages is low. For applications like the accounting system, this becomes a bottleneck because of the necessity of scanning and updating a rapidly-growing table. We realized that we should take advantage of two facts in our case:
- Processing happens within Kafka consumers, where Kafka logs and committed offsets provide a natural retry mechanism.
- Deduplication is done via a lookup table in our Postgres DB that is also updated within the same transaction as other DB side effects.
Specifically, we can do the following to eliminate the scan-update paradigm while still maintain the consistency requirements:
- Consume Kafka messages
- Process messages and commit DB transaction
- Produce messages to downstream Kafka topics
- Commit offsets for the consume messages
If something goes wrong in step 3, the Kafka consumer would know where to pick up since the committed offsets would still be the old ones, and the re-processing of those messages would be no-op because the lookup table would’ve already been updated in the previous run. This change has cut down the average time to produce a regular-sized message from ~20ms to less than 0.1ms. Production to Kafka is no longer a bottleneck for the accounting system.
After this massive rework, we increased our accounting system’s capacity by more than 10x in terms of message throughput. The improvement hasn’t shown a single sign of lagging since this effort was complete.
Lessons learned
At this point, we’ve only finished half of the puzzle. The accounting system was still bound by the storage bottleneck no matter how optimized our computational layer was. With Robinhood’s growth rate, we could have come up against that challenge at any time — we were on the clock. In part 2, we talk about how we addressed those limitations, and horizontally scaled our accounting system from end to end.
Robinhood Securities, LLC (member SIPC), is a registered broker dealer and provides brokerage clearing services.
It is a wholly-owned subsidiary of Robinhood Markets, Inc. (“Robinhood”). © 2022 Robinhood
Robinhood and Medium are separate and unique companies and are not responsible for one another’s views or services.