Creating a SEV process that scales with Robinhood
At Robinhood, we take money seriously. We never want to prevent our customers from managing their investments or accessing financial information. In our efforts to prevent outages, we are fortunate to follow the trail blazed by earnest engineers at successful companies. We have learned a lot from resources like Etsy’s “Blameless Postmortems” and Google’s SRE Resources. In this post, we want to share some of our experiences as we get better at responding to and learning from incidents.
Starting SEV Reviews
From the earliest days at Robinhood, when things went wrong in production we would write up a “SEV Report” and post a link to it in a dedicated Slack channel. This worked well when engineering was small enough that each engineer was familiar with just about every system. As we scaled, we found that Slack was great for visibility, but it didn’t promote robust discussion. The audience for the discussion around a SEV could be unclear, and the discussion was sometimes hard to follow for someone lacking context. To try to address this, we created a weekly meeting where we would review that week’s SEVs.
This new meeting — which we call SEV Review — wasn’t an immediate success. We’d get together and spend a lot of time talking about Postgres VACUUM ANALYZE or exactly how to handle the application of a hotfix to a release branch, but often fail to come to a conclusion or record the follow-ups. Because we viewed these incidents primarily from the vantage of Engineering, we were mostly focused on the root cause and didn’t spend much time talking about the response. Because we weren’t always consistent in our SEV reporting, we were missing some opportunities to learn from these events.
We set out to make SEV Reviews better. We standardized our criteria for categorizing production incidents as SEVs. We got better at identifying and drilling into the important details without rat-holing on them. We started to look at how SEVs affected the whole company, not just Engineering, and invited teams beyond Engineering, such as our Operations and Customer Experience (CX) teams, to SEV Review meetings. After a couple of months, our investment in this process paid off. We improved our tooling and processes in ways we know have prevented subsequent incidents. It also led to a practice we think is fun and somewhat unusual, which we call Incident Response Drills.
Practice makes perfect
SEV Reviews are about learning both how to prevent issues and how to mitigate them when they do happen. We knew that if we could speed up and improve our response, we would lessen the impact of the issues that did occur. As a first pass, we drafted an Incident Response Protocol that outlined what we thought might be some best practices when responding to a SEV. As with any engineering solution, we wanted to test and iterate before introducing an unproven process in the context of a real incident. So we scheduled a drill.
We gathered a small drill-planning team and brainstormed a list of issues that might come up in production. Using these ideas, we built a scenario that would allow us to test our Incident Response Protocol. Over the course of a couple of weeks leading up to the drill, we worked to answer questions that would make the scenario more realistic and the response more interesting: How would the incident first be noticed? What metrics would be affected that would be detectable with monitoring? What did we think was the best way to resolve the scenario? What data could we add to make the scenario more detailed?
As with anything new, it turned out that our original Incident Response Protocol needed some refinement. But a series of drills (and even some real SEVs) gave us the opportunity to iterate on it, each time reviewing the effectiveness of these updates in SEV Review. By focusing increasingly on our response, we found ways to improve it. This included identifying ways to work together better, like formalizing roles within the response team. We also set new norms for the use of Slack, Google Hangouts, and conference rooms that made it easier for the response team to work efficiently.
As we learned to collaborate through these drills, we developed more empathy for and understanding of the important ways that every team at Robinhood works to respond to issues that could affect our customers. Now we routinely come together as one team to resolve problems, and just about every department at Robinhood participates in SEV Review. Judging by attendance, it’s actually become everyone’s favorite meeting of the week. We are proud that we are learning through our response to these issues, because we know that’s how we grow and become stronger.
A SEV Review for the SEV Process
As you can see, our SEV process has matured quite a bit since the early days here at Robinhood. It has been impactful and popular, and it has been a wonderful institution since the very early stages of the company. But we’ve also grown immensely over the past several years. With the addition of new teams and team members, as well as new products and services that are used by millions of customers, we started to notice cracks in the process as the organization scaled rapidly. To help prevent breakdowns and ensure we could continue to track the status of SEVs and the corrective actions we put in place, we needed to do some refining. We also needed to make sure that everyone joining Robinhood could apply the process effectively, and that we were sharing institutional knowledge of best practices and learnings more broadly.
We knew that we could continue to improve our SEV process just as SEVs themselves improve our engineering and operations. So, we asked for feedback from cross-functional teams involved in all phases of the SEV process and we identified the following major areas of improvement.
Incident response coordination
During incident response, we found it was hard to muster all the right on-call engineers and operations teams via Slack and quickly get them “caught up” on the current state of the investigation and response. To help, we created an internal Slackbot named “SEVbot” that allows non-technical folks to quickly page the appropriate teams and append updates to a succinct incident response log. There’s even a “mayday” feature that pages the on-call engineer from every team.
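To make the idea concrete, here is a minimal sketch of the kind of bookkeeping a bot like this might do. It is not SEVbot's actual implementation, which we haven't described here: the `Incident` class, the `page` and `mayday` functions, and the on-call roster are all hypothetical names, and real paging and Slack integration are left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical roster mapping each team to its current on-call engineer.
ON_CALL = {"brokerage": "alice", "payments": "bob", "infra": "carol"}


@dataclass
class Incident:
    """An in-memory incident with an append-only response log."""
    title: str
    log: list = field(default_factory=list)
    paged: list = field(default_factory=list)

    def append_update(self, author, text):
        # Timestamp every entry so latecomers can read the log in order.
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%SZ")
        self.log.append(f"[{stamp}] {author}: {text}")


def page(incident, team):
    """Record a page to one team's on-call engineer (at most once each)."""
    engineer = ON_CALL[team]
    if engineer not in incident.paged:
        incident.paged.append(engineer)
        incident.append_update("sevbot", f"paged {engineer} ({team} on-call)")
    return engineer


def mayday(incident):
    """Page the on-call engineer from every team."""
    return [page(incident, team) for team in sorted(ON_CALL)]


incident = Incident("Elevated order placement errors")
page(incident, "payments")
incident.append_update("cx-lead", "customers report failed deposits")
mayday(incident)
print("\n".join(incident.log))
```

The point of the sketch is the shape of the data: one short, timestamped log per incident that anyone can append to, so that a newly paged engineer reads a few lines instead of scrolling through a busy channel.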
In the past, we did reasonably well following up on “short-term” corrective actions, e.g. adding new alerts, introducing canary deployments, or automating manual processes. However, many SEVs identified “long-term” corrective actions that typically required major system updates or changes to architectural patterns. We knew these actions would be highly impactful, but we needed a better process for prioritizing and following up on them. To address this gap, we introduced SEV Retrospective meetings where we extract themes from SEVs over the past few months and then globally plan, prioritize, and track progress on the most impactful long-term projects.
The original SEV process grew organically and intrinsically just became part of our culture, but only parts of the process were documented. This led to inconsistent application of the process as people asked things like:
- How do I track and communicate the status of a SEV?
- Do I have to write the SEV report for minor SEVs?
- What qualifies as a corrective action? How long do I have to implement them?
To address this, we clearly defined the process with “RFC-style” language that identified the timeline for SEVs and resulting actions (e.g. the required time frame for a response, as well as the corrective action and implementation time frame). We vetted this process definition and gathered feedback across teams to ensure it was feasible and delivered our desired outcomes. We feel confident that process clarity and consistency will help us scale as we onboard so many new cats here at Robinhood.
While we’re really proud of our SEV process and the work that goes into continually improving our systems and processes, we didn’t have a quantitative way to assess whether the process was truly delivering value to the business. To address this, we’ve annotated all historical and new SEVs with rich metadata, including SEV level, product track, time to detect, time to resolve, and hierarchical tags detailing things like root cause and affected areas/systems. We use this metadata to build reports that help us analyze trends and answer business-level questions like “Which product tracks had the most SEVs in July?”, “Which systems are causing the most SEVs?”, or “Has the overall number of SEV1 incidents declined over the last 3 months?” Answers to these questions help us prioritize areas of investment, and we believe these proxy metrics will help us better understand whether our investments are indeed delivering business value.
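As an illustration of how metadata like this can answer those questions, here is a small sketch. The field names follow the list above, but the `Sev` record, the slash-separated tag format, the helper functions, and every sample record are assumptions made up for the example; they are not our real schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sev:
    opened: date
    level: int                 # 1 is the most severe
    product_track: str
    minutes_to_detect: int
    minutes_to_resolve: int
    tags: tuple                # hierarchical, e.g. "root-cause/deploy/config"


# Invented sample records, for illustration only.
SEVS = [
    Sev(date(2019, 5, 3), 1, "brokerage", 4, 35, ("system/orders", "root-cause/deploy/config")),
    Sev(date(2019, 6, 11), 2, "cash", 10, 60, ("system/payments", "root-cause/dependency/vendor")),
    Sev(date(2019, 7, 2), 1, "brokerage", 3, 20, ("system/orders", "root-cause/deploy/migration")),
    Sev(date(2019, 7, 9), 2, "crypto", 7, 45, ("system/market-data", "root-cause/capacity")),
    Sev(date(2019, 7, 21), 3, "brokerage", 15, 90, ("system/orders/router", "root-cause/deploy/config")),
]


def sevs_by_track(sevs, year, month):
    """Which product tracks had the most SEVs in a given month?"""
    return Counter(s.product_track for s in sevs
                   if (s.opened.year, s.opened.month) == (year, month))


def count_tag_children(sevs, prefix):
    """Count SEVs per first-level child of a tag prefix, once per SEV."""
    counts = Counter()
    for sev in sevs:
        children = {tag[len(prefix) + 1:].split("/")[0]
                    for tag in sev.tags if tag.startswith(prefix + "/")}
        counts.update(children)
    return counts


def monthly_counts(sevs, level):
    """Number of SEVs at a given level, keyed by (year, month)."""
    return Counter((s.opened.year, s.opened.month) for s in sevs if s.level == level)


print(sevs_by_track(SEVS, 2019, 7).most_common())
print(count_tag_children(SEVS, "system").most_common())
print(sorted(monthly_counts(SEVS, level=1).items()))
```

Hierarchical tags are what make roll-ups cheap: counting the children of "system" groups "system/orders" and "system/orders/router" together, while a report on "system/orders" could still drill down to the router.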
Continuing to refine
We have shown that hard work in defining, implementing, and refining our SEV process has helped us continue to make our systems more reliable. But we also know that no matter how much we improve, we will never be able to prevent every production issue. Complex systems can break in surprising ways. Our SEV Process is all about embracing this fact by treating each of these moments as a valuable opportunity to refine our systems for detection, response, and prevention. By continually investing in the process itself, we will strengthen our company as it scales.
If you’re interested in helping us scale our systems and ultimately help more people participate in our financial system, come join us at Robinhood!