Building a Safety First Incident Response Process with SEV Tool
Author(s): Engineers Betty Zhao and Zero Cho.
Incident Response at Robinhood
Incident response plays an important role in maintaining platform stability by providing a standard process for responding to anything from an outage to a security incident. As a Safety First company, we are responsible for keeping your money safe and accessible, and we take that responsibility very seriously. Incidents are categorized and triaged based on their severity and risk to customer funds and information. If private customer information is at risk, the incident is routed through a separate process. At Robinhood, we refer to incidents as SEVs (short for service events). In this post, we will talk about our incident response process and how we use SEV Tool to resolve and learn from incidents.
Previously, we talked about our post-incident review and reporting process when we were using a Google Docs and Sheets-based manual reporting system that was meeting our needs at the time. As Robinhood has grown, we’ve had to scale up our processes, automate workflows, and extend our process into incident handling and prevention.
We’re excited to share the investments we’ve made in this area, starting with creating an incident response process that aims to:
- Enable responders to resolve incidents faster
- Learn from our past experiences to inform better, more reliable decisions
- Build a culture of reliability at the company
Lifecycle of a SEV
SEVs range in level from minor incidents (SEV3) to significant outages (SEV0). SEV0s and SEV1s (major incidents) are declared when many customers' funds and/or the firm are at risk, whereas minor incidents may affect a smaller subset of customers or infrequently used external features.
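The level scheme above maps naturally onto an ordered enum. The following is a minimal sketch, not SEV Tool's actual data model: the names `SevLevel` and `is_major` are hypothetical, and the only facts taken from the text are that levels run from SEV0 to SEV3 and that SEV0 and SEV1 count as major incidents.

```python
from enum import IntEnum


class SevLevel(IntEnum):
    """Severity levels; a lower number means a more severe incident."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def is_major(level: SevLevel) -> bool:
    """SEV0 and SEV1 are treated as major incidents (hypothetical helper)."""
    return level <= SevLevel.SEV1
```

Using an `IntEnum` keeps levels comparable, so "at least as severe as SEV1" is a plain `<=` check.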
Incident response comprises several phases: detection, notification, response, mitigation, and analysis. We have an entirely blameless SEV culture that encourages Hoodies across all functions to participate in the response at any point in the lifecycle.
- Detection: The incident response begins when the incident has been detected through automated alerting, customer feedback, or employee reports.
- Notification: At this point in the process, service owners are informed of a potential issue. An initial investigation occurs to determine the scope and impact of the issue.
- Response: The incident has been declared as a SEV, and the work to remediate the issue begins. An Air Traffic Controller (ATC) acts as the incident coordinator to organize the response and pull in the right people.
- Mitigation: Once a potential solution has been identified, our responders work to mitigate the SEV and remediate the issue.
- Analysis: After the SEV has been mitigated, the responders identify SEV Corrective Actions (SCAs) and write up a SEV Report with the technical details. We have weekly SEV Reviews to go over completed SEV Reports to learn from the incident, with the goal of using this knowledge to prevent future SEVs.
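One way to picture the lifecycle above is as an ordered sequence of phases. This is an illustrative sketch only: `Phase` and `next_phase` are hypothetical names, and the strictly linear progression is a simplifying assumption (a real response may revisit earlier phases).

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECTION = "detection"
    NOTIFICATION = "notification"
    RESPONSE = "response"
    MITIGATION = "mitigation"
    ANALYSIS = "analysis"


# Enum members iterate in definition order, which matches the lifecycle order.
ORDER = list(Phase)


def next_phase(phase: Phase) -> Optional[Phase]:
    """Return the phase that follows `phase`, or None after the final phase."""
    index = ORDER.index(phase)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None
```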
Introducing SEV Tool
SEV Tool is the internal tool at Robinhood for incident response. It is a part of our SRE (Site Reliability Engineering)-owned Incident Response Suite, which encompasses all of our incident response and on-call tooling. It is integrated with Slack, Jira, Google Drive, and Opsgenie to facilitate coordination and prevention.
The goals of SEV Tool are to:
- Support the entire end-to-end incident response lifecycle
- Implement best practices for reliability
- Reduce toil and human error by automating manual processes
- Collect metrics to track SEVs
Building the tool in-house gives us greater control and flexibility over the shape of our incident response process. It also enables a tighter feedback loop and higher engagement with our internal users, thus reinforcing our goal of getting everyone to participate in building a culture of reliability.
Mitigating the Incident
When a SEV occurs, employees use the SEV Tool to file a SEV. The intake form is deliberately kept sparse to minimize friction when opening a SEV — the details can be filled in later.
The SEV Tool Slackbot announces newly created SEVs in a sevs Slack channel and creates two Slack channels for each SEV: one for announcements and one for the response. SEV status updates are automatically cross-posted from the response channel to the announcements channel.
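The channel setup and cross-posting described above can be sketched with an in-memory stand-in for Slack. Everything here is an assumption made for illustration: `FakeSlack`, `open_sev`, `post_status_update`, and the `sev-<id>-...` naming scheme are hypothetical, and no real Slack API is called.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSlack:
    """In-memory stand-in for a chat backend: channel name -> posted messages."""
    channels: dict = field(default_factory=dict)

    def create_channel(self, name: str) -> None:
        self.channels.setdefault(name, [])

    def post(self, channel: str, text: str) -> None:
        self.channels[channel].append(text)


def open_sev(slack: FakeSlack, sev_id: int, title: str) -> tuple:
    """Create the two per-SEV channels and announce the SEV in the shared channel."""
    announcements = f"sev-{sev_id}-announcements"  # hypothetical naming scheme
    response = f"sev-{sev_id}-response"            # hypothetical naming scheme
    for name in (announcements, response, "sevs"):
        slack.create_channel(name)
    slack.post("sevs", f"SEV {sev_id} opened: {title}")
    return announcements, response


def post_status_update(slack: FakeSlack, channels: tuple, text: str) -> None:
    """Post an update in the response channel and cross-post it to announcements."""
    announcements, response = channels
    slack.post(response, text)
    slack.post(announcements, text)
```

Keeping the chat backend behind a small interface like this also makes the flow easy to exercise in tests without touching a live workspace.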
Responders join the response channel to work on remediating the SEV. Air Traffic Controllers are automatically paged via Opsgenie and added to the response channel. Within the response channel, SEV Tool provides a variety of custom Slack emoji commands to facilitate creating SCAs, assigning SEV report owners, and mitigating the SEV. Slack’s rich emoji feature set provides a convenient way to interact with a service like SEV Tool: you can see the full name of a custom emoji by hovering over it, discover emoji names through autocomplete, and use emoji in message text or as reactions.
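Emoji commands like these are often implemented as a lookup from emoji name to handler. The sketch below shows that general pattern and is not SEV Tool's code: the emoji names (`sev-mitigated`, `sev-report-owner`), the dict-based SEV record, and the choice to make the reacting user the report owner are all assumptions.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[dict, str], str]
COMMANDS: Dict[str, Handler] = {}


def command(emoji_name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler under a (hypothetical) emoji name."""
    def register(handler: Handler) -> Handler:
        COMMANDS[emoji_name] = handler
        return handler
    return register


@command("sev-mitigated")
def mark_mitigated(sev: dict, user: str) -> str:
    sev["status"] = "mitigated"
    return f"{user} marked the SEV as mitigated"


@command("sev-report-owner")
def assign_report_owner(sev: dict, user: str) -> str:
    # Assumption: the user who reacts becomes the report owner.
    sev["report_owner"] = user
    return f"{user} now owns the SEV report"


def handle_reaction(sev: dict, emoji_name: str, user: str) -> Optional[str]:
    """Run the handler for a known emoji command; ignore ordinary emoji."""
    handler = COMMANDS.get(emoji_name)
    return handler(sev, user) if handler else None
```

Unknown emoji fall through and return `None`, so ordinary reactions in the channel have no side effects.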
Learning from Past SEVs
SEV Corrective Actions are tracked as Jira tickets with a dedicated assignee, categorized as ‘Must Do’, ‘Should Do’, and ‘Could Do’, in order of highest to lowest priority. SCAs are meant to be preventative; we want to make sure that we continuously harden our systems and ensure that the same incident does not occur again.
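As a rough illustration of the three priority tiers, corrective actions can be modeled as records with an ordered priority. The `Priority` and `CorrectiveAction` types and the `by_priority` helper are hypothetical; the actual tickets live in Jira, which this sketch does not touch.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Priority(IntEnum):
    """Tiers from the text, ordered from highest to lowest priority."""
    MUST_DO = 0
    SHOULD_DO = 1
    COULD_DO = 2


@dataclass(frozen=True)
class CorrectiveAction:
    summary: str
    assignee: str  # each SCA has a dedicated assignee
    priority: Priority


def by_priority(actions: Iterable[CorrectiveAction]) -> List[CorrectiveAction]:
    """Return actions with the highest-priority tier first (stable within a tier)."""
    return sorted(actions, key=lambda action: action.priority)
```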
SEV Reports are Google Docs generated from a standard template. Each report includes a technical analysis and a discussion on the impact and lessons learned. Completed reports are posted to our sev-reports Slack channel with full visibility to all employees, and a select few reports are chosen for further discussion during our weekly SEV Review meetings.
All past and ongoing SEVs can be viewed through the SEV Tool web UI. SEV Tool periodically exports all SEV data to our data lake, which powers our Looker dashboards. Many of these dashboards provide a high-level overview of SEVs by level and service. For more detailed analysis, all our SEV data is presented in a spreadsheet format in the SEV Repository dashboard.
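A high-level overview by level and service boils down to a grouped count over exported SEV records. The sketch below assumes each exported record is a dict with `level` and `service` keys; that schema and the function name are illustrative guesses, not the real export format.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_level_and_service(sevs: Iterable[Mapping[str, str]]) -> Counter:
    """Count SEVs per (level, service) pair, assuming a hypothetical record schema."""
    return Counter((sev["level"], sev["service"]) for sev in sevs)
```

For example, two SEV2 records for a hypothetical `orders` service would produce a count of 2 under the key `("SEV2", "orders")`.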
SEV Tool Architecture
The diagram below outlines the architecture of our system. Our client, server, and Slack event listeners run on Kubernetes pods through Robinhood’s Archetype Framework.
We rely on third-party integrations to execute our incident response process. Our most extensive integration is with Slack; our Slackbot is a key part of our incident response workflow. We use WebSocket connections to listen for incoming Slack messages, reactions, and commands.
When a SEV is filed, the server calls the Slack API to create Slack channels and post messages, and we notify responders by paging them via the Opsgenie API. Throughout the response, responders file corrective actions using our Slackbot and Jira integration, creating issues directly in teams’ project boards for triaging. Finally, when the SEV has been mitigated, a SEV report owner is assigned via a Slack emoji command, and we use the Google Drive API to create a new Google Doc based on our SEV report template.
We retry actions wherever possible to account for any of our third-party integrations going down. It is imperative that we can still proceed with a SEV response when this occurs. All of the above interactions can be performed through our web UI or the Slackbot; we aim to have parity between the two interfaces so that SEV Tool still works when Slack is down.
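Retrying a flaky third-party call is commonly done with a small wrapper. The post does not say which retry strategy SEV Tool uses, so the exponential backoff, the default values, and the `with_retries` name below are all assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on any exception with exponential backoff.

    The final failure is re-raised so the caller can fall back to another
    path, such as the web UI.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without real delays; in production code the exception filter would likely be narrowed to transient network errors.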
In the future, we want to work on expanding SEV Tool to automate the detection and notification aspects of incident response. This includes pulling in alerts from our Alertmanager to provide an escalation path from alerts to SEVs and automatically paging and adding relevant service owners to SEVs as they are created.
If you’re interested in shaping the future of reliability and incident response at Robinhood, consider joining our SRE team!
© 2022 Robinhood Markets, Inc.