Building an Application Deployment Platform on Kubernetes
How we’re reshaping our infrastructure while decreasing complexity
Behind the simple interface of Robinhood lies a complex web of microservices, all working together to provide a seamless customer experience. As Robinhood has grown, this web of microservices has grown as well, requiring us to revisit the infrastructure and tools we use to provision and operate these microservices. In this post, we’ll talk a little about why we’re embracing Kubernetes to tackle these challenges, share some stories from our experience onboarding applications onto Kubernetes, and discuss the platform we built to manage and standardize our Kubernetes-powered applications, called the Archetype Framework.
Why Kubernetes?
Kubernetes and the surrounding ecosystem have experienced a massive surge in popularity over the last few years — but this alone is not a sufficient reason for a company to embark on a major infrastructure overhaul. Our foray into Kubernetes was dictated by the problems we were seeking to solve rather than any one technology solution.
Historically, we have used a combination of Terraform and SaltStack to manage our AWS infrastructure. While this combination of technologies has carried us quite far (from our early days to over six million Robinhood accounts and dozens of microservices), we have run into some technical challenges along the way as we've grown. Most notably, deployments could be non-deterministic depending on how the Salt states were written, and applying the Salt states across the hosts for our larger microservices could be time-intensive. It also gradually became clear that the interface we had for provisioning microservice infrastructure could be improved to better serve and streamline workflows for application owners. In particular, we wanted to create a user-centered interface that best serves application developers and the abstractions they're familiar with.
Switching to Kubernetes seemed like a no-brainer. Moving toward containerization and container orchestration not only aligned with a company focus on building for the long-term, but also enabled us to solve the technical challenges we were facing around deployment. Furthermore, Kubernetes supports promising application-oriented abstractions such as Deployments, and provides a solid structure for extending these abstractions through CustomResourceDefinitions and custom controllers. Additionally, its API-first approach makes it much easier to interact with dynamically, compared to Salt.
While all these factors created a sense of optimism around Kubernetes, we still needed to vet it in a disciplined way to see if it would be a magical, out-of-the-box solution to the challenges we identified (Spoiler Alert: It wasn’t).
Where did we start?
Conducting high-reliability infrastructure migrations is no easy feat, so our first objective was defining a restrained plan for our Kubernetes investigation. This involved descoping GKE (we didn't want our foray into Kubernetes to require going multi-cloud), assessing EKS and kops, conducting internal experimentation and proofs of concept, and more (this initial work could be another whole post on its own).
Ultimately, we decided to gradually migrate a single application's microservices from Salt and EC2 to Kubernetes. This effort ended up being a multi-month process (which could be yet another blog post). Once the migration was complete, we had to evaluate whether we had actually moved the needle on the problems we were seeking to solve. We saw roughly 2x improvements in our deployment speeds, and our servers could automatically scale out much more quickly. We also gained confidence in the consistency and immutability of our deployments, with the application container image as our source of truth.
On the other hand, we had unwittingly replaced thousands of lines of Salt config YAML with thousands of lines of Kubernetes manifest YAML. The complexity from Salt remained, though in a slightly different form. Salt states for setting up common tooling — Consul agents, Vault integrations, nginx configs, Prometheus exporters, and more — had morphed into cryptic annotations, init containers, and sidecars. Raw Kubernetes manifests on their own, while functional, failed to sufficiently simplify the interface for provisioning microservice infrastructure.
How do we manage complexity?
After running our first application natively on Kubernetes, we were excited by the improvements we saw, but also surprised to find nearly the same amount of YAML configuration as in our previous stack. Upon further investigation, we realized that much of this complexity was being housed in the manifests; the Kubernetes abstractions, while generally applicable, lacked specific context on how to run typical applications at Robinhood. Taking a step back, we mapped our current microservice stack onto declarative models and defined them clearly with three key concepts:
- Archetype: An archetype defines the standardized structure for an application, from the cookie-cutters and CI jobs used for development, through the infrastructure patterns used for credential management, service registration, and more.
- Application: An application refers to a microservice in our ecosystem.
- Component: An application consists of multiple components that work cohesively to offer a service level agreement to other services in the ecosystem. Web servers, Airflow workers, and Kafka daemons are examples of components.
After conducting this exercise and defining the declarative models that came out of it, we explored means to achieve our overarching goal of abstracting away the complexity of provisioning and operating infrastructure behind these key models. We wanted a solution that would enable us to:
- Empower application developers to manage the entire lifecycle of their applications through clear, application-centric abstractions that removed the need for significant expertise in Kubernetes or aspects of Robinhood’s infrastructure.
- Enable transparent upgrades and rearchitecting of sidecars and supporting infrastructure with minimal impact (e.g., switching our service mesh should be transparent to application developers).
- Create a simple, standardized deployment process, with built-in support for ordered rollouts, canaries, and application-level health checks.
- Make replicating applications across environments easy with minimal overhead to application developers, paving the way for more sophisticated CI/CD pipelines.
- Contribute back to the Kubernetes community.
We started by surveying the wealth of amazing open source solutions that try to achieve these goals. While there were existing solutions that achieved some of these goals, none of them achieved them all. Powerful client-side templating tools such as Kustomize provided ways to simplify manifest files, but didn’t allow for new application-centric abstractions. Helm had additional powerful server-side “templating” using Charts and the Tiller, but lacked support for orchestrating updates to the generated resources and raised concerns about how the Tiller’s required privileges would mesh with a multi-tenant cluster. Jenkins X had some really interesting capabilities around scaffolding and orchestration, but we wanted Robinhood-specific customizations to be represented as first-class objects as opposed to just new commands.
Though we drew inspiration from many of the projects mentioned above, we opted to build our own platform, the Archetype Framework, to best achieve our goals.
How does it work?
There are four key components to our Archetype Framework.
1. Custom Resource Definitions (CRDs)
Kubernetes CRDs are a powerful way to extend the Kubernetes APIs, providing a way to define new API groups and resources while still being able to leverage the same API machinery and tooling (AuthN, AuthZ, admission, kubectl, SDKs) that is available to native Kubernetes resources. We created four new abstractions: Archetype, Application, Component, and VersionedArchetype (an immutable point-in-time snapshot of an Archetype). We also used Kubernetes’ codegen ability to generate Golang client libraries for these new APIs.
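To make this concrete, here is a trimmed-down sketch of what the Go types behind the Application resource could look like once wired into codegen. The field names are illustrative (loosely mirroring the example manifests later in this post) rather than our exact API surface.

// Illustrative, simplified Go types for the Application custom resource.
// Field names loosely mirror the example manifests below and are not the
// exact production API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
type Application struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ApplicationSpec   `json:"spec"`
	Status ApplicationStatus `json:"status,omitempty"`
}

type ApplicationSpec struct {
	Owners                string       `json:"owners"`
	Archetype             ArchetypeRef `json:"archetype"`
	Version               string       `json:"version"`
	ContainerImageRepo    string       `json:"containerImageRepo"`
	ComponentRolloutOrder []string     `json:"componentRolloutOrder,omitempty"`
}

// ArchetypeRef pins an Application to a specific VersionedArchetype.
type ArchetypeRef struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

type ApplicationStatus struct {
	Components int32 `json:"components,omitempty"`
	Ready      int32 `json:"ready,omitempty"`
}

Running the Kubernetes code generators over types like these produces the typed clientsets, informers, and listers that the rest of the framework builds on.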
2. Admission webhooks
Kubernetes admission webhooks provide a way to perform custom validations and mutations on API requests, prior to objects being persisted in etcd. We built a single admission webhook server consisting of multiple admission plugins that work together to validate and mutate our custom resources and the relationships between them.
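As a rough illustration, one validating plugin might reject a Component that doesn't reference a parent Application. The following is a minimal sketch using the admission/v1beta1 types; the handler shape, field names, and the exact check are hypothetical, not the production implementation.

// Minimal sketch of a validating admission plugin for Component objects.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"

	admissionv1beta1 "k8s.io/api/admission/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// componentSpec mirrors only the fields this plugin cares about.
type componentSpec struct {
	Application string `json:"application"`
	Type        string `json:"type"`
}

type component struct {
	Spec componentSpec `json:"spec"`
}

func validateComponent(w http.ResponseWriter, r *http.Request) {
	body, _ := ioutil.ReadAll(r.Body)

	var review admissionv1beta1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var comp component
	_ = json.Unmarshal(review.Request.Object.Raw, &comp)

	resp := &admissionv1beta1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	// Reject Components that do not reference a parent Application.
	if comp.Spec.Application == "" {
		resp.Allowed = false
		resp.Result = &metav1.Status{
			Message: fmt.Sprintf("component %q must reference an Application", review.Request.Name),
		}
	}

	review.Response = resp
	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}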
3. Custom controllers
Controllers are arguably the lifeblood of Kubernetes, responsible for moving the current state of the world to the desired state of the world. We built a custom controller, spinning off multiple control loops to realize the Application and Component objects using native Kubernetes resources.
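To give a feel for the flow, a skeletal sync function for a Component might look like the following. The listers come from the generated client libraries described above, while the Controller struct and the renderTemplates, applyObject, and updateComponentStatus helpers are hypothetical stand-ins for the real logic.

// Skeleton of the Component sync loop: fetch desired state, render the
// templates from the pinned VersionedArchetype, and apply the results.
package controller

import "k8s.io/client-go/tools/cache"

func (c *Controller) syncComponent(key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}

	// Fetch the Component and its parent Application from the informer caches.
	comp, err := c.componentLister.Components(namespace).Get(name)
	if err != nil {
		return err
	}
	app, err := c.applicationLister.Applications(namespace).Get(comp.Spec.Application)
	if err != nil {
		return err
	}

	// Look up the VersionedArchetype the Application is pinned to.
	vat, err := c.versionedArchetypeLister.Get(app.Spec.Archetype.Version)
	if err != nil {
		return err
	}

	// Render this component type's templates and apply each resulting object.
	objects, err := c.renderTemplates(vat, app, comp)
	if err != nil {
		return err
	}
	for _, obj := range objects {
		if err := c.applyObject(obj); err != nil {
			return err
		}
	}

	// Reflect readiness of the underlying objects back onto the Component.
	return c.updateComponentStatus(comp, objects)
}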
4. Template rendering engine
Perhaps the most important component in the Archetype Framework, the template rendering engine translates user-created Application and Component objects into Kubernetes Deployments, Network Policies, ConfigMaps, Jobs, ServiceAccounts, AWS resources, and more. Because the templates themselves are stored in the Archetype and VersionedArchetype objects in the API server, our custom control loops need no knowledge of the underlying Kubernetes objects used to realize Applications and Components; from their perspective, they simply render and apply templated objects.
Let’s look at an example of what our custom resources look like and how they come to life.
Archetypes and VersionedArchetypes are created and managed by framework administrators (application developers should not need to know how they work). These objects live in the Kubernetes API server and hold the templates that define how to realize a particular Component for an Application. Application developers can browse through the list of supported Archetypes simply using kubectl:
➜ ~ kubectl get archetypes
NAME      AGE
django    30d
golang    30d
generic   30d
An Archetype looks something like this:
apiVersion: apps.robinhood.com/v1alpha1
kind: Archetype
metadata:
  name: django
spec:
  currentVersion: django-0.1.1
  description: Robinhood's Django stack
  owner: platform@robinhood.com
While most of the Archetype fields are metadata, it also references a VersionedArchetype, where the actual templates are stored. Just as before, users (mostly framework administrators) can discover all the available VersionedArchetypes using kubectl:
➜ ~ kubectl get vat
NAME           AGE
django-0.1.0   30d
django-0.1.1   17d
django-0.1.2   10d
The VersionedArchetypes help us to roll out changes to Archetypes gradually. We can roll out a few Applications on the new version, before making that version the default version for the Archetype. A sample VersionedArchetype looks something like the following:
kind: VersionedArchetype
apiVersion: apps.robinhood.com/v1alpha1
metadata:
  name: django-0.1.2
spec:
  componentTypes:
  - name: server
    templates:
    - name: serviceaccount
      kind: ServiceAccount
      apiGroup: v1
      template: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: [[ .Application.Name ]]-[[ .Component.Spec.Type ]]
          namespace: [[ .Component.Namespace ]]
    - name: deployment
      kind: Deployment
      apiGroup: apps/v1
      template: |
        ...
  - name: daemon
    templates:
    - ...
  ...
These templates can contain any objects that can be applied to the API Server. The template engine is designed to work with Golang templates by default, but is extensible with other templating engines like Helm and Kustomize.
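For illustration, rendering one of these templates with Go's text/template and the [[ ]] delimiters shown above could look roughly like the following sketch; the renderContext shape and the v1alpha1 import path are assumptions, and the result is decoded into an unstructured object so it can be applied without compiling in its concrete type.

// Sketch of rendering an Archetype template into an unstructured Kubernetes object.
package render

import (
	"bytes"
	"text/template"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"

	v1alpha1 "github.com/example/archetype-framework/pkg/apis/apps/v1alpha1" // hypothetical module path
)

// renderContext is the data exposed to templates, matching references like
// [[ .Application.Name ]] and [[ .Component.Spec.Type ]].
type renderContext struct {
	Application *v1alpha1.Application
	Component   *v1alpha1.Component
}

func renderTemplate(raw string, ctx renderContext) (*unstructured.Unstructured, error) {
	// Use [[ ]] delimiters so the manifests stay friendly to other tooling
	// that already treats {{ }} specially.
	tmpl, err := template.New("object").Delims("[[", "]]").Parse(raw)
	if err != nil {
		return nil, err
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, ctx); err != nil {
		return nil, err
	}

	// Decode the rendered YAML into an unstructured object that the control
	// loops can apply without knowing its concrete Go type.
	obj := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(buf.Bytes(), obj); err != nil {
		return nil, err
	}
	return obj, nil
}

Because the output is just an unstructured object, swapping in a different templating engine only changes how the raw manifest text is produced, not how the result is applied.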
Once an Archetype and a VersionedArchetype object exist, application developers can start onboarding their microservices to the framework by creating Application and Component objects. These look somewhat like the following:
kind: Application
apiVersion: apps.robinhood.com/v1alpha1
metadata:
  name: myapp
  namespace: myapp
spec:
  owners: myapp@robinhood.com
  archetype:
    name: django
    version: django-0.1.2
  version: 1.2.3 # This is the application version
  containerImageRepo: amazon.ecr.url/myapp
  componentRolloutOrder: # This defines the order of rolling out new app versions
  - canary
  - '...' # Wild card indicating all remaining Components can be deployed after the canary is deployed and passing health checks
  alertConfig:
    slackNotify: "myapp-slack"
    opsgenieNotify: "myapp-pager"
---
kind: Component
apiVersion: apps.robinhood.com/v1alpha1
metadata:
  name: api-server
  namespace: myapp
spec:
  application: myapp
  type: server
  serverConfig:
    allowedHosts:
    - ...
  autoscalingPolicy:
    minReplicas: 120
    maxReplicas: 200
    targetCPUUtilizationPercentage: 60
    schedules:
    - name: "market-open"
      schedule: "00 12 * * 1,2,3,4,5"
      minReplicas: 120
    - name: "market-close"
      schedule: "05 23 * * 1,2,3,4,5"
      minReplicas: 20
The Archetype, Application, and Component objects come together to translate our custom application-centric abstractions into a set of native Kubernetes objects, realized through our admission webhooks, custom controllers, and template rendering engine. Application developers can interact with our custom objects through kubectl (and soon a UI), or they can look under the hood (pun intended) to see the native Kubernetes objects created on their behalf.
➜ ~ kubectl get apps -n myapp
NAME    COMPONENTS   READY
myapp   1            1

➜ ~ kubectl get components -n myapp
NAME         READY
api-server   True

# Looking under the hood
➜ ~ kubectl get deployments -n myapp
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
api-server   200       200       200          200         3d

➜ ~ kubectl get hpa -n myapp
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
api-server   deployment/api-server   6%/60%    1         5         1          3d

➜ ~ kubectl get pods -n myapp -l apps.robinhood.com/component-type=server
NAME                                       READY   STATUS      RESTARTS   AGE
api-server-5749655f95-58tdt                8/8     Running     0          3d
api-server-market-close-1570835100-txkb8   0/2     Completed   0          3d
...
Transitioning to this mechanism has been incredibly valuable to our team. We've abstracted away platform- and infrastructure-level complexities behind a streamlined, application-centric interface focused specifically on the Applications and Components developers are working with. This simplicity improves developer velocity and ownership, enabling application developers to manage their infrastructure without needing to become experts in Kubernetes or in every other piece of Robinhood's infrastructure.
Here’s a diagram that summarizes how the various parts of the Archetype Framework work together:
What’s next?
The Archetype Framework currently powers about ten applications at Robinhood. We’ve only scratched the surface of what we’re hoping to achieve and the impact we believe this can have for application developers. There are two major ways we’re hoping to make the Archetype Framework more useful: broadening its surface area and building tools on top of it.
Broadening the surface area
So far, we’ve only created two primary archetypes (Django and Golang) for powering our two most popular application stacks. While this enables most of our applications to onboard onto the framework, we still have work to do to enable all Robinhood applications to come on board. Furthermore, the scope of the Archetype Framework is currently focused on infrastructure that runs application code, but we want to create similar application-centric abstractions for other pieces of infrastructure such as load balancer layers, message brokers, databases, and caches. We want to empower our application developers to manage all the infrastructure they need through the same declarative interface.
Higher-level tooling
While we’ve seen significant improvement through the Archetype Framework by creating application-centric abstractions, we hope to further simplify the management of our Applications and Components by building higher-level tooling. One notable project is to provide an intuitive, user-friendly frontend for executing GitOps workflows on Applications and Component manifests. We also hope to go even further and create a one-touch infrastructure-provisioning interface to abstract away manifests altogether, and place application-centric abstractions even more front-and-center for application developers.
If you’re interested in helping us continue this journey, consider joining us at Robinhood! We’ll also be at KubeCon in San Diego later this month and look forward to connecting with many of you there.