Migrating to GKE: Preemptible nodes and making space for the Chaos Monkeys
As the story goes for many startups, we built what we needed in a garage (in our case it was literally a barn), but as our customer base grew and our use cases evolved, so did the stress on our legacy tasking infrastructure.
Last year, Expel’s Site Reliability Engineering (SRE) team and Device Detections and Tasking (DDT) team set out to migrate and update our device tasking infrastructure from our legacy environment of statically provisioned virtual machines (VMs) to a more dynamically scalable, reliable deployment within Google Cloud Platform (GCP).
We had three goals:
- High Availability (HA) – The infrastructure needs to have minimal downtime. Lag time can result in possible risk, so we get paged when tasking interruptions last for more than five minutes.
- Automatic Scaling – As a quickly growing company, manual scaling is considered toil. We need tasking infrastructure to scale up and down automatically.
- Costs – Reduce costs by 50 percent. We arrived at this percentage through back-of-the-napkin math on how much we thought the old system was overprovisioned.
But, at Expel, we never look at projects through the lens of siloed teams.
We have an overriding ethos around efficiency. Building software at a SaaS company requires constantly evaluating efficiency from multiple perspectives.
When we completed this project, we found that our analysts are more efficient (they aren’t waiting around for long periods for tasks to complete) and that we had cut costs.
In this blog post, I’m going to share our thought process and how our integrated teams migrated Expel’s workloads to the cloud. With no downtime. (Hi, I’m Luke Jolly, Principal Site Reliability Engineer at Expel.)
If you’re interested in learning new ways to task smarter, not harder, then keep reading.
A little bit about Expel
Before I jump into the details of this project, I want to give you some background.
We built a SaaS platform to enable our security analysts to perform 24x7 managed detection and response (MDR). One of the core components of our platform is signal ingesting and processing. This signal consists of events (think audit logs) and alerts produced by technology we integrate with, such as security vendors, IaaS providers, and SaaS applications.
Each month Expel ingests tens of billions of alerts and events on behalf of our customers. The ingestion is performed by what we call our “tasking infrastructure.”
At its core, the tasking infrastructure is a scheduling and asynchronous job processing stack that handles both periodic and ad-hoc tasks. The tasks we execute range from polling a vendor’s API for the latest batch of alerts to fetching an artifact from an endpoint device for forensic analysis.
Now that you understand how we operate, and the problems we needed to solve, let’s move into…
Expel’s backend is divided into various microservices.
We use Docker containers to package our services and the de-facto container orchestration platform, Kubernetes, to manage and run them.
While knowing how to build and manage a production-ready Kubernetes cluster from scratch is a great skill to have, managing a highly available Kubernetes control plane comes at a significant engineering time cost. For most companies, the cost of managing a Kubernetes control plane outweighs the benefits of flexibility and customization.
Enter the first piece of the puzzle: Google Kubernetes Engine (GKE), Google’s managed Kubernetes offering.
Using Terraform, an infrastructure as code software tool, to orchestrate GKE, we reduced the overhead of managing Kubernetes setup and deployment drastically.
With initial setup and configuration solved, we were free to use several features of vanilla Kubernetes and GKE to solve our three goals.
Kubernetes gives us most of the infrastructure tools needed to run our tasking services in an HA way.
The health of the services is monitored by liveness and readiness probes. Service discovery and communication between tasking pods are handled by Kubernetes services. Kubernetes handles node (or VM) failure by migrating pods to healthy nodes.
As you can see, Kubernetes gets us pretty close to our HA goal. The last few steps, however, are taken specifically with GKE.
While Kubernetes has the features needed to deal with the following availability issues, the amount of code glue and setup needed to leverage these features is significant. To overcome both Kubernetes control plane outage and datacenter zone outage, we used GKE regional clusters.
A GKE regional cluster distributes Kubernetes control planes and nodes across three different zones. In this kind of cluster, a zonal outage is handled the same as a single node outage. Any pods on bad nodes in that zone are moved to healthy nodes.
There is one key difference between a single node outage and a zone outage: capacity.
Most clusters would have enough capacity to handle losing a node, but what about losing ⅔ of the nodes? No one wants to keep their cluster scaled to 3x what they need under normal operation.
Enter the GKE cluster autoscaler.
If pods enter an unschedulable state due to capacity starvation, the autoscaler will look at the available node pools and attempt to scale one up. If that scale up times out because that pool happens to be in a bad zone, it’ll continue to try pools until it successfully scales one up that is in a good zone and all pods are scheduled.
The combination of all these availability features turns what would have been a catastrophic loss of ⅔ of infrastructure resulting in significant downtime and requiring manual intervention into an automatically recovered degraded state that only lasts minutes.
If the stars align and we’ve instrumented our alerts correctly, the traditional SRE blessing will ring true:
Achieving the scaling goal starts off with where solving availability left off: the autoscaler.
What if a significant global security event takes place, and we need to scale the task pool from 1000 to 2000 pods?
The autoscaler gotchu. Just as with a node outage, this will lead to capacity starvation, which will trigger the autoscaler to scale node pools until all pods are scheduled.
In a regional cluster, the autoscaler will attempt to spread the new nodes evenly across all zones. Thanks to our software engineers writing robust, horizontally scalable software, making the infrastructure scalable boils down to turning on the autoscaler flag for our GKE clusters.
A long time ago there was an engineer who may or may not be writing this blog post who responded to the toil of having to manually scale infrastructure at a rapidly growing company by saying to a coworker: “Well, scaling problems are the best problems to have.”
To this I now respond:
“If a human operator needs to touch your system during normal operations, you have a bug.”
— Carla Geisser, Google SRE
At a fast growing company, “scaling” is a normal operation that should be automated.
While we’ve talked about the many steps we took to improve the reliability and scalability of the tasking system, we haven’t addressed costs yet.
To reach our cost reduction goal of at least 50 percent, we have several more tricks up our sleeve.
The autoscaler has yet another feature: scaling down.
At a set interval, the autoscaler checks if any node has less than 50 percent utilization for 10 minutes. If it does and there is room for all the node’s pods on other nodes, it’ll migrate the pods and scale down the node.
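The scale-down check described above can be sketched in a few lines of Python. This is a simplified model and not the autoscaler’s actual code: it only looks at CPU requests, the `Node` class and `can_scale_down` function are hypothetical names, and the first-fit placement is just one plausible way to answer “is there room for all the node’s pods on other nodes?”

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.5   # "less than 50 percent utilization"
UNDERUSED_MINUTES = 10        # "for 10 minutes"


@dataclass
class Node:
    name: str
    cpu_capacity: float        # schedulable CPU on the node
    pod_requests: list         # CPU request of each pod on the node
    minutes_underused: int = 0

    @property
    def utilization(self) -> float:
        return sum(self.pod_requests) / self.cpu_capacity

    @property
    def free_cpu(self) -> float:
        return self.cpu_capacity - sum(self.pod_requests)


def can_scale_down(node: Node, others: list) -> bool:
    """True if the node is underused long enough and its pods fit elsewhere."""
    if node.utilization >= UTILIZATION_THRESHOLD:
        return False
    if node.minutes_underused < UNDERUSED_MINUTES:
        return False
    # First-fit, largest pods first: try to place every pod on another node.
    free = [other.free_cpu for other in others]
    for request in sorted(node.pod_requests, reverse=True):
        for i, available in enumerate(free):
            if request <= available:
                free[i] -= request
                break
        else:
            return False  # this pod has nowhere to go
    return True
```

Note that utilization here is derived from pod requests rather than measured usage, which is why the requests themselves matter so much.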
Pods need to have the correct resource requests set if you want to use node scale down. If the requests are set too high, each node will be under-utilized and the autoscaler will spin up more nodes than you actually need.
Tuning this by hand, however, is very toilsome. Instead, we use the vertical-pod-autoscaler (VPA).
This service monitors pods’ CPU and memory usage and adjusts a pod’s requests and limits depending on how it is configured. Because our tasks have widely different memory requirements, with outliers requiring several standard deviations more than the mean, we opted to define a high static memory limit and set the VPA to RequestsOnly mode:
- containerName: task-executor
  controlledValues: RequestsOnly
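To see why request sizing matters for the node count, here is a small back-of-the-napkin calculation in Python. The node size (4 CPUs) and the two request values are made-up numbers for illustration, not our real settings, and `nodes_needed` is a hypothetical helper.

```python
import math


def nodes_needed(pod_count: int, cpu_request: float, node_cpu: float) -> int:
    """How many nodes it takes to schedule pod_count pods by CPU request alone."""
    pods_per_node = math.floor(node_cpu / cpu_request)
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / pods_per_node)


# Hypothetical numbers: 1000 pods on 4-CPU nodes.
right_sized = nodes_needed(1000, cpu_request=0.5, node_cpu=4.0)
over_requested = nodes_needed(1000, cpu_request=1.0, node_cpu=4.0)
print(right_sized, over_requested)
```

Under these assumptions, doubling the request doubles the number of nodes, even though the pods use exactly the same amount of CPU.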
While all this assures we’re not overscaled, it doesn’t help reduce the cost of the many nodes needed to run the tasking services.
In our case, the first step we took was to take advantage of committed use discounts for the portion of our tasking infrastructure that we knew we would always need and is of a more static nature. This, however, won’t cover new instances as we grow or cover extra instances during bursts. To make it even worse, the autoscaling nature of our system leads to churn of our nodes – meaning we don’t even get sustained use discounts.
That discount only starts applying once a node has been running for at least a week. One option is to use E2 machine types. These VMs don’t have sustained use discounts, but their base cost is almost identical to the cost of an N1 with the sustained use discount.
Anyone who looks at GCP’s compute pricing page will notice another column which is in fact the cheapest of all: “preemptible.”
Reaching for the sun, we decided to use these so-called preemptible instances to make our more financially savvy, checkbook-managing co-workers proud.
Accepting the Chaos Monkey
If preemptible nodes are by far the cheapest, why doesn’t everyone use them for everything? To put it simply, they are by design unreliable and unguaranteed. They are excess Compute Engine capacity, so their availability varies with zone and datacenter usage.
Their main caveats are as follows:
- Instances can be “preempted” (shut down/terminated) at any time and will always be preempted after 24 hours of uptime.
- They have no availability guarantees. I have personally seen a preemptible node shortage in a specific Google region that lasted half of a day.
- During preemption, nodes only have 30 seconds of notice before termination.
- Preemption causes Kubernetes nodes to terminate pods ignoring the configured Pod grace period and pod disruption budgets.
- There is a possibility of a preemption rush. This is when a high percentage of a cluster’s preemptible nodes are terminated at once. This can happen if they were all created at the same time or if Google runs out of preemptible nodes.
When I first read through the caveats of preemptible nodes half a decade ago, the old school systems administrator in me recoiled in fear and disgust.
On re-evaluation, however, I realized that if we were designing our systems and software correctly for an automatically scaled Kubernetes environment, then they should already be able to handle nodes regularly being removed or preempted.
In essence, using preemptible nodes was akin to using Chaos Monkey, a program developed at Netflix that randomly injects failures into their system such as node termination.
The reasoning behind this comes from a relatively new discipline referred to as chaos engineering.
A concept I learned early in my career is the idea that an untested backup system is almost worse than no backup system at all. This is because it gives a false sense of security. There are innumerable stories of businesses trying to recover after catastrophic data loss only to find their backup system was flawed or unreliable.
Chaos engineering takes this concept and generalizes it to complex production systems. Making assumptions about a system without testing, such as how reliable it is in various failure states, is folly. As the ole saying goes:
“Checketh Thyself Lest Ye Wrecketh Thyself”
Preparing for Chaos
Armed with this new view of chaos, we dove into how to prepare our software and systems to reliably run on this unreliable infrastructure.
We inspected the service running on preemptible nodes, the task-executor. This service was the worker part of the system that accounted for over 90 percent of the tasking pods. Here are some important characteristics of the task-executor workload:
- It’s stateless.
- When a task-executor process exits, gracefully or ungracefully, the scheduler process sees the closed gRPC connection and marks its current task as errored so that it gets retried.
- The tasking system was designed to handle horizontally scaling the task-executors.
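The retry behavior in the second bullet can be sketched roughly as follows. This is a toy, in-memory model, not our actual scheduler: the `Scheduler` class and its method names are hypothetical, and a real implementation would react to gRPC connection events rather than direct method calls.

```python
from collections import deque


class Scheduler:
    """Toy scheduler: a task whose executor disappears goes back on the queue."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.in_flight = {}   # executor id -> task currently assigned to it
        self.done = []

    def assign(self, executor_id):
        task = self.pending.popleft()
        self.in_flight[executor_id] = task
        return task

    def on_complete(self, executor_id):
        self.done.append(self.in_flight.pop(executor_id))

    def on_connection_closed(self, executor_id):
        # The executor exited (or its node was preempted) mid-task:
        # mark the task as errored by re-queueing it for another executor.
        task = self.in_flight.pop(executor_id, None)
        if task is not None:
            self.pending.append(task)
```

Because the executor holds no state of its own, losing one only costs the time already spent on its current task.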
Given these features, this workload is a perfect candidate for preemptible instances, but there were still some problems we needed to solve.
The first was that GCE node termination events (preemptions) did not translate into the standard graceful shutdown process of a Kubernetes node.
This process consists of:
- Tainting the node to prevent new pods from being scheduled on it.
- Draining pods from the node, starting with pods outside the kube-system namespace (logging agents and such in kube-system need more time to drain), followed by the kube-system pods.
Luckily, there’s already a service designed to do this: k8s-node-termination-handler.
We configured it to give non-kube-system pods 14 seconds to gracefully exit before forcefully deleting them. Not only does this give the existing task-executor process time to attempt to finish its current task, it also gives Kubernetes a head start on spinning up the replacement pods.
This causes less task churn.
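Here is a rough Python sketch of the shutdown ordering described above, using the 30-second preemption notice and the 14-second grace period from this post. It only builds a plan; it doesn’t talk to Kubernetes. The `plan_shutdown` function and the `tasking` namespace are hypothetical, and treating the leftover 16 seconds as the budget for kube-system pods is plain subtraction on my part, not a documented setting of the handler.

```python
NOTICE_SECONDS = 30             # preemption notice from the caveats above
NON_SYSTEM_GRACE_SECONDS = 14   # what we give non-kube-system pods


def plan_shutdown(pods, notice=NOTICE_SECONDS, grace=NON_SYSTEM_GRACE_SECONDS):
    """pods is a list of (name, namespace); returns ordered (action, target, seconds)."""
    # Step 1: taint the node so nothing new gets scheduled onto it.
    steps = [("taint", "node", 0)]
    regular = [name for name, ns in pods if ns != "kube-system"]
    system = [name for name, ns in pods if ns == "kube-system"]
    # Step 2: workload pods go first, with the configured grace period.
    for name in regular:
        steps.append(("evict", name, grace))
    # Step 3: kube-system pods (logging agents and such) get what is left.
    for name in system:
        steps.append(("evict", name, notice - grace))
    return steps
```

For example, a node running a logging agent and one task-executor would taint first, evict the task-executor, and only then evict the agent.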
The other issue with preemptible nodes that must be solved is the lack of availability guarantees.
We surmounted this by creating two node pools that the task-executor pods’ nodeSelector matched: one preemptible and one not.
This simple solution was made possible by, yet again, the autoscaler. The autoscaler considers cost when choosing what node pool to scale. As a result, it’ll always try to scale the preemptible pool first before falling back to the normal pool if there is a preemptible instance shortage in that region/zone.
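The fallback behavior can be modeled with a short Python sketch. It is a simplification of what the autoscaler does, under stated assumptions: the `choose_pool` function, the pool names, the `relative_cost` values, and the `obtainable_nodes` field (standing in for “can GCP actually hand us this many instances right now?”) are all hypothetical.

```python
def choose_pool(pools, nodes_needed):
    """Try matching pools from cheapest to most expensive; return (pool name, pools tried)."""
    tried = []
    for pool in sorted(pools, key=lambda p: p["relative_cost"]):
        tried.append(pool["name"])
        if pool["obtainable_nodes"] >= nodes_needed:
            return pool["name"], tried
    return None, tried


# Made-up cost ratios; only the ordering matters for this sketch.
pools = [
    {"name": "standard", "relative_cost": 1.0, "obtainable_nodes": 100},
    {"name": "preemptible", "relative_cost": 0.3, "obtainable_nodes": 0},  # shortage
]
print(choose_pool(pools, nodes_needed=5))
```

With the preemptible pool in shortage, the sketch tries it first and then lands on the standard pool, which is the behavior we rely on.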
We also found that after the non-preemptible fallback pool gets scaled up because of preemptible shortage, the autoscaler will eventually scale these back down.
This behavior is promoted by two things. First, the node pools are dedicated to a single large deployment. Second, the deployment is set to be able to surge to 25 percent. This results in a significant number of new nodes being spun up on each rolling deploy. The autoscaler will then quickly follow behind and clean up the old nodes that have low resource utilization, eventually including the fallback pool.
As with any project, there’s always room for improvement. There are a few improvements we didn’t make (at least not at this time) for various reasons. For example, we didn’t address preemption rushing. There’s a service available designed to help mitigate this issue: estafette-gke-preemptible-killer.
We opted not to use this service because of three outstanding issues.
- It does not drain the pods in a way that respects pod disruption budgets (corresponding GitHub issue also filed by David Quarles, one of our Senior SREs).
- It works by assigning an expiration time with a random jitter, which can feasibly, though it is unlikely, result in a deletion rush (same GitHub issue).
- It’s using a no longer maintained Kubernetes Go client. But there are some signs of movement.
Another caveat which did not need to be addressed in our case was scaling back down of the fallback non-preemptible node pool. In a non-dedicated node pool environment, there might not be enough node churn to cause the fallback pool to be scaled back down.
Estafette also wrote a service to manage this called estafette-gke-node-pool-shifter. It unfortunately has the same issues as the gke-preemptible-killer.
The task-executors could be scaled horizontally and automatically using the Horizontal Pod Autoscaler (HPA). When used in conjunction with VPA, HPA should be driven by a custom metric and not CPU or memory utilization.
Eventually, we’ll implement this using the size of the task backlog to drive the scaling.
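As a sketch of what backlog-driven scaling could look like, the Python below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), to a backlog-per-pod metric. We haven’t built this yet, so the function name, the target of 10 backlog tasks per pod, and the replica bounds are all assumptions; the sketch also leaves out the HPA’s tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas, backlog_size, target_backlog_per_pod,
                     min_replicas=1, max_replicas=2000):
    """HPA-style calculation using backlog per pod as the custom metric."""
    current_per_pod = backlog_size / current_replicas
    desired = math.ceil(current_replicas * current_per_pod / target_backlog_per_pod)
    # Clamp to the configured bounds, as the HPA does.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical burst: 1000 executors, 20,000 queued tasks, target of 10 per pod.
print(desired_replicas(1000, 20000, 10))
```

The node autoscaling described earlier would then take care of finding room for the extra pods.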
GKE has a very powerful feature that allows for automatic node pool creation. Node auto provisioning removes the need to create special preemptible node pools for our task-executors.
It would see that there is no node pool that matches the specified tolerations and nodeSelector and create it automatically. It also leads to perfectly sized nodes which reduces overhead and costs. Unfortunately the Terraform Google provider does not support enabling this feature in a way that we’re comfortable with. It doesn’t support enabling node auto repair for the node pools it creates.
You can track progress on the GitHub issue that David Quarles filed after discovering this limitation. Also note that node auto-provisioning isn’t available for E2 until GKE 1.19.7-gke.800.
You can’t do it alone
We migrated the core backend of Expel across cloud providers from statically provisioned CentOS VMs to a dynamic GKE cluster with minimal disruption and no downtime, thanks to the hard efforts and collaboration between the SRE and DDT teams.
This project is a prime example of the importance of removing barriers between operational teams and development teams.
Through careful coordination, we used this migration to increase both the stability and the scalability of the tasking system. Overall, the new tasking infrastructure costs nearly 70 percent less in compute costs than the old, decreasing Expel’s total recurrent compute costs. It also laid the groundwork for being able to migrate other appropriate workloads to preemptible instances for even more future savings.
We strongly believe in the importance of building a community and sharing knowledge. It’s why we build feedback loops into every project we develop and it’s why we know that there’s always opportunity to find new efficiencies.
And we also believe that this community extends beyond our fellow Expletives. When we find new ways to increase productivity, we share them with all of you.
So, if you’re using Kubernetes as a core platform, I invite you to check out the Expel Workbench™ for Engineers.
I’m excited about Expel Workbench for Engineers because it gives my fellow engineers the same security I get by working at Expel (meaning that I don’t have to worry about security while I’m in the middle of building the things).
And with that I will leave you with one more quote:
“If at first you don’t succeed, git commit -a --amend --no-edit && git push -f”
— Luke Jolly