5 best practices to get to production readiness with HashiCorp Vault in Kubernetes
As our business and engineering organization has grown, so has our core engineering platform’s reliance on HashiCorp Vault to secure sensitive data. So has the need for a highly available Vault that guarantees the continuity of our 24x7 managed detection and response (MDR) service.
We also found that as our feature teams advanced on their Kubernetes adoption journey, we needed to introduce more Kubernetes idiomatic secret-management workflows that would enable teams to self-service their secret needs for containerized apps.
That meant we needed to increase our Vault infrastructure’s resilience and deployment efficiency, and unlock opportunities for new secret-access and encryption workflows.
So, we set out to migrate our statically provisioned, VM-based Vault to Google Kubernetes Engine (GKE).
We knew the key to success was following security best practices in order to incorporate HashiCorp Vault into our trusted compute base.
There are a variety of documented practices online for running Vault in Kubernetes. But some of them aren’t up to date with the Kubernetes-specific features added in newer versions of Vault, or fail to describe the path to take Vault securely to production readiness.
That’s why I created a list of architectural and technical recommendations for Expel’s site reliability engineering (SRE) team. And I’d like to share these recommendations with you. (Hi, I’m David and I’m a senior SRE here at Expel.)
After reading this post, you’ll be armed with some best practices that’ll help you to reliably and securely deploy, run and configure a Vault server in Kubernetes.
What is HashiCorp Vault?
Before we dive into best practices, let’s cover the basics.
HashiCorp Vault is a feature-rich security tool that enables security-centric workflows for applications. It provides secret management for both humans and applications, authentication federation with third-party APIs (e.g., Kubernetes), generation of dynamic credentials to access infrastructure (e.g., a PostgreSQL database), secure introduction (for zero trust infrastructure) and encryption-as-a-service.
All of these are guided by the security tenet that all access to privileged resources should be short-lived.
As you read this post, it’s also important to keep in mind that a Kubernetes cluster is a highly dynamic environment.
Application pods are often shuffled around based on system load, workload priority and resource availability. This elasticity should be taken into account when deploying Vault to Kubernetes in order to maximize the availability of the Vault service and reduce the chances of disruption during Kubernetes rebalancing operations.
Now on to the best practices.
Initialize and bootstrap a Vault server
To get a Vault server operational and ready for configuration, it must first be initialized, unsealed and bootstrapped with enough access policies for admins to start managing the vault.
When initializing a Vault server, two critical secrets are produced: the “unseal keys” and the “root token.”
These two secrets must be securely kept somewhere else – by the person or process that performs the vault initialization.
A recommended pattern for performing this initialization process and any subsequent configuration steps is to use an application sidecar. Using a sidecar to initialize the vault, we secured the unseal keys and root token in the Google Secret Manager as soon as they were produced, without requiring human interaction. This prevents the secrets from being printed to standard output.
The bootstrapping sidecar application can be as simple as a Bash script or a more elaborate program depending on the degree of automation desired.
In our case, we wanted the bootstrapping sidecar to not only initialize the vault, but to also configure access policies for the provisioner and admin personas, as well as issue a token with the “provisioner” policy and secure it in the Google Secret Manager.
Later, we used this “provisioner” token in our CI workflow in order to manage Vault’s authentication and secret backends using Terraform and Atlantis.
We chose Go for implementing our sidecar because it has idiomatic libraries to interface with Google Cloud Platform (GCP) APIs, and because Vault itself is written in Go, which makes it easy to reuse the client library that ships with it.
Pro tip: Vault policies govern the level of access for authenticated clients. A common scenario, documented in Vault’s policy guide, is to model the initial set of policies after an admin persona and a provisioner persona. The admin persona represents the team that operates the vault for other teams or an org, and the provisioner persona represents an automated process that configures the vault for tenant access.
Considering the workload rebalancing that often happens in a Kubernetes cluster, we can expect the sidecar and vault server containers to restart suddenly. That’s why it’s important to ensure the sidecar can be stopped gracefully and can accurately determine the health of the server before proceeding with any configuration. It should also produce log entries that give the admins an initial diagnosis of the vault’s status.
By automating this process, we also made it easier to consistently deploy vaults in multiple environments, or to easily create a new vault and migrate snapshotted data in a disaster recovery scenario.
Run Vault in isolation
We deploy Vault in a cluster dedicated to services offered by our core engineering platform, and fully isolated from all tenant workloads.
We use separation of concerns as a guiding principle in order to guarantee the principle of least privilege when granting access to infrastructure.
We recommend running the Vault pods on a dedicated nodepool to have finer control over their upgrade cycle and to enable additional security controls on the nodes.
As is common practice when implementing high availability for applications in Kubernetes, pod anti-affinity rules should be used to ensure no more than one Vault pod is allocated to the same node. This isolates each vault server from node failures and node rebalancing activities. Spreading the pods across zones as well protects them from zonal failures.
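As a minimal sketch of these scheduling rules, the pod template fragment below pins Vault to a dedicated nodepool and forbids two Vault pods on the same node. The pool name vault-pool and the app.kubernetes.io/name: vault label are assumptions for illustration, not values from our setup.

```yaml
# Fragment of a StatefulSet pod template spec (spec.template.spec).
# "vault-pool" and the "app.kubernetes.io/name: vault" label are hypothetical.
nodeSelector:
  cloud.google.com/gke-nodepool: vault-pool   # GKE sets this label on every node
affinity:
  podAntiAffinity:
    # Hard rule: never schedule two Vault pods onto the same node.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: vault
        topologyKey: kubernetes.io/hostname
```

Using topology.kubernetes.io/zone as the topologyKey of an additional rule would spread the pods across zones in the same way.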
Implement end-to-end encryption
Even for non-production vaults you should use end-to-end TLS. When exposing a vault server through a load balanced address using a Kubernetes Ingress, make sure the underlying Ingress controller supports TLS passthrough traffic to terminate TLS encryption at the pods, and not anywhere in between.
Enabling TLS passthrough is the equivalent of performing transmission control protocol (TCP) load balancing to the Vault pods. Also, enable forced redirection from HTTP to HTTPS.
When using kubernetes/ingress-nginx as the Ingress controller, you can configure TLS passthrough by setting the Ingress annotation nginx.ingress.kubernetes.io/ssl-passthrough to "true". Note that the controller itself must be started with the --enable-ssl-passthrough flag for the annotation to take effect.
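A minimal sketch of such an Ingress, assuming a hypothetical hostname (vault.example.com), a Service named vault listening on port 8200, a vault namespace, and an IngressClass named nginx:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vault              # hypothetical name
  namespace: vault         # hypothetical namespace
  annotations:
    # Hand the encrypted TCP stream to the pods; TLS terminates in Vault itself.
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  ingressClassName: nginx  # assumed IngressClass name
  rules:
    - host: vault.example.com   # hypothetical host, matched via TLS SNI
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vault
                port:
                  number: 8200
```

Because passthrough operates at the TCP level, the certificate presented to clients is the one configured in Vault’s own listener, not one held by the Ingress controller.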
Ensure traffic is routed to the active server
In its simplest deployment architecture, Vault runs with an active server and a couple of hot standbys that frequently check the storage backend for changes to the write lock.
A common challenge when dealing with active-standby deployments in Kubernetes is ensuring that traffic is only routed to the active pod. A couple of common approaches are to either use readiness probes to determine the active pod or to use an Ingress controller that supports upstream health checking. Both approaches come with their own trade-offs.
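To illustrate the readiness probe approach, here is a sketch that relies on Vault’s /v1/sys/health endpoint, which by default returns 200 only on the active, unsealed server and 429 on standbys. The port and timing values are assumptions.

```yaml
# Fragment of the Vault container spec.
readinessProbe:
  httpGet:
    path: /v1/sys/health   # 200 = active; 429 = standby; 501/503 = uninitialized/sealed
    port: 8200             # assumed Vault listener port
    scheme: HTTPS
  periodSeconds: 5         # illustrative timing values
  failureThreshold: 2
```

The trade-off is that healthy standbys are permanently reported as not ready, which can get in the way of rollouts and monitoring.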
Luckily, as of Vault 1.4.0, we can use the service_registration stanza to allow Vault to “register” within Kubernetes and update the pods’ labels with the active status. This ensures traffic to the vault’s Kubernetes service is only routed to the active pod.
Make sure you create a Kubernetes RoleBinding for the Vault service account that binds to a Role with permissions to patch pods in the vault namespace. The vault’s namespace and pod name must be passed to the Vault container as environment variables using the Downward API.
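The RBAC objects could be sketched as follows. The names vault and vault-service-registration are hypothetical, and the get and update verbs are an assumption added alongside patch.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vault-service-registration   # hypothetical name
  namespace: vault                   # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    # "patch" lets Vault update its own pod labels; "get" and "update" are assumed extras.
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vault-service-registration
  namespace: vault
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: vault-service-registration
subjects:
  - kind: ServiceAccount
    name: vault                      # hypothetical service account used by the Vault pods
    namespace: vault
```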
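Enable service registration in the vault .hcl configuration file by adding a service_registration "kubernetes" {} stanza.
Then set the VAULT_K8S_POD_NAME and VAULT_K8S_NAMESPACE environment variables on the Vault container with the current pod name and namespace.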
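A minimal sketch of the Downward API wiring, shown as a fragment of the Vault container spec:

```yaml
# Fragment of the Vault container spec.
env:
  - name: VAULT_K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name        # name of the pod this container runs in
  - name: VAULT_K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace   # namespace of that pod
```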
With the configuration above, the Kubernetes service only needs to select the pods carrying the vault-active: "true" label that Vault maintains.
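As a sketch, a Service that routes only to the active server might look like this, assuming a hypothetical app.kubernetes.io/name: vault pod label, a vault namespace and port 8200:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vault                # hypothetical name
  namespace: vault           # hypothetical namespace
spec:
  selector:
    app.kubernetes.io/name: vault   # hypothetical pod label
    vault-active: "true"            # label set by Vault's service registration
  ports:
    - name: https
      port: 8200
      targetPort: 8200
```

When the active server changes, Vault relabels the pods and the Service endpoints follow without any probe involvement.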
Configure and manage Vault for tenants with Terraform
Deploying, initializing, bootstrapping and routing traffic to the active server are only the first steps toward operationalizing a vault in production.
Once a HashiCorp Vault server is ready to accept traffic and there is a token with “provisioner” permissions, you’re ready to start configuring the vault authentication methods and secrets engines for tenant applications.
Depending on the environment needs, this type of configuration can be done using the Terraform provider for Vault or using a Kubernetes Operator.
Using an operator allows you to use YAML manifests to configure Vault and keep their state in sync thanks to the operator’s reconciliation loop.
Using an operator, however, comes at the cost of complexity. This can be hard to justify when the intention is to only use the operator to handle configuration management.
That’s why we opted for using the Terraform provider to manage our vault configuration. Using Terraform also gives us a place to centralize and manage other supporting configurations for the authentication methods.
A couple of examples of this are configuring the Kubernetes service account required to enable authentication delegation to a cluster’s API server, and enabling authentication for the vault admins using their GCP service account credentials.
When using the Kubernetes authentication backend for applications running in a Kubernetes cluster, each application can authenticate to Vault by providing a Kubernetes service account token (a JWT) that the Vault server uses to validate the caller’s identity. It does this by invoking the Kubernetes TokenReview API on the target API server, which is configured via the Terraform resource vault_kubernetes_auth_backend_config.
This is what allows Vault to delegate authentication to the tenants’ Kubernetes cluster.
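On the cluster side, the service account Vault uses for token review is typically bound to the built-in system:auth-delegator ClusterRole. We manage this through Terraform, but as a plain-manifest sketch it might look like the following, where the vault-auth name and vault namespace are hypothetical:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vault-auth           # hypothetical service account used for token review
  namespace: vault           # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vault-auth-tokenreview   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator    # built-in role that permits TokenReview requests
subjects:
  - kind: ServiceAccount
    name: vault-auth
    namespace: vault
```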
Once you’ve configured Vault to allow for Kubernetes authentication, you’re ready to start injecting vault agents onto tenant application pods so they can access the vault using short-lived tokens. But this is a subject for a future post.
Are you cloud native?
At Expel, we’re on a journey to adopt zero trust workflows across all layers of our cloud infrastructure. With HashiCorp Vault, we’re able to introduce these workflows when accessing application secrets or allowing dynamic access to infrastructure resources.
We also love to protect cloud native infrastructure. But getting a handle on your infrastructure’s security observability is easier said than done.
That’s why we look to our bots and tech to improve productivity. We’ve created a platform that helps you triage Amazon Web Services (AWS) alerts with automation. So, in addition to these best practices, I want to share an opportunity to explore this product for yourself and see how it works. It’s called Workbench™ for Engineers, and you can get a free two-week trial here.
Check it out and let us know what you think!