Written by Jonathan Katz, Principal Product Manager - Technical, AWS
Kubernetes has fundamentally changed the way organizations manage and deploy software. At its core, Kubernetes provides a unified API for infrastructure that abstracts differences between cloud providers and their underlying infrastructure so that Kubernetes users can focus on defining their application workloads. By popularizing declarative management of resources, Kubernetes has transformed the IT landscape, particularly through “GitOps” workflows. Kubernetes is also extensible, enabling users to build their own APIs through custom resource definitions and controllers.
Kubernetes has many built-in conveniences for managing “stateless applications,” or services that do not require storing data. However, at some point, most applications do need to store data either in a database or as a blob on disk. Let’s look at how we can use Kubernetes to manage data.
What is Kubernetes?
Before we look at how we can manage data with Kubernetes, we need to first understand the container, a basic building block of modern applications.
A container is a process that combines an image, a special filesystem, with a runtime that mounts the image and executes commands within it. Container images are typically built to contain all the dependencies needed to run the applications they package. A container runs as a single user and has a restricted filesystem that can prevent modifications at runtime. Everything in a container is isolated: one container cannot peek inside another unless it is given special permission. These attributes made containers popular for building, distributing, and running applications.
Tracking how containers are deployed in large environments can be challenging. This is where Kubernetes helps. Kubernetes is a container orchestration framework that lets users declare what they want their environment to look like and the Kubernetes engine “makes it so.” Kubernetes provides many primitives for deploying these containers, including runtime management (e.g., “Pods”, “Deployments”), networking (e.g., “Services”), dynamic configuration (e.g., “ConfigMaps”, “Secrets”), and storage, to name a few.
Kubernetes' power comes from managing workloads declaratively instead of imperatively. In an imperative workflow, you must step through each dependency one at a time, e.g., “first I set up my database, then I get the password, then I give the app the password, finally I set up the networking.” In a declarative workflow, you describe what you want: “I want to connect my app to a database and expose my app publicly.” While all the pieces of your deployment may not be ready immediately, Kubernetes takes care of deploying each component and makes it available once it is set up.
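As a concrete illustration, the declarative statement “I want three copies of my web app running” becomes a short Deployment manifest. The names and image below are illustrative:

```yaml
# A minimal, illustrative Deployment: we declare the desired state
# (three replicas of an nginx container) and Kubernetes reconciles
# the cluster toward it, replacing failed Pods automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
```

Applying this manifest with `kubectl apply -f` tells Kubernetes the end state; you never script the individual steps to reach it.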
GitOps for Data Management
Whether you are managing stateless or stateful workloads, Kubernetes users favor GitOps methodologies for managing their workloads. GitOps provides several principles that simplify infrastructure management, including versioning the state of your infrastructure similar to how you would version code. This is a powerful concept: like software versioned in Git, any Kubernetes manifest managed using GitOps represents the state of your infrastructure. Users can deploy infrastructure from Git just as they deploy software, and they can roll infrastructure changes forward or back simply by referencing Git commits.
GitOps itself is a framework for software management and can be used with a variety of tooling. This includes specialized Kubernetes tooling such as Helm, Flux, and ArgoCD, as well as existing infrastructure-as-code tools that know how to reconcile changes. Regardless of the choice of tooling, the fundamental output of using GitOps with Kubernetes is a series of YAML files that describe the desired infrastructure.
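For example, with ArgoCD the link between a Git repository and a cluster is itself declared as a manifest. The repository URL, path, and names below are placeholders for your own setup:

```yaml
# An ArgoCD Application that continuously syncs manifests from Git
# into the cluster. repoURL and path are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-config
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift back to the Git state
```

With this in place, rolling infrastructure back is a `git revert` followed by a sync.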
While the idea of “time travel,” or the ability to move infrastructure between desired states, is a powerful concept, we need to take additional steps with this approach when managing data. For example, while we may want to resize a database instance, we do not want to lose the data residing in the database! Likewise, some database schema changes cannot be rolled back without losing information. To use GitOps for data management, we will need to use Kubernetes' extensibility to create specialized APIs for safely working with data workloads.
Approaches for Managing Data on Kubernetes
Kubernetes was initially designed for stateless workloads, but it has matured to support applications that require storage. For complex applications like databases, developers can build tools called Kubernetes Operators that understand the nuances of a specific stateful system. These Operators leverage the Kubernetes API to let users declaratively manage their system’s configuration.
There are many open-source Kubernetes operators that deploy databases directly on Kubernetes. These “native” operators use storage primitives such as Persistent Volumes and Container Storage Interface (CSI) drivers to work with the storage layer. Mature database operators can support required features for production deployments, including high availability, backup management, monitoring, automatic upgrade management, and elasticity.
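Under the hood, such operators typically request storage through a PersistentVolumeClaim, which a CSI driver satisfies by provisioning a volume. A minimal claim might look like the following; the storage class name is an assumption and depends on your cluster:

```yaml
# A PersistentVolumeClaim for database data; the CSI driver backing
# the named storage class provisions the underlying volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce        # one node mounts the volume read-write
  storageClassName: gp3    # assumed CSI-backed class; varies by cluster
  resources:
    requests:
      storage: 20Gi
```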
A native Kubernetes operator for a database can simplify data management directly on Kubernetes and provide many features found in managed services such as Amazon Relational Database Service (RDS). However, while operators managing databases on Kubernetes can provide a “set-and-forget” experience when things work, what happens when there are problems? Troubleshooting can involve looking at many more places in the stack: the application, the database, the operator, the storage layer, and the Kubernetes environment itself. This greatly increases the surface area for errors and requires building a team with specialized expertise to ensure availability.
A managed service removes many of these challenges, as the service takes responsibility for compute, storage, and networking. To fit a managed service into GitOps workflows, we need a way to declaratively describe a database deployment and a simple way to connect a Kubernetes application to it. This is where AWS Controllers for Kubernetes creates a GitOps bridge between Kubernetes and AWS.
AWS Controllers for Kubernetes (ACK): The Managed Service Experience for Kubernetes
AWS Controllers for Kubernetes (ACK) is a set of Kubernetes Operators (or controllers) that let users manage AWS services directly from the Kubernetes API. This includes services such as Amazon EC2, Amazon S3, Amazon RDS, Amazon Aurora, Amazon ElastiCache, Amazon MemoryDB, Amazon DynamoDB, and others. Each of these controllers runs as a containerized application inside of Kubernetes, and its access permissions to AWS can be fine-tuned using IAM roles for service accounts (IRSA). For example, ACK users can connect Kubernetes applications directly to managed databases in Amazon RDS.
Using ACK provides a Kubernetes native experience for using AWS database services. A user can manage their database from a Kubernetes manifest (a YAML file) and directly inject the connection information to their application. ACK lets Kubernetes developers deploy AWS databases in their GitOps CI/CD pipelines the same way they would use a Kubernetes native Operator.
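A sketch of what this looks like with the ACK RDS controller follows. The field names track the controller’s DBInstance custom resource at the time of writing and the values are illustrative; consult the ACK documentation for the current schema:

```yaml
# An ACK DBInstance resource: applying this manifest asks the RDS
# controller to create a managed PostgreSQL instance in your account.
apiVersion: rds.services.k8s.aws/v1alpha1
kind: DBInstance
metadata:
  name: my-db
spec:
  dbInstanceIdentifier: my-db
  engine: postgres
  engineVersion: "14"
  dbInstanceClass: db.t4g.micro
  allocatedStorage: 20
  masterUsername: appuser
  masterUserPassword:        # reference to a pre-created Kubernetes Secret
    namespace: default
    name: my-db-password
    key: password
```

Once the instance is available, the controller records details such as the endpoint in the resource’s status, which can then be surfaced to the application for its connection configuration.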
With ACK, Kubernetes users do not need to worry about managing their databases. Amazon database services such as Amazon RDS, Amazon Aurora, Amazon ElastiCache, Amazon MemoryDB, and Amazon DynamoDB handle all of the operational aspects of data management: high availability with multi-AZ, backup management, monitoring, performance insights, autoscaling, and more.
For users with relational databases that have variable workloads, Aurora Serverless v2 provides a solution that scales up and down with the application. While Kubernetes lets users “pack” many databases with variable workloads onto a single node, these databases can put strain on the node and on each other. Aurora Serverless v2 handles “heat management” to prevent “noisy neighbors” from affecting performance.
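With the same ACK controller, an Aurora Serverless v2 cluster is expressed by adding a scaling configuration to a DBCluster resource. Again, field names follow the CRD at the time of writing, and capacity is measured in Aurora capacity units (ACUs):

```yaml
# An ACK DBCluster with Aurora Serverless v2 scaling: capacity floats
# between the configured minimum and maximum as load changes.
apiVersion: rds.services.k8s.aws/v1alpha1
kind: DBCluster
metadata:
  name: my-aurora
spec:
  dbClusterIdentifier: my-aurora
  engine: aurora-postgresql
  serverlessV2ScalingConfiguration:
    minCapacity: 0.5   # ACUs at idle
    maxCapacity: 16    # ceiling under peak load
  masterUsername: appuser
  masterUserPassword:  # reference to a pre-created Kubernetes Secret
    namespace: default
    name: my-aurora-password
    key: password
```

A DBInstance with `dbInstanceClass: db.serverless` is then attached to the cluster to provide the compute that actually scales.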
The GitOps Way to Use AWS Databases
A question at many organizations is whether Kubernetes is ready to manage their production data. Many of these organizations look for the same managed experience that they get from Amazon RDS, with multi-AZ support, automated backups, easy compute and storage scaling, monitoring and metrics, automatic software patching, and more. Why settle for a managed service-like experience when you can get the full experience?
ACK lets organizations connect their Kubernetes workloads directly to AWS database services and not worry about self-managing. You can still manage and scale your applications on Kubernetes while having them backed by a proven managed database solution. Aurora Serverless v2 provides the full “Kubernetes database” experience by scaling automatically with your Kubernetes applications.
You can learn more about AWS Controllers for Kubernetes (ACK) from the project documentation, which includes tutorials on using ACK with Amazon RDS, Amazon MemoryDB, and Amazon Aurora Serverless v2. There are also videos on how ACK works and how to manage Amazon MemoryDB using ACK and blog posts that walk you through deploying an application on Kubernetes with Amazon RDS using ACK and Amazon MemoryDB.