Using Control Theory to Automate Database Operations

There’s been a lot of discussion recently about autonomous databases, which offer the promise of keeping database systems stable and performant while alleviating much of the tuning and configuration drudgery that DBAs must typically slog through.

At the same time, there’s some debate about the meaning of “autonomous” in the context of database management systems. Some argue that autonomous databases should be fully self-contained and self-driving entities that virtually eliminate DBAs.

The real challenge for database management systems is to offer autonomous capabilities to any DBA, whether working with on-premises, private cloud, or public cloud deployments. An autonomous database needs to strike a balance between giving administrators too much and too little control. Too much control takes the form of endless configuration and tuning parameters that place the burden of performance and stability on the operator. Too little control takes the form of automated behaviors that reduce the predictability of the system to an unacceptable level.

The answer lies in kubernetes. No, not the container orchestration system! Kubernetes, the Greek word for helmsman, is the root of "cybernetics" and the founding metaphor behind control theory. The helmsman continuously adjusts the rudder and sheets, aiming for a compass heading, all while responding to feedback from an ever-shifting environment: wind, currents, waves, and stars.

To follow this metaphor, databases are complex systems that can use feedback to continuously self-correct as conditions change. In this article, we'll examine how control theory can be used to strategically automate key aspects of runtime database administration. Control theory provides the tools needed to minimize operational overhead while eliminating many routine but risky administrative tasks. This approach to automation treats the database as a closed signaling loop, in which actions produce changes in the environment, and those changes in turn feed back into the system.

Closed-loop control systems traditionally involve a controller and actuators that drive the output toward a desired state. The gap between the current state and the desired state is called the error. The error is fed back into the inputs, which is why closed-loop control systems are also called feedback control systems.

Take, for example, a real-world closed-loop control system: an industrial controller that regulates the water level in a holding tank. To keep the water at a prescribed level, a valve functions as the actuator. Using feedback from the system, the controller opens and closes the valve to regulate the flow of water out of the tank.
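The tank example can be sketched in a few lines. This is a toy proportional controller, with all names and constants invented for illustration (not taken from any real industrial controller): the valve opening is made proportional to the error, and the level settles once outflow matches inflow.

```python
# Toy proportional controller for the holding-tank example.
# All names and constants here are illustrative.

def run_tank(setpoint=50.0, level=80.0, inflow=2.0, gain=0.1, steps=200):
    """Drive the water level toward the setpoint by actuating a drain valve."""
    for _ in range(steps):
        error = level - setpoint          # gap between current and desired state
        valve = max(0.0, gain * error)    # actuator: open the valve wider as error grows
        level += inflow - valve           # water enters at a fixed rate, drains via valve
    return level

print(round(run_tank(), 1))  # → 70.0
```

One well-known property this sketch exposes: a pure proportional controller settles with a residual offset (here the level stabilizes at setpoint + inflow/gain, i.e. 70 rather than 50), which is why practical controllers often add integral and derivative terms (PID) to drive the error all the way to zero.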

Similar to the storage capacity of a holding tank, database resources are finite. Internal database operations compete with customer-facing workloads for shared resources: CPU, disk I/O, and network bandwidth. Just as the tank's controller actuates the valve, an autonomous database regulates the resources available to different workloads.

As any DBA can attest, traditional databases are static and brittle in how they handle this interplay. Operators need to fiddle with various database-specific tuning parameters on each machine. There is very little predictability, and what is the ideal balance today may no longer suffice tomorrow. History has shown that this kind of manual, centralized decision-making can be disastrous in complex systems. An autonomous database removes the operator from the loop and automatically tunes the database to achieve specific performance objectives.

Compaction as an Example

This sounds good in the abstract, but how do we concretely apply these ideas to specific database operations? How can we see autonomous capabilities at work?

One routine maintenance task that could benefit from automation is compaction. Compaction is a process that is unique to NoSQL databases with storage layers based on Log-Structured Merge-trees (as opposed to B-trees). The list of databases that employ compaction includes Google BigTable, HBase, Apache Cassandra, and Scylla. In these systems, data is written into memory structures (usually called memtables) and later flushed to immutable files (called sorted string tables, or SSTables). Over time, those SSTables accumulate and at some point must be merged together by the process known as compaction. This background task occurs only periodically. In distributed systems that store lots of data and are incorrectly tuned, it has the potential to create cascading effects that can cripple a running deployment. At a minimum, compaction processes can significantly impact response times for application users, as the compaction processes hog bandwidth and CPU to do their work.

The problem is that configuring compaction properly requires intimate knowledge of both expected end user consumption patterns and low-level database-specific configuration parameters.

An autonomous database obviates these low-level settings, using algorithms that let the system self-correct and find the optimal compaction rate under varying loads. But how does it do this specifically for compaction? As DBAs familiar with LSM-based NoSQL databases can tell you, the core problem with compaction is amplification. A pile of uncompacted data leads to both read amplification and space amplification, slowing and potentially crashing the system. Since the goal of the system is to minimize amplification, amplification is the property of the compaction process that can be optimized using control theory.
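To make the two kinds of amplification concrete, here are simplified, back-of-envelope definitions; they capture the intuition but are not any particular database's exact accounting:

```python
# Back-of-envelope measures of amplification for an LSM store.
# Simplified definitions for illustration, not exact accounting.

def read_amplification(num_sstables):
    """Worst case: a point read may have to probe every uncompacted SSTable."""
    return num_sstables

def space_amplification(total_bytes_on_disk, live_data_bytes):
    """Disk footprint relative to the logical data size; 1.0 means fully compacted."""
    return total_bytes_on_disk / live_data_bytes

# 3 GB on disk holding 1 GB of live data => 3x space amplification:
print(space_amplification(3 * 10**9, 10**9))  # → 3.0
```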

To do so, the system needs to measure the work required to revert to the desired state of zero amplification. We can refer to this measurement as the backlog. Backlog is the error that feeds back into the scheduler, which, like the water valve, works with a controller to drive an actuator. The efficiency goal is ‘backlog zero,’ where everything is fully compacted and there is no read or space amplification. To get to backlog zero, the system may have to rewrite the data many times.
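As a hypothetical sketch of that feedback path, the measured backlog (the error signal) can be mapped to the fraction of scheduler shares granted to compaction. The gain `k` and the clamping bounds below are invented for illustration:

```python
# Hypothetical feedback path: the backlog measurement is mapped to the
# fraction of CPU/I/O shares the scheduler grants to compaction.
# The gain k and the min/max bounds are assumptions, not real defaults.

def compaction_share(backlog_bytes, k=1e-9, min_share=0.05, max_share=0.5):
    """More backlog -> more resources for compaction, within sane bounds."""
    return min(max_share, max(min_share, k * backlog_bytes))

print(compaction_share(0))        # → 0.05: floor share at backlog zero
print(compaction_share(10**10))   # → 0.5: a large backlog is clamped at the ceiling
```

The clamping matters: the floor keeps compaction from starving entirely, while the ceiling guarantees foreground workloads always retain a majority of resources.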

A backlog controller can be designed to govern specific compaction strategies. For this example, we use the well-known size-tiered compaction strategy (STCS). STCS compacts SSTables of similar size: the controller waits until four SSTables of similar size accumulate and then compacts them together. As those SSTables are compacted, much larger SSTables may be generated in a new size tier, with a backlog of its own. An autonomous controller can use this information as input to an algorithm that continuously adjusts resource allocation based on the calculated backlog, minimizing amplification.
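A minimal sketch of the size-tiering step, assuming a logarithmic bucketing scheme and the threshold of four tables mentioned above (this is illustrative, not ScyllaDB's or Cassandra's actual implementation):

```python
# Illustrative size-tiered grouping (not a real database's implementation):
# SSTables whose sizes fall within roughly a 4x band share a tier, and a
# tier holding four or more tables becomes a compaction candidate.
import math

def tier_of(size_bytes, base=4):
    """Assign an SSTable to a tier; tables within a ~4x size band share one."""
    return int(math.log(max(size_bytes, 1), base))

def ready_tiers(sstable_sizes, threshold=4):
    """Return the groups of similarly sized SSTables that are ready to compact."""
    tiers = {}
    for size in sstable_sizes:
        tiers.setdefault(tier_of(size), []).append(size)
    return [group for group in tiers.values() if len(group) >= threshold]

# Four ~2 MB tables land in the same tier, triggering a compaction:
print(ready_tiers([2_000_000, 2_100_000, 1_900_000, 2_200_000]))
```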

Autonomy in Action

To demonstrate autonomous capabilities under laboratory conditions, we can put a database system into steady state, at 100% utilization, with a simple write-only workload: writing 1KB values into a fixed number of random keys. Once the database is at equilibrium, a compaction process will naturally kick off. Suddenly, the resources available to the write-only workload are contested by the compaction process.

The internal compaction process initially consumes very little CPU time relative to its allocated resources. Over time, the CPU time consumed by compactions increases until the system reaches steady state: the proportion of time spent compacting becomes fixed, and the system no longer fluctuates.

As new data is flushed and compacted, the total disk space oscillates around a specific value. In this steady state, the compaction backlog sits at a constant level where data is compacted at the same pace as new work generated by incoming writes.

The effect on overall performance is that the CPU and I/O schedulers enforce the priorities assigned to internal processes, ensuring that each consumes resources in exact proportion to its priority. Since priorities are constant, latencies, as seen by the server, are predictable and stable at every percentile.

A sudden increase in request payload leads the system to ingest data faster. As the rate of data ingestion increases, the backlog expands, in turn requiring compactions to transfer more data.

With the new ingestion rate, the system is perturbed and the backlog grows faster than before. However, the compaction controller automatically increases the resource allocation to compactions, and the system in turn reaches a new state of equilibrium, without any operator intervention.
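A toy simulation shows why the system finds a new equilibrium rather than falling behind: if the controller drives compaction at a rate proportional to the backlog, the backlog converges to the ingestion rate divided by the controller gain, so a heavier ingestion rate yields a higher but still stable steady-state backlog. The constants are invented for illustration:

```python
# Toy simulation with invented constants: compaction bandwidth is made
# proportional to the backlog, so the backlog converges to
# ingest_rate / gain instead of growing without bound.

def settle_backlog(ingest_rate, gain=0.2, steps=500):
    """Iterate the feedback loop until the backlog settles."""
    backlog = 0.0
    for _ in range(steps):
        compaction_rate = gain * backlog      # controller: feedback from the backlog
        backlog += ingest_rate - compaction_rate
    return backlog

print(round(settle_backlog(10.0)))   # → 50: first equilibrium
print(round(settle_backlog(20.0)))   # → 100: heavier ingestion, new stable equilibrium
```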

Freeing Up DBAs for Important Tasks

In a typical database management system, an operator would need to log in to the machine, tweak tuning parameters, and in some cases completely restart the server for the changes to take effect. In contrast, an autonomous controller continuously, and more importantly, independently, strikes a balance between compaction and foreground tasks, without any operator intervention. It does this by applying concepts from control theory to database design, using feedback from the system to dynamically allocate resources.

Automating compaction tuning is only one example of the many ways a database sub-system can ease the administrative burden for DBAs. An autonomous database shouldn’t try to be a black box, but should help extricate DBAs from risky and nerve-wracking tasks. As such, the goal of database autonomy should not be construed as the ultimate elimination of the DBA. On the contrary, an autonomous database frees up DBAs and empowers them to focus on tasks that are more important and satisfying than babysitting production systems.