The Undiscovered Paragon of Cloud Cost Optimization

Public cloud has skyrocketed to become the indisputable go-to destination for new IT workloads, especially for data. And while the cloud is full of advantages over on-premises data platforms, cost optimization rarely makes that list.

Because modern data professionals build and maintain this platform, they should also know how to objectively reduce its costs and capture what is otherwise an enormous missed business opportunity.

In the past with on-premises data platforms, data professionals were primarily tasked with ensuring that the data was available and keeping access times fast.

Most organizations would periodically purchase new hardware to insulate and mask design inefficiencies from the users.

Every few years, the IT hardware budget grew by a (mostly) predictable percentage, and the business made the best of the platform for the hardware lifecycle. If the servers and databases needed any more of the data center resources mid-cycle—such as CPU, memory, network, and storage—IT would then purchase and provision the additional compute. Even with further resources purchased, the refresh cycle lent itself to a more predictable, cost-controlled model.

Budget Spend

Today’s data professionals must not only maintain data availability and speed, but are also tasked with selecting the correct cloud data service(s), determining the scale and criticality of the platform, and securing the boundary of the service. Yesterday’s siloed data professional is today’s “full stack” DBA. And just as the data professional’s role has shifted over time with the adoption of public cloud, so has the way IT budget spend works. Cloud replaced the more consistent model of annual CAPEX with an ongoing stream of OPEX charges, where everything becomes a monthly bill. With virtually unlimited compute and storage resources just a corporate credit card charge away, unintentional overallocation unfortunately becomes the norm.

Despite this change from CAPEX to OPEX, DBAs often still use the outdated framework and add compute resources subjectively. In other words, “My data needs all of this to go fast.” Public cloud has eliminated the constraints of hardware availability, and purchasing additional resources with neither those constraints nor objective measurement has proven incompatible with cost control in the new “cloudy” world. Because most cloud data services are charged on a pay-by-allocation model, rather than the pay-by-consumption model that marketing buzz often suggests, the charges for overallocated and underutilized data services add up very quickly by the end of each month.

The cost for these “necessary” resources is always an unwelcome surprise to the company check-writer.
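To make the pay-by-allocation point concrete, here is a minimal Python sketch that splits a monthly bill into the part that was used and the part that sat idle. The hourly rate, the 730-hour month, and the function name are illustrative assumptions, not figures from any provider's price list.

```python
HOURS_PER_MONTH = 730  # assumption: a common approximation of hours in a month


def monthly_allocation_cost(hourly_rate, avg_utilization, hours=HOURS_PER_MONTH):
    """Split a pay-by-allocation bill into billed and idle amounts.

    hourly_rate: price per hour of the provisioned size (hypothetical figure)
    avg_utilization: fraction of the allocation actually consumed, 0.0 to 1.0
    """
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    billed = hourly_rate * hours              # charged whether or not it is used
    idle = billed * (1.0 - avg_utilization)   # portion paid for unused capacity
    return {"billed": billed, "idle": idle}


# Hypothetical service: 2.00 per hour, averaging 25% utilization.
bill = monthly_allocation_cost(hourly_rate=2.0, avg_utilization=0.25)
print(bill)
```

With these made-up numbers, 1,095 of the 1,460 billed for the month pays for capacity that was never consumed.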

Cost Control

So, knowing cloud resources are more often than not overallocated, how can the modern data professional assist with controlling costs?

The overwhelming underlying miss is that the individuals who track and scrutinize the actual consumption rates of compute resources, both on-premises and in the cloud (usually the infrastructure team), are rarely the same people who understand the “why” behind the resource usage (usually the data team).

To align the two silos, modern operational data pros must now add efficiency, and therefore cost optimization expertise, to their job responsibilities. For them to build this skill, the business must allow them the time to review the usage patterns and do something about them. The data pro must also not be afraid or unwilling to reduce the cloud service’s scale if the telemetry justifies it.

Resource Allocation

It’s time to be objective. Ask this simple question: “How are my servers using the resources currently allocated to them?” Basic telemetry is provided by the cloud and data services, and a bit of elbow grease will provide more insightful data. This holds true no matter the data platform: database-as-a-service (such as Microsoft Azure SQL Database), database instances-as-a-service (Azure SQL Managed Instance or Amazon RDS for SQL Server), non-relational and other data services (such as Azure Data Lake or Azure Cosmos DB), or full VMs-as-a-service with a self-managed data platform installed and configured inside.

The words “baseline” and “right-sizing” should become part of the daily vocabulary of the data pro. So should the phrase “allocation versus consumption.”

To understand the art of right-sizing, or allocating the appropriate amount of compute resources or service scale, you must first understand the service’s resource consumption baseline. Any cloud data service is provisioned with some form of a scale. A cloud VM is provisioned with a certain number of CPU cores and amount of memory, and storage is then attached to the VM. An Azure SQL Database is provisioned with either a certain number of virtual CPU cores (the vCore model) or a number of Database Transaction Units, or DTUs, a bundled measure of CPU, memory, and I/O, and comes with a finite amount of storage.

All the available telemetry must be reviewed to determine a baseline of the ongoing usage patterns. The baseline illustrates what a “normal day” of consumption looks like. Compare this baseline to the allocated compute to determine the headroom within the service.
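The baseline-versus-allocation comparison can be sketched in a few lines of Python. This is only one way to do it: it assumes the telemetry has already been exported as CPU utilization samples expressed as a percentage of the allocated compute, and it uses the 95th percentile as the "normal day" figure. The percentile choice, the sample values, and the function names are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples provided")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def headroom(samples, pct=95):
    """Compare the consumption baseline to the allocation (100%)."""
    baseline = percentile(samples, pct)
    return {"baseline_pct": baseline, "headroom_pct": 100 - baseline}


# Hypothetical CPU utilization samples, as percent of allocated compute.
cpu_samples = [10, 12, 15, 11, 14, 13, 40, 12, 11, 10]
result = headroom(cpu_samples)
print(result)
```

A large headroom figure sustained across several representative business days is the signal that the service is a candidate for downsizing. The same comparison applies to memory, storage, and I/O metrics.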

Regardless of the provisioned scale, if the service is woefully overprovisioned, and the workload demands only consume a small subset of the available compute resources, your organization is most likely paying too much for that service.

Even if your data service is truly a pay-by-consumption model, the usage patterns still must be scrutinized to look for room for optimization. For example, if your data is loaded each night, and the loader leaves the source data around without cleaning up after itself, your organization will pay for the consumption of the stale data just sitting there taking up space.

Review and Right-Size

If a service is quite overallocated, right-size it by downsizing the scale of the service. Make sure you leave sufficient headroom for upcoming business growth cycles, but be conservative. Re-baseline after a few business days to see how the usage patterns have settled in. You might be surprised to find that it runs the same, or even faster, with less allocated compute. (But, if you see it maxed out, you’ve gone too far in your right-sizing, so of course allocate more to maintain speed.)
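As a rough illustration of the downsizing decision, the Python sketch below picks the smallest available size that still leaves a chosen amount of headroom above the measured baseline. The 30 percent headroom target, the list of vCore sizes, and the function name are hypothetical; substitute the sizes your service actually offers and a headroom figure that fits your growth expectations.

```python
def recommend_size(baseline_vcores, available_sizes, headroom=0.30):
    """Return the smallest size that keeps `headroom` (a fraction of the
    new allocation) free above the baseline, or None if no size fits."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be at least 0 and below 1")
    # Baseline may occupy at most (1 - headroom) of the new allocation.
    required = baseline_vcores / (1.0 - headroom)
    for size in sorted(available_sizes):
        if size >= required:
            return size
    return None  # nothing listed is large enough; do not downsize


# Hypothetical service: 32 vCores allocated, baseline of 3.2 vCores consumed.
sizes = [2, 4, 8, 16, 32]
suggestion = recommend_size(3.2, sizes)
print(suggestion)
```

Here a 4-vCore size would cover the baseline but leave less than the requested headroom, so the sketch suggests 8. Treat the result as a starting point, then re-baseline to confirm.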

Performing this review and right-sizing exercise on a single server could save thousands or tens of thousands off the monthly bill. Now, picture the impact of extending this practice to all your servers on a regular basis by incorporating it into your weekly and monthly routine.

I can assure you that by reclaiming even a modest portion of the monthly OPEX, the only surprise your company’s check-writer will find at the end of the month is a highly welcome one.