The Birth of an Emergent Database

Jan 3, 2013

By Wiqar Chaudry

The emergence of web-scale apps has put us in the midst of a database crisis. Mobile apps, cloud-based SaaS/PaaS architectures and the distributed nature of the web have forced the software industry to make difficult compromises on how they collect, process and store data.

While traditional databases provide the power and simplicity of SQL and the reliability of ACID, they don’t scale without herculean-inspired workarounds. Newer, NoSQL solutions come close but don’t quite make the last mile. They’re designed to scale elastically, even on commodity hardware, but force developers to program powerful querying features into their application and throw away years of learning SQL skills, tools and languages.

A database that meets today’s requirements should be SQL, ACID-compliant if you need it, elastic and able to perform on commodity hardware. It should behave naturally like users of a social network, the web itself or a flock of birds taking flight. It should be resilient so that if one process within it goes down the whole database doesn’t go down with it. It should be emergent.

To give developers a truly modern solution for this century, we need to rethink how we process the collection and storage of data. It’s time for a revolution.

The SQL Database

For decades, developers have come to rely on traditional databases built on industry standards – the simplicity and power of SQL and the reliability of ACID guarantees. Over the years, these systems have evolved and adapted to meet the demands of our ever-expanding need to generate, consume and process more data, faster. But … there is a problem. Traditional SQL databases were not designed for web-scale apps or modern data center architectures.

SQL Tradeoffs

What does it take to build a web-scale app on a traditional SQL database?

A) Bigger machine

B) Lots of bigger machines

C) Developer acrobatics

D) All of the above

Answer: D) All of the above

Talk to any database vendor, and the answer is always “you can run the database on a bigger machine or two or more.” For companies and developers who have the need and the means, this approach might work… for now. However, what happens if they run out of capacity again, or worse, what if demand weakens and they need to scale back? Companies that get locked into bigger machines incur greater costs to maintain their data centers and expose themselves to an excessive amount of disaster recovery and business continuity-related risks. In order to mitigate that risk, companies have two options: they can either buy even more expensive hardware or resort to acrobatics.

Developer Acrobatics

There are several types of database bottlenecks that developers and database administrators need to address. These limitations can be I/O, CPU, memory and/or user related. Some details:

I/O, CPU and memory-related limitations are obvious. Basic SQL operations such as create, read, update and delete even at moderate frequencies are very resource-intensive. Every hardware configuration has its resource limitations. Each database host machine is only able to accommodate a fixed number of maximum connections and throughput. For companies that have the means, these limitations are generally mitigated through the acquisition of more powerful hardware. For companies that have reached a hardware peak, developers resort to the practice of clustering and sharding their systems.

Clustering and Sharding

The practice of clustering physically distributes a database from a single monolithic instance to more of a master / slave configuration. This solves a lot of the I/O, CPU and memory related resource constraints associated with the master. Developers can point write-intensive operations to the master database and run read-intensive operations from one of the slave instances. However, clustering a database just replicates data from the master to the slave instances in an unidirectional manner. This means that any changes made to the data on the slave instances will not be synchronized back to the master database.

Another widely-used technique for easing resource constraints related to managing super large volumes of data in a single database table is sharding. Sharding is the practice of horizontally partitioning database tables across a number of host machines. On its surface, this sounds like a novel idea, however, sharding makes a bad problem even worse.

The practice of sharding causes many administrative headaches for developers in terms of key management. Fundamentally, a database table has a primary key that consists of data from a single column or a combination of columns. When developers and DBAs shard a table, the database inherently can only enforce uniqueness of a primary key within a shard. This means that the burden of enforcing uniqueness across all of the records in a table now falls on the application developer.

Users also cause headaches for DBAs and developers. In some cases, they execute complex resource-intensive queries that can bring the database to a crawl. These queries often perform joins across a normalized database or load excessive amounts of data into memory. The natural reaction of developers is to denormalize the database to give users better performance. This too has unintended consequences. By gaining a little bit of performance, you have introduced data redundancy and more complexity in terms of analyzing the data and its relationships.

Enter the Cloud-like Cluster?

Even the most sacred and highly-sensitive data is being moved to the cloud (think electronic health records or financial data). Database vendors are not at all responding to the fact that Web apps are where the industry is headed. Most have only responded with the Master/Master cluster configuration “remedy.” Here is what they promise:

There is no single point of failure because all nodes or database instances are configured as the master. Sharding is handled automatically. Data gets replicated both synchronously and asynchronously. Developers can dynamically add and remove nodes without bringing down the system. Most importantly, the clustering acrobatics are abstracted away from the application. Sounds like database utopia! Or not?

As always, the devil is in the details. Some of the shortcomings of these workarounds include very complex configurations, rigidity when it comes to scaling and no such thing as a one-click "add host" option. Beyond that, many unpleasant tradeoffs have been made when it comes to supporting critical things like the SQL standard. Talk to anyone who has deployed a multi-master cluster and they will tell you that when it’s finally up and running it works great, until there is a problem. The bottom line is that the database vendors have not done what is really required. That is to completely re-think databases to behave the way web-scale apps need. Basically the acrobatic remedies are just putting lipstick on a pig.

Getting Closer: Enter the NoSQL Solution

You appreciate the multiple challenges that developers face when it comes to scaling traditional databases into the cloud. You can also see that your established database vendors are generally too slow to react to these needs and have not provided a solution that works the way web-scale cloud apps do. So what are developers supposed to do?

Many developers gave up, rolled up their sleeves and created their own solution. This in many ways was the genesis of the NoSQL solutions. Major companies adopted NoSQL strategies to mitigate the shortcomings of traditional database systems in the cloud. This includes the likes of Google and Facebook. These new solutions are designed to scale elastically, on commodity low cost hardware mimicking the ebb and flow of web traffic. But… there is still a problem.

NoSQL solutions put a requirement on the developer to program into the application the powerful querying features normally found in SQL. NoSQL software tends to lack ACID guarantees that are commonplace in SQL database technology today.

This likely requires you to tap into some SQL jockeys that know how to query and analyze data through commonly-available SQL clients. Are you interested in building apps that require ACID compliant transactions (think banks or healthcare apps)? If you answered yes to either of these questions, then maybe NoSQL can be a part of your solution, but it can’t be the only solution.

Building a Web-Scale Database

So …what’s it going to take to get a real web-scale database going without the acrobatics?

One that doesn’t force developers to make compromises. Or ignores years of your developing SQL skills, or learning BI tools and programming languages. You should not be forced to adopt a NoSQL solution for part of your application and a SQL solution for another part just because you need ACID-compliant transactions. Most importantly, it doesn’t force you to divert focus from your app development to deal with the extraneous code required to get a database to behave the way it should

The Emergent Database

Today’s web-scale apps are often consumer-facing and tend to behave like natural systems. There is a predictable ebb and flow to how these systems work. At the same time, many web-scale apps will stay dormant for long periods of time only to explode with a burst of activity at a moment’s notice. In fairness, perhaps these demands are simply too unnatural to impose on database systems that were never designed to handle them. Instead of complaining about the limitations of databases we might praise them for keeping up for as long as they did. But in this new era of commoditized, virtualized and consolidated datacenters, skinny budgets and given the movement to the cloud, no matter how you rationalize the predicament, it is time for a change.

We live in a 21^st century world and we consume 21^st century apps. The right solution isn’t the evolutionary process of updating 20^th century database technology.

It’s time for a database revolution - a complete re-think of how the process of collecting and storing data needs to work. The solution needs to reflect 21^st century IT requirements and the needs of today’s web app developers.

Today’s database should be 100% SQL so that the petabytes of data that have lived in traditional databases can be migrated with ease. It should be ACID if required so that existing applications can also be migrated. It should be elastic so that applications can explode into action when demand is high and contract when the resources are no longer required. Today’s database should perform and scale on commodity hardware lowering not only the cost of entry but also the cost of growth. It should behave naturally like users of a social network, the web itself or a flock of birds taking flight. It should be resilient so that if one process within it goes down the whole database doesn’t go down with it. It should be emergent.