In Search of the Elastic Database

The relational database - or RDBMS - is a triumph of computer science.  It has provided the data management layer for almost all major applications for more than two decades, and when you consider that the entire IT industry was once described as "data processing," this is a considerable achievement. 

For the first time in several decades, however, the relational database stranglehold on database management is loosening. The demands of big data and cloud computing have combined to create challenges that the RDBMS may be unable to adequately address.  

Over the past decade, we have seen an "industrial revolution" of data. During the industrial revolution of the 18th century, automation took over the production of goods that previously were created by hand, resulting in an explosion in manufacturing capacity. In the industrial revolution of data, enterprise data entry changed from in-house manual labor to an external feed of data from a variety of sources including customers, the public, point-of-sale devices, GPS, mobile devices, and RFID readers, resulting in an explosion in data generation.    

Until recently, processing and storage capacities have expanded in lock step with the increased storage and transaction rate demand. Moore's law predicts exponential increases in CPU, memory and storage capacity, which have been sufficient to sustain, and sometimes feed, the growth in data storage and processing demands. The explosion of data has outpaced the growth in RDBMS capacity predicted by Moore's law, however.  Many enterprises, including telecommunications companies and social networking leaders such as Twitter and Facebook, simply could not reliably or economically store or process all of their data in a single relational database.  

Unfortunately, unlike most other elements in the application stack, relational databases scale-out very poorly.  With a few exceptions, it's not possible to simply cluster your database and add nodes to the cluster as workload increases.  Relational databases work best when all the data is managed by a single node.

Big data, therefore, generates a desire for an elastic database that can scale across multiple computer hosts.  

Cloud computing frameworks like Amazon AWS and Microsoft Azure face a different, though related, challenge. The economic vision for cloud computing is to provide "pay as you go" computing resources on demand.  A pool of computing resources can exploit economies of scale, and a leveling of variable demand, by adding or subtracting computing resources from applications as workload demand changes.

Virtualization allows for sharing of workloads within a physical host computer, but to share across hosts requires some form of clustering technology. Web and application servers can form scalable clusters, but databases are another story. Database clusters of the "shared nothing" variety are commonplace, but require time-consuming and disruptive rebalancing when nodes are added or removed. Shared disk clustered databases such as Oracle's Real Application Servers (RAC) can increase or decrease cluster members faster, but have an unattractive licensing model for the cloud, generally are not deployed with very large node counts, and do not totally solve all the issues involved with scaling IO rates and data storage So, purveyors of cloud-based computing infrastructures also need an elastic database.

Database Sharding

Solutions to the "database as the bottleneck" dilemma have undergone a couple of major iterations. During the original explosion of web-enabled multi-tier applications, the common approach was to place the database on the most powerful hardware available. After the internet gold rush, the emphasis on total cost of ownership encouraged a greater focus on cost reduction, sparking interest in open source stacks and commodity hardware.  Databases such as MySQL, running on commodity x86 hardware, were much cheaper, but also less scalable.

MySQL on x86 Linux became the de facto standard for web 2.0 operations. A single such database could scale to 4 CPUs maximum, so a scale out solution was needed and the eventual solution involved three key technologies and techniques:

  • Replication can be used to create multiple copies of the database. Master-slave replication is used to create read-only copies for read-only queries. Master-master replication is possible, but entails a high risk of data inconsistency.
  • A read-only distributed object cache known as memcached keeps frequently-accessed database information in memory. Reading from this cache avoids database access altogether.
  • "Sharding" involves splitting the database into multiple chunks, each of which contains a horizontal partition of key information. Sharding is typically used to spread the write load across multiple databases without resorting to master-master replication.

By the time you've implemented a memcached, replicated, sharded database architecture, there's not much of the relational database left. Atomic transactions are no longer strictly implemented, SQL operations on the entire database are no longer possible, and joins and other complex operations must be implemented in middle tier code. In addition, maintenance operations are problematic: schema changes now need to be processed across multiple hosts in parallel, leading to inconsistencies both in the data and the schema.

Not only have the salient and most desirable features of the RDBMS been lost, but we've also created a maintenance nightmare.  Consequently, sharding has been described as "operationally obnoxious."

Having lost all the advantages of the RDBMS, and gained a lot of operational headaches in return, those involved in cloud-based infrastructure services and large-scale web applications almost simultaneously came to the same conclusion: we need a non-relational distributed DBMS.  As an acronym, NRDDBMS didn't catch on; instead the mostly inaccurate and widely regretted "NoSQL database" term was coined.

Big Data at Google

Long before massive websites experimented with sharding, Google confronted the granddaddy of all "big data" - web indexing. 

The number of active websites has been growing exponentially since before Google even existed. It has always been too massive for RDBMS and its access patterns would be difficult for the RDBMS to support even without the data volume issues. Google needed a solution that could scale with the web, but offer high performance without bankrupting the company.  Eventually, these requirements led to BigTable, which is the founder of a large branch of the NoSQL family tree.

A BigTable table may superficially resemble an RDBMS table, but can support a large, variable number of sparsely populated columns. Column families may contain millions of columns, with few of them common across all rows. These myriad columns often represent the name/value pairs that provide the ability for a BigTable entry to implement master-detail information in a single "row."

BigTable, which underlies many significant Google technologies including web indexing, Gmail and Google reader, has inspired several other NoSQL databases, such as HBase and HyperTable.     

It's a fact that when evolving from RDBMS to a more elastic data store, "something's got to give."  This is embodied in the now-famous "CAP Theorem," which states that a distributed database can support only two of the following three features:

  • Consistency, in which all sessions see the same view of the data at any point in time.
  • Availability, in which the database can continue to provide full functionality when cluster nodes fail.
  • Partition tolerance, in which the database continues to function even if some network segments are unavailable.

While the various non-relational databases differ in their acceptance of the CAP theorem, most of the new non-relational databases sacrifice some level of consistency for partition tolerance.  Put another way, they emphasize scalability ahead of RDBMS-style "ACID" transactions.

A Taxonomy of NoSQL

The term "NoSQL" was coined as a convenient way to assemble the community of developers working on non-RDBMS databases. As an adjective, however, it leaves much to be desired: it defines what these databases are not, rather than what they are, and explicitly excludes SQL language support rather than more pertinent features such as transactional integrity.

Various classification systems for NoSQL have been proposed. The most common categorizes NoSQL in terms of their underlying data model, which goes something like this:

  • Key-Value stores offer fast key lookups to a data value whose structure is undefined. The "value" may have a rich structure, but the data store itself does not define that structure.
  • Table-oriented data stores are like key-value stores, but define the value as a set of columns. The columns may be static and similar to RDBMS table columns, or could contain table families as in BigTable.
  • Document-oriented data stores include a complex, self-defining document as their "value." This could be an XML document, although most document databases use JavaScript Object Notation (JSON).

Other categorizations independent of the data model include consistency and distribution models.

Revolution or Evolution?

Some proponents of NoSQL claim the era of the RDBMS is coming to an end. That seems unlikely since the RDBMS provides a strong and proven solution for a wide range of application types, while the compromises of NoSQL databases only make sense for a small segment.  I think it's safe to say that RDBMS will be the dominant model for the foreseeable future.

On the other hand, it seems clear that the era of the RDBMS as the only database architecture is closing.  Elastic "NoSQL" databases - together with other technologies such as Hadoop and in-memory transactional data stores - seem to be a better fit for certain well-defined application types. 

It's worth considering, however, that a similar movement against RDBMS occurred in the mid-1990s, failing largely because the development community forgot that data has value to the business that is independent of its role in the application that captures it. 

Following the success of the object-oriented programming revolution in the early 1990s, many in the development community considered that the mismatch between the relational data model and the object oriented programming model was unacceptable.  They said data should be persisted within a database capable of representing the object structure of the application.  A number of object-oriented database systems (OODBMS) were developed, and intense mindshare battles between the OODBMS and the RDBMS communities were waged.

Within a few years, the OODBMS wave crested, probably for the same reason that the RDBMS gained such rapid business acceptance in the first place: the RDBMS, while never the fastest transaction model, allowed non-programmers to retrieve data through the easily understood table/column structure and relatively easy-to-learn SQL.  In the pre-relational era, requests for data from the hierarchical databases of the day had to be satisfied by specialist programmers, placing a big burden on the data processing department and causing long delays for decision-makers.  For the first time, the RDBMS allowed the business to get easy access to its data.

The ease of data retrieval from the relational database allowed the data processing department to become the information technology department, and enabled the era of business intelligence. As much as anything else, this enabled the RDBMS to rapidly become the dominant paradigm.

The lesson of history is clear: elastic "NoSQL" databases must do more than simply allow applications to store and process data quickly. They must also provide the decision support and business intelligence capabilities - consistently delivered by RDBMS - that have become mandatory in the modern enterprise. Until they do, it's not likely they will receive the full blessing of the enterprise CIO.