It’s rare in the world of software to see a single architecture dominate as comprehensively as the relational database model. The relational database (RDBMS)—actually a hybrid of Codd’s relational data model, Gray’s ACID transaction model, and the SQL language—defined every significant new database system released between 1985 and 2000.
This hegemony broke abruptly during 2008 and 2009. Suddenly, a variety of non-relational systems emerged, and in the 7 years that followed, many more database management systems arose, none of which adhered to the relational/ACID pattern. We went from a “one-size-fits-all” world to a “made-to-measure” one. Although the variety of database technologies represents a broadening of options, it also results in a somewhat more confusing world for database professionals.
For more insight into big data technologies, read the new Big Data Sourcebook
SQL Still Rules, OK?
The RDBMS model remains by far the dominant database technology today, whether measured by revenue, installations, or mindshare. For instance, the DB-Engines site ranks database systems according to their popularity, using website references, search frequency, job postings, and mentions on social networks and Stack Overflow. This methodology is, if anything, heavily biased toward emerging technologies that generate more “buzz.” Nevertheless, in its ranking, Oracle, SQL Server, and MySQL each generate a score four times higher than that of the most popular non-relational alternative (MongoDB).
That having been said, it’s absolutely true that the growth in RDBMS has slowed, while the upstart alternatives have gained significant traction. RDBMSs are losing ground in two key demographics:
- Modern web applications are now often built using an open source non-relational system, most commonly MongoDB, but also including Cassandra, Couchbase, and Redis.
- Big data systems, most notably the Hadoop/Spark combination, are pulling revenue from the largest data warehousing projects. As we will see, RDBMSs remain dominant in data warehousing, but a data warehouse no longer needs to cope with massive data volumes on its own; some of that data can now be offloaded to Hadoop.
Although the RDBMS is losing ground in these two key areas, it continues to maintain strongholds within the enterprise:
- On-premises CRM and ERP systems, and indeed a very wide variety of packaged applications, are still utterly dependent on relational schemas. However, it is also true that an increasing amount of this application workload is moving into cloud-based systems. These cloud-based systems may still be backed by relational stores, but this is transparent and irrelevant to the enterprise buyer.
- Real-time data warehouses and BI tools still rely heavily on the relational “star schema” model and on the SQL language. It’s possible through great effort to create a sort of “big data” real-time data warehouse experience, but this requires cobbling together multiple technologies. For the time being, the best place for a curated “single version of the truth” remains the relational star schema.
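The star schema mentioned above can be sketched in a few lines of SQL. The following is a minimal, illustrative example using Python's built-in sqlite3 module; the table and column names (fact_sales, dim_product, dim_store) are invented for the sketch, not drawn from any particular warehouse:

```python
import sqlite3

# A toy star schema: one central fact table (fact_sales) joined to two
# dimension tables (dim_product, dim_store). Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "music")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(10, "east"), (20, "west")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 25.0), (1, 20, 15.0), (2, 10, 40.0)])

# The classic BI query shape: join the fact table to its dimensions,
# then group and aggregate.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.category, s.region
    ORDER BY p.category, s.region
""").fetchall()
```

This join-and-aggregate pattern is exactly what BI tools generate behind the scenes, and it is the workload the relational star schema remains best at.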
The relational database, a triumph of software engineering, is still the best choice for the widest variety of workloads. However, the trend seems clear: While the relational database will remain significant, and may dominate over the next few years, it will be experiencing diminishing growth and reduced relevance.
“Big data” is an overused and somewhat poorly defined term but essentially refers to the coupling of massive, fine-grained datasets with smarter, adaptive algorithms to derive more value from data. Google is indisputably a pioneer of big data, having applied a unique smarter algorithm (“PageRank”) to the massively growing dataset of the early web. When Google published details of its internal “big data” technologies—Google File System, MapReduce, and Bigtable—it inspired the creators of Hadoop, who made an open source version of these core Google technologies.
The early nurturing of Hadoop by Yahoo allowed Hadoop to be proven at scale. By the time it emerged as a mainstream technology around 2010, Hadoop had already established that it was capable of economically storing and processing some of the world’s largest datasets.
Hadoop’s big drawback—its slow, disk-bound, batch-oriented nature—was alleviated by the emergence of Spark, an in-memory MapReduce framework that also provides easier programming semantics. The addition of SQL capabilities to Hadoop and Spark through the Hive and Spark SQL projects opened the ecosystem to non-programmers. Hadoop and Spark became a viable supplement to the traditional data warehouse and an essential ingredient in most big data projects.
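The MapReduce pattern at the heart of Hadoop can be illustrated in plain Python, without any cluster. This is only a conceptual sketch of the three phases (map emits key–value pairs, shuffle groups them by key, reduce aggregates each group), using the canonical word-count example:

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle phase: group the emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's group of values.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In Hadoop the map and reduce tasks run in parallel across many machines and the shuffle moves data over the network; the programming model, however, is essentially the one above.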
Hadoop and Spark only solved one element of the big data challenge: the provisioning of economic storage and processing capabilities for data at virtually any scale. The other element of the challenge, the development of “smarter” algorithms, remains a work in progress. While significant advances in machine learning and AI have occurred, practical application of these advanced algorithms still requires very highly skilled data scientists, who are in short supply. The failure to leverage the data held in big data systems has led to some disillusionment with the big data concept. Nevertheless, Hadoop and Spark have provided the necessary storage and processing infrastructure without which data science challenges could not even be attempted.
While Google was solving its big data problem, Amazon was facing unique challenges of its own. Scaling a global online transactional system was proving to be impossible within the constraints of the relational database—in particular, the ACID transactional model.
It turns out that if you want a highly available online application with global scope, you can’t also support immediately consistent transactions; the best you can aspire to is eventual consistency. This trade-off between consistency, availability, and partition tolerance was formalized in Brewer’s CAP theorem. For instance, if there is a break in the network between Australia and North America, you simply can’t keep processing orders from both countries unless you are prepared to allow each country to see a slightly different view of the database.
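Eventual consistency can be made concrete with a toy simulation. The sketch below (not any real database's protocol) models two replicas that accept writes independently during a network partition, then converge via a simple last-writer-wins merge once the partition heals; the timestamps stand in for whatever versioning scheme a real system uses:

```python
# A toy replica: each key maps to (value, timestamp).
class Replica:
    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        self.data[key] = (value, ts)

    def read(self, key):
        return self.data[key][0]

    def merge(self, other):
        # Last-writer-wins: keep whichever entry has the newer timestamp.
        for key, (value, ts) in other.data.items():
            if key not in self.data or ts > self.data[key][1]:
                self.data[key] = (value, ts)

australia, america = Replica(), Replica()

# During the partition, each region accepts its own writes and the
# replicas diverge: the two regions see different stock levels.
australia.write("stock", 5, ts=1)
america.write("stock", 3, ts=2)

# After the partition heals, merging in both directions converges
# every replica on the most recent write.
australia.merge(america)
america.merge(australia)
```

Real systems such as Dynamo use more sophisticated mechanisms (vector clocks, read repair), but the essential bargain is the same: stay available during the partition, reconcile afterward.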
To solve this dilemma, Amazon created a new type of eventually consistent database—Dynamo—which became the basis for its own DynamoDB as well as for several popular NoSQL databases such as Cassandra and Riak.
A final important category of NoSQL is the graph database. This type of database is called for when complex networks of relationships between objects or entities are the primary focus, though graph processing capabilities are increasingly found in mainstream database systems as well. Graph structures are familiar to us from social networks such as Facebook. In Facebook, the information about individuals is important, but it’s the network between people, the social graph, that provides the unique power of the platform.
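The kind of relationship-centric query a graph database optimizes for can be sketched with an adjacency list and a breadth-first search. The names and connections below are invented for illustration:

```python
from collections import deque

# A tiny social graph as an adjacency list: each person maps to
# the list of people they are connected to.
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave":  ["bob", "carol", "erin"],
    "erin":  ["dave"],
}

# Breadth-first search for the shortest chain of connections between
# two people -- the "degrees of separation" query.
def shortest_path(graph, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
</```

Expressing this traversal as a relational self-join becomes painful as the number of hops grows; graph databases make such multi-hop queries first-class operations.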
Some architects of database systems were reluctant to “throw the baby out with the bathwater” by completely disposing of the relational model or full SQL compatibility. Consequently, while no new databases that faithfully implement the RDBMS pattern have emerged over the past 7 years, some (often grouped under the “NewSQL” label) have retained core relational features while significantly modifying fundamental aspects of the architecture. For instance, columnar databases, such as Vertica and SAP Sybase IQ, store the data for each column together on disk, as opposed to the traditional row-oriented storage. In-memory databases such as HANA and Oracle TimesTen store all data in memory, guaranteeing fast access times, albeit at a much higher cost per gigabyte.
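The row-versus-column storage distinction can be shown in a few lines of Python. This is a conceptual sketch only (real columnar engines add compression, vectorized execution, and much more); the table and column names are invented:

```python
# The same toy table in two layouts.

# Row-oriented layout: one record per row, all columns together.
rows = [
    {"order_id": 1, "region": "east", "amount": 25.0},
    {"order_id": 2, "region": "west", "amount": 15.0},
    {"order_id": 3, "region": "east", "amount": 40.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

# Aggregating one column from the row layout touches every full record...
total_from_rows = sum(r["amount"] for r in rows)

# ...while the columnar layout reads just the one column it needs.
total_from_columns = sum(columns["amount"])
```

Analytic queries typically scan a few columns across many rows, which is why the columnar layout can dramatically reduce I/O for warehouse workloads while offering no advantage for single-record transactional lookups.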
NewSQL seems to be diminishing as a distinct category—most of the NewSQL innovations have found their way into traditional RDBMS systems. For instance, the latest release of Oracle Database includes columnar and in-memory technologies.
The State of Play
Some of the new database vendors like to position their technology as offering a complete replacement for the venerable RDBMS and suitable for all workloads. However, in reality, no single database technology currently offers an optimal solution for all workloads, and each presents significant compromises. An RDBMS may be superior in terms of query access and transactional capability but fail to deliver the network partition tolerance required for a global application. A database such as Cassandra may deliver the best cross-data center availability but fail to integrate with BI systems. MongoDB might support the most agile development lifecycle but require more effort to massively scale.
As a result, the current best practice is to combine multiple database technologies in an enterprise data architecture. The key elements of such an architecture include:
- Hadoop and Spark, a platform within which masses of semi-structured and unstructured data can be economically stored and analyzed
- Non-relational operational databases, such as Cassandra and MongoDB, providing a platform for web applications that can support global scale and continuous availability and which can rapidly evolve new features
- In-memory systems, columnar databases, and other “NewSQL” solutions, such as HANA, supplementing the traditional data warehouse where extremely low latency is required
- RDBMS systems, still providing a backbone of database technology for traditional ERP/CRM systems and for the standard star-schema data warehouse
The figure below illustrates this modern heterogeneous enterprise data architecture.
Clouds in the Forecast
Although the modern non-relational distributed architectures which inspired databases such as Cassandra and Hadoop were established by early pioneering cloud companies such as Amazon and Google, the movement of database workloads to the cloud has to date been relatively minor. Databases follow their application workloads into the cloud but rarely move to the cloud on their own. Issues of security and latency generally dictate that a database should be located close to its application servers.
Nevertheless, the movement of database workloads into the cloud is accelerating; an increasing amount of application workload is migrating to the cloud, and there is an increasing level of comfort with the idea of database as a service (DBaaS). The architecture of cloud database systems is currently much the same as that of on-premises systems. Only a few databases, DynamoDB for instance, are available exclusively as a cloud service.
However, traditional RDBMS systems have limited ability to run as clusters and cannot generally take advantage of the elastic resource provisioning made possible by the cloud. As cloud database deployments increase, it’s likely that there will be further pressure on the RDBMS.
The many compromises demanded by our current plethora of database technologies make selecting a database system far harder than it ought to be. Whichever DBMS you choose, you are probably going to experience a “not-quite-right” solution. Therefore, what we will see more than anything else over the next few years in the database marketplace is consolidation.
There are clear signs of this already: Virtually all database systems now support JSON in a native format and almost all systems—even those once called NoSQL—support the SQL query language. Columnar, in-memory, and graph technologies are now found across an increasingly broad range of database systems. As this convergence increases, we are likely to see more and more “multi-model” and “hybrid” database architectures. However, this convergence is probably going to play out over a very long period of time, and it’s hard to envisage an immediate future in which “one size fits all.”
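The JSON-inside-SQL convergence is easy to demonstrate. The sketch below assumes a Python build whose bundled SQLite includes the JSON1 functions (true of most recent builds); the table and documents are invented for illustration:

```python
import sqlite3

# JSON documents stored in an ordinary relational column, yet still
# queryable with SQL via SQLite's json_extract() function -- a small
# taste of the "multi-model" convergence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT)")
conn.executemany(
    "INSERT INTO users (doc) VALUES (?)",
    [('{"name": "alice", "tags": ["admin"]}',),
     ('{"name": "bob", "tags": []}',)],
)

# Pull a field out of each document with a JSON path expression.
names = conn.execute(
    "SELECT json_extract(doc, '$.name') FROM users ORDER BY id"
).fetchall()
```

The same pattern, document flexibility on the write path and SQL on the read path, now appears in some form in virtually every major database system.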