It’s commonly asserted—and generally accepted—that the era of the “one-size-fits-all” database is over. We expect that enterprises will use a combination of database technologies to meet the distinct needs created by various application architectures.
For instance, most organizations still maintain a star-schema relational database for real-time data warehousing, but an increasing number also implement Hadoop, Spark, and related technologies as a big “data lake,” holding large volumes of unschematized, fine-grained data. “NewSQL” systems such as HANA are increasingly used for very high-speed analytics, and modern web applications are just as likely to use a NoSQL solution such as MongoDB or Cassandra as a traditional RDBMS such as MySQL or Oracle.
Even inside unified ecosystems there are “horses for courses”—technologies that have similar capabilities but are optimized for specific workloads. The Hadoop Distributed File System (HDFS) and HBase represent such alternatives within the Hadoop ecosystem. Data stored directly on HDFS can be accessed by MapReduce-style programs or through the SQL interface provided by Hive. This configuration is optimized for bulk processing (full scans of entire “tables”), but it offers no random access to individual records, and updating a single item requires rewriting the entire file.
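To make that tradeoff concrete, here is a toy Python sketch using plain local files as a stand-in for HDFS (all function names are hypothetical, chosen only for illustration): appending and scanning are cheap sequential operations, but changing one record forces a rewrite of the whole file.

```python
import os
import tempfile

def append_record(path, record):
    # Append-only write, the pattern HDFS supports natively.
    with open(path, "a") as f:
        f.write(record + "\n")

def full_scan(path):
    # Sequential read of every record -- the access pattern bulk jobs optimize for.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def update_record(path, index, new_record):
    # No in-place random writes: read everything, change one row, rewrite it all.
    records = full_scan(path)
    records[index] = new_record
    with open(path, "w") as f:
        f.write("\n".join(records) + "\n")

path = os.path.join(tempfile.mkdtemp(), "table.txt")
for r in ["row-0", "row-1", "row-2"]:
    append_record(path, r)
update_record(path, 1, "row-1-updated")
print(full_scan(path))  # ['row-0', 'row-1-updated', 'row-2']
```

The cost of `update_record` grows with the size of the whole file, which is why this storage model suits full scans far better than row-level updates.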
HBase is a database built on top of the Hadoop file system. It provides row-level read and write access and is optimized for high-throughput OLTP-style workloads. Although the underlying Hadoop file system is append-only, HBase uses a log-structured merge (LSM) tree architecture to support Create, Read, Update, Delete (CRUD) operations at the row level. Changes to any particular row are written as log entries, much as a relational database writes to a transaction log. Periodically these logs are compacted, and in-memory representations of the most recent changes minimize the I/O involved in retrieving a specific row.
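The LSM pattern described above can be sketched in a few lines of Python. This is a deliberately minimal illustration of the general technique, not HBase’s actual implementation: writes land in an in-memory table that is periodically flushed as an immutable sorted run, reads consult the memtable and then the runs newest-first, and compaction merges the runs.

```python
class TinyLSM:
    """Minimal LSM-tree sketch: memtable + immutable sorted runs + compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # most recent, unflushed changes
        self.runs = []       # flushed sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        # Inserts and updates are both just new log entries.
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush the memtable as an immutable sorted run.
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Check the in-memory changes first, then runs newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            if key in run:
                return run[key]
        return None

    def compact(self):
        # Merge all runs, keeping only the latest value for each key.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [merged]

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # flushes a first run
db.put("a", 9); db.put("c", 3)   # flushes a second, newer run
db.compact()
print(db.get("a"))  # 9 -- the newer value survives compaction
```

Note what a full scan would require here even after compaction: merging the memtable with the runs to reconstitute the current state of every row, which is exactly the weakness discussed next.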
Although this HBase architecture is well optimized for individual row access, it serves poorly when an application needs to process all the rows in a table: scanning a complete table typically involves reading multiple logs and reconstituting a view of the entire table.
Of course, many applications need to provide good performance both for row-level operations and for scans. To support such applications, Cloudera recently announced the Apache Kudu project. In zoology, kudu refers to two species of African antelope: not necessarily the world’s fastest animals, but robust ones capable of living in diverse ecosystems.
Apache Kudu attempts to bridge the performance divide between HDFS and HBase. It is not as fast as HDFS for scans, nor as fast as HBase for OLTP workloads, but it offers acceptable performance for both scan and CRUD operations.
Kudu takes advantage of the larger memory sizes available on modern hardware platforms and the availability of solid-state drives for persistent storage. It uses columnar storage to optimize analytic-style aggregations and to achieve high compression levels. Its data model is simpler than HBase’s, consisting of a fixed set of typed columns rather than HBase’s wide, dynamic columns. The underlying storage uses a modified version of the LSM tree found in HBase and many other NoSQL systems.
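A short Python sketch shows why columnar layout favors analytics (this illustrates the general idea, not Kudu’s actual on-disk format): an aggregate touches only one contiguous column rather than every row, and a low-cardinality column compresses well with a simple scheme such as run-length encoding.

```python
# Row-oriented data, as an OLTP system might see it.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "east", "amount": 20},
    {"id": 3, "region": "west", "amount": 30},
]

# Pivot the row storage into column storage.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    # Collapse consecutive repeats into (value, count) pairs.
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

total = sum(columns["amount"])                    # reads only one column
encoded = run_length_encode(columns["region"])    # compresses repeats
print(total)    # 60
print(encoded)  # [('east', 2), ('west', 1)]
```

In a row store, computing the same sum would drag every column of every row through memory; the columnar pivot is what makes scan-heavy aggregation cheap.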
Although Kudu does not include its own SQL implementation, it can be integrated with Cloudera’s Impala SQL layer, providing a limited ability to run traditional SQL-based analytics.
Kudu is one of several recent database systems that reflect an emerging trend of consolidation within the increasingly crowded and diverse database technology marketplace. When the NoSQL boom began about 7 years ago, each database was tailored to a very specific workload. Increasingly, we are seeing database vendors move toward more general-purpose architectures. Kudu may or may not gain traction against alternatives such as HBase, Cassandra, and MongoDB, but I do expect the consolidation trend to continue.