Cassandra and Hadoop - Strange Bedfellows or a Match Made in Heaven?

Hadoop is a leading solution for nonrelational data storage and processing. It is based on the key Google technologies for storing and processing huge data sets distributed across large clusters of commodity computers. Hadoop has shown rapid adoption in the enterprise, and is undoubtedly the leading technology for "big data" processing.

The foundation of the Hadoop system is the Hadoop Distributed Filesystem (HDFS), which can store massive, distributed, unstructured data sets. Data can be stored directly in HDFS, or it can be stored in a semi-structured format in HBase, which allows rapid record-level data access and is modeled after Google's BigTable system.

Cassandra is another nonrelational system that uses the BigTable data model, but employs Amazon's Dynamo scheme for data distribution and clustering.

Until now, Cassandra has pursued somewhat different solutions than Hadoop has. Cassandra excels at high-volume, real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions. However, DataStax (the commercial face of Cassandra) recently announced "Brisk," which merges the two technologies to provide a Hadoop distribution with superior availability and real-time capabilities. Brisk uses the Cassandra database to replace both Hadoop's HDFS filesystem and the HBase database.

Both HBase and Cassandra can deal with large data sets, and both provide high transaction rates and low-latency lookups. Both allow map-reduce processing to be run against the database when aggregation or parallel processing is required. Why, then, would a merger of Cassandra and Hadoop be a superior solution?
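To make the map-reduce idea above concrete, here is a toy sketch of the paradigm in plain Python. It is not the Hadoop, HBase, or Cassandra API; the rows and the "status" field are invented for illustration. It shows the three phases (map, shuffle, reduce) that a real Hadoop job would distribute across a cluster when aggregating over records stored in either database.

```python
from collections import defaultdict

# Toy single-process illustration of the map-reduce paradigm
# (not the Hadoop API). The rows and "status" field are invented;
# a real job would scan records out of HBase or Cassandra.

def map_phase(rows):
    # Map: emit one (key, 1) pair per record.
    for row in rows:
        yield (row["status"], 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values for each key.
    return {key: sum(values) for key, values in groups.items()}

rows = [{"status": "ok"}, {"status": "error"}, {"status": "ok"}]
counts = reduce_phase(shuffle(map_phase(rows)))
# counts == {"ok": 2, "error": 1}
```

In a real cluster the map tasks run in parallel near the data, and the shuffle moves intermediate pairs across the network, but the logical flow is the same as in this sketch.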

Answering that question risks igniting a flame war: HDFS, HBase, and Cassandra have strong, smart advocates, and all have benefited from some very serious intellectual effort. Cassandra fans will generally claim that the Dynamo model of data clustering is inherently more scalable and reliable; Cassandra has no master node and, hence, no single point of failure. Some benchmarks show Cassandra achieving higher throughput than HBase (though benchmarks showing the reverse exist as well). Differences in locking mechanisms and in failure handling (whether writes are blocked or redistributed when a node fails) also differentiate the two.

HBase advocates will point out that HBase is more tightly integrated with the Hadoop core system, and is evolving rapidly. Many of the scalability and reliability issues associated with earlier releases of HBase have been eliminated, and HBase is fully supported by Cloudera, the dominant commercial face of Hadoop.

To be fair, HBase and Cassandra share similar features and design goals. Both have an extensible data model loosely based on Google's BigTable, and both aim for very high transaction rates on huge data sets. Both have had notable successes supporting seriously large applications, and both are advancing their technologies at an impressive rate. When used in conjunction with Hadoop and tools such as Hive (a SQL-like interface to Hadoop), both can provide transactional and massively parallel analytics for big data applications.

The relative success of Brisk will, of course, depend less on theoretical technical benefits and more on mindshare, success stories, and the willingness of users to adopt a hybrid solution; but it's encouraging to see these kinds of experiments in the Hadoop ecosystem. A number of other vendors are emerging with replacements for, or enhancements to, the Hadoop stack: MapR is another company planning to market a better HDFS, and companies like Cascading and Pervasive offer higher-level abstractions for running workloads on top of a Hadoop cluster.

Regardless of how Brisk and these other experiments play out, they testify to the incredible energy and innovation being produced by the Hadoop community. This pace of innovation bodes well for the future of Hadoop.