An Overview of Cassandra


In Greek mythology, Cassandra was granted the gift of prophecy but cursed with an inability to convince others of her predictions - a sort of unbelievable "oracle," if you like. Ironically, in the database world, the Cassandra system is fast becoming one of the most credible non-relational databases for production use - a believable alternative to Oracle and other relational databases.

Cassandra originated at Facebook, where it was used to optimize the Facebook inbox search facility.   Facebook open sourced the project in mid-2008, and it is now a top-level Apache project.  A variety of companies are using and contributing to Cassandra, including Digg, Twitter and Rackspace. 

Architecturally, Cassandra borrows features from both Amazon's Dynamo key-value store and Google's BigTable column store. Dynamo is critical to Amazon's shopping cart implementation, and BigTable underlies many Google systems, including Maps and Reader. While neither is open source, the designs of both have been shared publicly.

Like many NoSQL databases, Cassandra can implement eventual consistency rather than the strict or immediate consistency implemented by relational databases. The Dynamo design, however, allows Cassandra to provide configurable consistency: depending on the settings, Cassandra may exhibit weak, eventual or strict consistency semantics. Even so, like most NoSQL databases, Cassandra cannot support multi-object transactions: consistency is maintained, at best, only for a single data item.
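To make the idea of configurable consistency concrete, here is a minimal sketch of the quorum arithmetic used by Dynamo-style systems. It assumes N replicas per data item, a write acknowledged by W of them and a read that consults R of them; whenever R + W > N, every read must touch at least one replica holding the latest write. The function names are illustrative and are not part of any Cassandra API; the level names ONE, QUORUM and ALL mirror Cassandra's consistency levels.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """True when any r replicas read must include one of the w replicas written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


def describe(n: int) -> list:
    """Label every write/read level pairing for a replication factor of n."""
    levels = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}
    rows = []
    for w_name, w in levels.items():
        for r_name, r in levels.items():
            label = "strict" if read_overlaps_write(n, r, w) else "eventual"
            rows.append((w_name, r_name, label))
    return rows


# Illustrative run with three replicas (a hypothetical replication factor).
for w_name, r_name, label in describe(3):
    print(f"write={w_name:<6} read={r_name:<6} -> {label}")
```

With three replicas, writing and reading at QUORUM gives strict behavior, while writing and reading at ONE leaves a window in which a read can return stale data. The trade-off is latency and availability: the stricter pairings need more replicas to respond before an operation completes.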

Google's BigTable model is the basis for Cassandra's data model. As in BigTable, Cassandra data is organized into columns and column families. Simple Column Families in Cassandra can resemble familiar RDBMS tables; however, when Super Columns are involved, Column Families have complex internal structures, including master-detail relationships that would require multiple tables in a relational database.
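One way to picture these structures is as nested maps. The sketch below is only a mental model built from plain Python dictionaries, not Cassandra client code; the column family names, row keys and column names are all hypothetical. A simple Column Family maps a row key to named columns, while a Column Family of Super Columns adds one more level, which is where the master-detail relationship lives.

```python
# Simple Column Family: row key -> {column name: value}.
# Rows need not share the same set of columns.
users = {
    "jsmith": {"name": "John Smith", "email": "jsmith@example.com"},
    "mjones": {"name": "Mary Jones", "city": "Austin"},
}

# Column Family with Super Columns:
# row key -> {super column name: {column name: value}}.
# Each customer row (master) holds its orders (detail) in one structure.
orders_by_customer = {
    "jsmith": {
        "order-1001": {"item": "book", "quantity": "1"},
        "order-1002": {"item": "lamp", "quantity": "2"},
    },
}


def get_column(column_family: dict, row_key: str, column: str) -> str:
    """Key lookup followed by a column lookup, as in a simple Column Family."""
    return column_family[row_key][column]


def get_subcolumn(column_family: dict, row_key: str,
                  super_column: str, column: str) -> str:
    """Same idea with one extra level for the Super Column."""
    return column_family[row_key][super_column][column]


print(get_column(users, "jsmith", "email"))
print(get_subcolumn(orders_by_customer, "jsmith", "order-1002", "item"))
```

In a relational schema the second structure would typically be split into a customers table and an orders table joined on a key; here a single key lookup returns the customer's orders together.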

Both the Dynamo and BigTable architectures promote unlimited and transparent scale-out as the database grows. Cassandra is particularly suited to multi-datacenter, geographically dispersed databases, and the design is expressly intended to support write-intensive applications.

The current version of Cassandra supports only a single key index - secondary indexes are planned for version 0.7, which is currently in beta. Some Cassandra applications work around this absence by maintaining their own indexes, implemented as Column Families. The more common pattern is to carefully construct the Cassandra data model in anticipation of the queries that will be generated - ensuring that all time-critical queries can be satisfied by a key lookup. Both approaches have drawbacks; in particular, each requires the application to maintain redundant copies of data in multiple formats. The possibility of data inconsistency exists, especially because of the inability to apply atomic transactions to multiple column families.
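The application-maintained index pattern can be sketched as follows. This is a simplified in-memory model using Python dictionaries in place of Column Families, not Cassandra client code; the class, its methods and the "city" lookup are hypothetical names chosen for illustration. The point to notice is that every save performs two separate writes, and nothing makes them atomic.

```python
class UserStore:
    """Toy model of a data Column Family plus a hand-built index Column Family."""

    def __init__(self):
        self.users = {}          # row key: user id -> user columns
        self.users_by_city = {}  # row key: city -> {user id: name}

    def save_user(self, user_id: str, name: str, city: str) -> None:
        previous = self.users.get(user_id)

        # Write 1: the primary copy of the data.
        self.users[user_id] = {"name": name, "city": city}

        # Write 2: the redundant index copy. A failure between the two
        # writes would leave the index disagreeing with the data.
        if previous is not None and previous["city"] != city:
            self.users_by_city.get(previous["city"], {}).pop(user_id, None)
        self.users_by_city.setdefault(city, {})[user_id] = name

    def find_by_city(self, city: str) -> list:
        """Answer the query with a single key lookup on the index."""
        return sorted(self.users_by_city.get(city, {}))
```

A query such as "all users in a given city" becomes one key lookup on the index, which is the goal of the pattern. The cost is the second copy of the data: the application, not the database, is responsible for keeping the two structures in step.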

Cassandra pulled ahead of the NoSQL pack over the past year as prominent companies - including Digg, Twitter and Rackspace - announced intentions to deploy it as a critical part of their infrastructure. Although Twitter later deferred plans to fully deprecate MySQL in favor of Cassandra, the open source Cassandra remains one of the most prominently deployed non-relational databases.

For a NoSQL database to cross the chasm to enterprise adoption, it needs both technical maturity and corporate support. Technical maturity inevitably increases as database deployments grow. Corporate support requires a commercial entity willing to stand behind the product and provide certified distributions, consulting and support services. Earlier this year, well-known Cassandra technical guru Jonathan Ellis teamed with a fellow Rackspace employee to form Riptano - a company that aims to expedite and support Cassandra enterprise adoption.

Cassandra has technical credibility, some serious production deployments and an increasingly active community.   It's well placed to become a serious DBMS for massively scalable web and enterprise applications.

