For many years now, Cassandra has been renowned for its ability to handle massive scaling and global availability. Based on Amazon’s Dynamo, Cassandra implements a masterless architecture which allows database transactions to continue even when the database is subjected to massive network or data center disruption. Even in the circumstance in which two geographically separate data centers are completely isolated through a network outage, a Cassandra database may continue to operate in both geographies, reconciling conflicting transactions—albeit possibly imperfectly—when the outage is resolved.
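The reconciliation described above can be pictured as a per-column "last write wins" merge based on write timestamps, which is the general approach Cassandra takes. The Python sketch below is only an illustration under simplifying assumptions: the `merge_replicas` function, the data layout, and the sample values are hypothetical, and tie-breaking between equal timestamps is left out. It also hints at why the result can be imperfect, since the older of two conflicting updates is simply discarded.

```python
# Hypothetical illustration of last-write-wins reconciliation; not Cassandra's code.
# Each replica maps a column name to a (timestamp, value) pair.

def merge_replicas(a, b):
    """Merge two replica views of one row, keeping the newest write per column."""
    merged = dict(a)
    for column, (ts, value) in b.items():
        # Keep the incoming cell only if it is newer than the one already held.
        # (Handling of equal timestamps is omitted in this sketch.)
        if column not in merged or ts > merged[column][0]:
            merged[column] = (ts, value)
    return merged


# Two data centers accept conflicting writes while cut off from each other.
us_east = {"email": (100, "ann@example.com"), "city": (105, "Boston")}
eu_west = {"email": (110, "ann@example.org"), "phone": (102, "555-0100")}

row = merge_replicas(us_east, eu_west)
print({column: value for column, (ts, value) in row.items()})
```

In this example the later `email` write replaces the earlier one, while `city` and `phone` survive because each was written in only one data center.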
As with many significant open source technologies, Cassandra has a commercial sponsor. DataStax is the primary supporter of open source Cassandra and also offers a commercial distribution, DataStax Enterprise (DSE). In a recent Forrester report, DataStax was identified as one of the nine leaders in “Big Data NoSQL.”
The underlying data model of Cassandra is based on the Google BigTable wide-column model, which is powerful but notoriously hard to program. However, for some time, Cassandra has provided a SQL-like query language called Cassandra Query Language (CQL). CQL abstracts the underlying wide-column structure in favor of a more traditional tabular format. Although CQL does not provide a join capability, an upcoming release will support GROUP BY.
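To make the tabular abstraction concrete, the Python sketch below models one wide-column partition as a map whose keys combine a clustering value with a column name, then unpacks it into the flat rows a CQL user would expect to see. The layout, the `to_rows` helper, and the sensor data are assumptions chosen for illustration; they simplify how Cassandra actually stores data.

```python
from collections import defaultdict

# Hypothetical wide row: partition key -> {(clustering value, column name): value}
wide_partition = {
    "sensor-1": {
        ("2016-01-01", "temp"): 21.5,
        ("2016-01-01", "humidity"): 40,
        ("2016-01-02", "temp"): 19.0,
    }
}


def to_rows(partitions):
    """Flatten wide-column cells into table-like rows (one dict per row)."""
    rows = []
    for partition_key, cells in partitions.items():
        grouped = defaultdict(dict)
        # Cells sharing a clustering value belong to the same logical row.
        for (clustering, column), value in sorted(cells.items()):
            grouped[clustering][column] = value
        for clustering, columns in grouped.items():
            rows.append({"sensor": partition_key, "day": clustering, **columns})
    return rows


for row in to_rows(wide_partition):
    print(row)
```

Each printed dictionary corresponds to one row of the table-style view, even though the underlying partition is a single wide structure.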
While traditional database models are well tuned for dealing with “things,” graph databases excel when the relationships between things are of equal or greater significance; social networks are a familiar example. In 2015, DataStax acquired Aurelius, the maker of the TitanDB graph database, and has subsequently incorporated a graph capability into the DataStax Enterprise product.
Native graph databases such as Neo4j implement the graph relationships directly in the on-disk storage, which allows rapid traversal of a network without relying on index lookups. This is called index-free adjacency. In contrast, traditional databases often layer a graph compute engine over their existing storage formats, which is fine when expanding a graph in batch mode but often falls short for real-time operations.
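The difference between the two approaches can be sketched in a few lines of Python. This is a toy in-memory model, not how Neo4j or any other product is implemented: the `edges` table, the `Vertex` class, and both functions are hypothetical names introduced for illustration. The first function searches a shared edge table on every hop, while the second only follows references held by each vertex.

```python
# Hypothetical contrast between index-based and index-free traversal.

# Index-based: edges live in one global table, and each hop searches it.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "dave")]


def neighbors_via_index(name):
    # Stand-in for an index lookup against a global edge table.
    return [dst for src, dst in edges if src == name]


# Index-free adjacency: every vertex holds direct references to its neighbors.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to adjacent Vertex objects


def friends_of_friends(vertex):
    """Two-hop traversal that only follows references, with no lookup per hop."""
    return sorted({fof.name for friend in vertex.out for fof in friend.out})


alice, bob, carol, dave = (Vertex(n) for n in ("alice", "bob", "carol", "dave"))
alice.out = [bob, dave]
bob.out = [carol]

print(neighbors_via_index("alice"))
print(friends_of_friends(alice))
```

With the reference-based layout, the cost of each hop depends on the number of neighbors of the current vertex rather than on the total size of the graph, which is the property that matters for real-time traversals.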
The graph engine in Cassandra cleverly implements an index-free graph adjacency model on top of the existing Cassandra wide-column model, allowing the graph engine to benefit from Cassandra scalability and clustering capabilities while preserving fast graph traversals. The DSE graph capability is exposed through the open source Gremlin language. Neo4j uses an alternative language called Cypher, which has recently been open sourced. Gremlin and Cypher are likely to compete over the next few years to become the “SQL of graph.”
Cassandra’s JSON support layers JSON documents on top of the Cassandra wide-column data model. Top-level JSON attributes are mapped directly to Cassandra columns, while nested items are implemented as Map and List structures within other columns. Unlike most other JSON implementations, the Cassandra approach requires the JSON schema to be defined in a CREATE TABLE statement, which is consistent with Cassandra’s underlying approach but somewhat at odds with the flexible schema philosophy of many NoSQL advocates.
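The mapping described above can be approximated with a short Python sketch. It is an assumption-laden illustration rather than Cassandra's behavior in detail: the `SCHEMA` dictionary stands in for the columns a CREATE TABLE statement would declare, and `json_to_columns` is a hypothetical helper. Top-level attributes become columns, nested objects and arrays are kept as map-like and list-like values, and any attribute that was not declared in advance is rejected.

```python
import json

# Hypothetical schema standing in for the columns declared in CREATE TABLE.
SCHEMA = {"id": int, "name": str, "tags": list, "address": dict}


def json_to_columns(document):
    """Map top-level JSON attributes to columns, rejecting undeclared ones."""
    record = json.loads(document)
    columns = {}
    for attribute, value in record.items():
        if attribute not in SCHEMA:
            raise KeyError(f"no column declared for attribute {attribute!r}")
        if not isinstance(value, SCHEMA[attribute]):
            raise TypeError(f"{attribute!r} should be {SCHEMA[attribute].__name__}")
        # Nested objects stay as dicts (maps) and arrays stay as lists.
        columns[attribute] = value
    return columns


doc = '{"id": 1, "name": "Ann", "tags": ["admin"], "address": {"city": "Boston"}}'
print(json_to_columns(doc))
```

Passing a document with an attribute missing from `SCHEMA` raises an error, which mirrors the schema-first trade-off noted above.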
The DSE version of Cassandra offers a number of other significant advantages over open source Cassandra, including integration with Spark for analytics, Solr for text search, and operational and development tool suites.
Cassandra is virtually unmatched in its ability to scale transactional activity globally. DataStax continues to actively enhance both the open core and commercial faces of Cassandra, and it’s certain that Cassandra will remain a leader in the NoSQL market for the foreseeable future.