SQL in NoSQL Databases

The term “NoSQL” is widely acknowledged as an unfortunate and inaccurate tag for the non-relational databases that have emerged in the past five years. The databases that are associated with the NoSQL label have a wide variety of characteristics, but most reject the strict transactions and stringent relational model that are explicitly part of the relational design.    The ACID (Atomic-Consistent-Independent-Durable) transactions of the relational model make it virtually impossible to scale across data centers while maintaining high availability, and the fixed schemas defined by the relational model are often inappropriate in today’s world of unstructured and rapidly mutating data.

The Term "NoSQL" is Catchy but Misleading

SQL is a relatively innocent bystander, but, because SQL is a universal feature of the RDBMS, the term “NoSQL” seemed a catchy and convenient tag for these new breed of databases.  And, it’s true that, at least initially, these new databases shunned SQL as a query language, most notably because the SQL language implies support for the relational model through join and other operations.  Nevertheless, to avoid the misconception that NoSQL requires the absence of SQL, some in the non-relational community started describing NoSQL as “Not Only SQL” rather than “NO SQL.”

Yet, as non-relational database technology has evolved, many systems are finding support for some variation on SQL convenient, if not mandatory.  Hadoop is a notable case in point:  The support for SQL syntax in Hive opens up that system to a plethora of business intelligence tools.

Escaping the Limitations of a Fixed Schema

For many other NoSQL systems, support for SQL is rendered moot by their schema-free design.  Schemas essentially restrict and define the types of records that can be stored in a database.  For many developers, escaping the limitations of a fixed schema is a powerful motivation to use a non-relational system.   Key-value databases such as Riak treat each value as a binary object into which the developer may place any data whatsoever.   Document-oriented data stores such as MongoDB do require that data be stored in a specific format:  self-describing JSON-like documents (in the case of MongoDB).  But, providing that the data is properly formatted, the document database itself implies no particular data model.

The Google BigTable Model

Of all the NoSQL databases, those that descend from the Google BigTable model come closest to defining some form of schema.   BigTable formed the basis for many traditional Google applications and inspired many NoSQL databases such as HBase and Cassandra.   In BigTable databases, each “table” contains one of more column families. The contents of each column family are not enforced by BigTable, and many contain very large numbers of columns that vary from row to row.  Essentially, each column family may be thought of as a list of name-value pairs.

Cassandra adopted the column family design, but initially supplemented it with supercolumns, which essentially allow repeating groups of column families.   Column families and supercolumns allowed Cassandra data models to emulate many of the data models found in relational systems.    For instance, a Cassandra customer column family might include supercolumns containing all the orders for the customer. 

The syntax for working with Cassandra supercolumns was complex, and, in many respects, made operations that were logically fairly straightforward tedious and error-prone.  So, the Cassandra developers introduced Cassandra Query Language (CQL), which layers a familiar SQL language syntax over these complex operations. 

Using CQL, developers can create Cassandra tables using familiar SQL idioms.  Although relational concepts such as foreign keys and joins are not supported, secondary indexing and composite keys allow for many of the same operations.

The SQL language undoubtedly carries with it a lot of relational baggage, but, as the Cassandra folks have shown, there may be cases where SQL-like syntax provides a more productive interface, even for “NoSQL” databases.