Apache Drill Brings SQL to NoSQL

It's been amusing to watch the NoSQL movement transition from a “We don’t need no stinking SQL” attitude to a “Can I please have some SQL with that?” philosophy. The nonrelational databases that emerged over the past 8 years initially offered no SQL capabilities. However, today we have an embarrassment of SQL options for “NoSQL.” Hive offers SQL for Hadoop systems, Spark has SparkSQL, MongoDB has a SQL-based BI connector, and so on.

However, in the modern enterprise with multiple heterogeneous database systems—some relational, some not—these SQL interfaces offer at best only a partial solution. What would be ideal is a SQL interface that could span the multiple disparate database systems typically in use: one that could, for instance, join data from Hadoop to data in Oracle.

The Apache Drill framework aims to provide just such a SQL engine. Drill can operate across multiple distributed data stores such as HDFS or Amazon S3, relational databases that support JDBC or ODBC, as well as NoSQL systems such as MongoDB and HBase. Drill’s architecture is based on Google’s Dremel system, which provides the foundation for the Google BigQuery product.

Drill incorporates a distributed heterogeneous cost-based optimizer which can intelligently distribute data access algorithms across multiple disparate systems. This allows a SQL query to span Hadoop, MongoDB, Oracle, or other databases and—at least in theory—to do so in an efficient and optimal manner.

Currently, Drill can query data from relational systems that have a JDBC or ODBC connector, from systems that are supported by Hive, from a variety of cloud-based distributed file systems (Amazon S3, Google Cloud Drive) and from MongoDB. One of the most important features of Drill is its ability to represent nonrelational data in a tabular format. This is obviously required in order to integrate data from systems such as MongoDB and HBase, where the tabular data model cannot directly represent the underlying JSON or wide column structures.

For instance, when querying data from MongoDB, array notation allows Drill to access individual elements of a JSON array. The FLATTEN function returns one row for every document in an embedded array. These two capabilities allow Drill to represent MongoDB structures in a tabular format.

Drill can also navigate wide column store structures in HBase. As with MongoDB, the first step is to FLATTEN the wide column families so that each element in the column family is returned as a new row. Another function—KVGEN—allows the resulting map to be converted to columns.

Drill’s ability to map into nonrelational systems such as MongoDB and HBase would be interesting in its own right, but these capabilities become really useful when working across multiple systems. Drill SQL can reference data in any number of back-end systems. So we might join data from HBase with data in Oracle, or data in Amazon S3 flat files with data in MongoDB.

Furthermore the Drill engine is inherently scalable. The Drill processing engine can be deployed on massive clusters if necessary, with SQL processing distributed across all the members of the cluster.

Finally, Drill integrates with third-party BI tools such as Tableau and Qlik, allowing these tools to benefit from its ability to access nonrelational systems and to efficiently join across multiple systems.

Drill is far from perfect. Mapping into nonrelational systems such as MongoDB or HBase requires some complex and nonobvious SQL syntactical gymnastics, and some very complex JSON structures are currently beyond Drill’s capabilities. Also, not all back ends can take advantage of the parallelism provided by Drill.

Nevertheless, Drill shows an enormous amount of promise. A single SQL framework capable of navigating the variety of data structures presented by relational and nonrelational systems is just what we need to resolve the Tower of Babel presented by the vast array of languages and interfaces brought about by the next-generation databases of today and the future.