Google’s Newly Percolated Big Data Technologies

Google is the pioneer of big data.  Technologies such as Google File System (GFS), BigTable and MapReduce formed the basis for open source Hadoop, which, more than any other technology, has brought big data within reach of the modern enterprise.

Google devised these technologies initially to perform web indexing.  Crawlers would discover changes to web pages, which would be loaded into GFS and periodically indexed using long running MapReduce jobs.

This batch approach to web indexing worked well for Google and drove a lot of its web search advantage.  But the delay between a website changing and that change appearing in search results can be too long for ever more demanding consumers.  Increasingly, we expect changes to be reflected in real time.

To solve this web search latency problem, Google implemented the Caffeine system.  Caffeine, built on an indexing scheme known as Percolator, allows web indexes to be rebuilt incrementally.  Rather than rescanning all crawled pages every time an index is built, Caffeine scans only newly crawled documents and merges the results into the existing index.
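
The difference between a full rebuild and an incremental merge can be illustrated with a toy inverted index.  This is only a minimal sketch of the general idea, not Google's implementation: the function names, the whitespace tokenizer and the doc_terms bookkeeping map are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Batch approach: rescan every crawled page and rebuild the whole index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_into_index(index, doc_terms, new_docs):
    """Incremental approach: update only the postings that the newly
    crawled documents affect.

    doc_terms (a hypothetical helper map) remembers which terms each
    document contributed, so stale postings can be removed on a re-crawl.
    """
    for doc_id, text in new_docs.items():
        new_terms = set(text.lower().split())
        old_terms = doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:   # terms the page no longer contains
            index[term].discard(doc_id)
            if not index[term]:
                del index[term]
        for term in new_terms - old_terms:   # terms the page gained
            index[term].add(doc_id)
        doc_terms[doc_id] = new_terms
    return index


crawled = {"a.html": "big data search", "b.html": "web search"}
index = build_index(crawled)
doc_terms = {doc_id: set(text.lower().split()) for doc_id, text in crawled.items()}

# b.html changed and c.html is new; a.html is never rescanned.
merge_into_index(index, doc_terms, {"b.html": "web index", "c.html": "big index"})
print(sorted(index["index"]))  # ['b.html', 'c.html']
```

The merged index ends up identical to one rebuilt from scratch over all three pages, but the work done is proportional to the size of the change rather than the size of the crawl.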

Percolator and Caffeine massively reduce the latency of building web search indexes.  But they don’t solve the problem of low-latency querying of big data.  MapReduce jobs take minutes at best and hours at worst.  This is acceptable for some of the more advanced predictive analytic and machine learning tasks for which Hadoop is favored.  But for data exploration or real-time query, something more interactive is needed.

To meet these needs, Google introduced the Dremel query system.  Like the popular Hadoop Hive system, it uses a familiar SQL syntax.  However, where Hive translates SQL-like statements into MapReduce jobs, Dremel executes queries directly against the underlying data.  While MapReduce can operate against loosely typed or undefined data, Dremel requires that the data be held in a structured format described as nested columnar storage.

Column databases such as Vertica and Sybase IQ physically cluster data items by column, rather than by row.  This is done to optimize aggregate style queries, somewhat at the expense of row level operations.  Dremel’s data model stores data in columns, but also supports nested structures, which allow a more complex organization reminiscent of that used in document-oriented databases.
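
To make the row-versus-column distinction concrete, the sketch below pivots a few nested records into one list of values per field path.  It is a deliberately simplified illustration and does not attempt to reproduce Dremel's actual storage encoding; the record layout, the dotted path names and the to_columns helper are all hypothetical.

```python
def to_columns(records):
    """Pivot a list of (possibly nested) records into one value list per field path."""
    columns = {}

    def walk(prefix, value, row):
        if isinstance(value, dict):
            # Nested structure: descend and extend the dotted path.
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child, row)
        else:
            # Leaf value: store it in this path's column at the row position.
            columns.setdefault(prefix, [None] * len(records))[row] = value

    for row, record in enumerate(records):
        walk("", record, row)
    return columns


records = [
    {"url": "a.html", "stats": {"hits": 10, "bytes": 2048}},
    {"url": "b.html", "stats": {"hits": 5, "bytes": 512}},
    {"url": "c.html", "stats": {"hits": 7, "bytes": 1024}},
]

columns = to_columns(records)

# An aggregate touches a single column and never reads "url" or "stats.bytes".
total_hits = sum(columns["stats.hits"])
print(total_hits)  # 22
```

Reassembling a single complete record, by contrast, means visiting every column, which is the row-level cost the column layout trades away.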

Dremel queries are distributed hierarchically across the servers of the cluster.  Root and intermediate servers route the query to “leaf servers,” which contain the raw data that is then passed back up the tree for aggregation and eventual return to the client.  To those who’ve worked with relational systems, it resembles a giant distributed B-tree index-organized table.
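
The shape of that query tree can be sketched in a few lines.  The example below assumes, for illustration, that each leaf filters its own rows and returns a partial count per group, and that every level above simply merges what its children return; the cluster layout, field names and function names are invented for the sketch and are not Dremel's API.

```python
from collections import Counter


def leaf_query(rows, predicate):
    """Leaf server: scan local rows, apply the filter, return partial counts."""
    partial = Counter()
    for row in rows:
        if predicate(row):
            partial[row["country"]] += 1
    return partial


def merge(partials):
    """Root or intermediate server: combine the partial results from children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run_query(node, predicate):
    """A node is either a list of rows (a leaf) or {"children": [...]}."""
    if isinstance(node, list):
        return leaf_query(node, predicate)
    return merge(run_query(child, predicate) for child in node["children"])


cluster = {"children": [
    {"children": [
        [{"country": "US", "ms": 120}, {"country": "DE", "ms": 80}],
        [{"country": "US", "ms": 300}],
    ]},
    {"children": [
        [{"country": "DE", "ms": 40}, {"country": "US", "ms": 90}],
    ]},
]}

# Roughly: SELECT country, COUNT(*) ... WHERE ms < 200 GROUP BY country
result = run_query(cluster, lambda row: row["ms"] < 200)
print(dict(result))  # {'US': 2, 'DE': 2}
```

Because each level only merges small partial results, the volume of data flowing toward the root stays far below the volume scanned at the leaves.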

The Dremel SQL-like syntax supports simple SELECT statements together with aggregate and filter operations (GROUP BY, WHERE).  It supports only “small” joins, where one of the tables is tiny (under 8MB).  The “tiny” table can be propagated throughout each node of the query.
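
A join of this kind is often called a broadcast join: the tiny table is copied to every node, and each node matches its own rows against that local copy.  The sketch below is a minimal illustration under that assumption; the function, the table contents and the small_table_bytes parameter are hypothetical, and only the 8MB threshold comes from the text above.

```python
SMALL_TABLE_LIMIT_BYTES = 8 * 1024 * 1024  # the "under 8MB" limit described above


def broadcast_join(partitions, small_table, key, small_table_bytes):
    """Join every partition of a large table against a copy of the small table.

    small_table_bytes is supplied by the caller here; a real system would
    know the table's size from its own metadata.
    """
    if small_table_bytes >= SMALL_TABLE_LIMIT_BYTES:
        raise ValueError("small side of the join is too large to broadcast")

    # Build a hash lookup once; conceptually each node receives its own copy.
    lookup = {row[key]: row for row in small_table}

    joined = []
    for partition in partitions:          # each partition stands in for one node
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:         # inner join: drop unmatched rows
                joined.append({**row, **match})
    return joined


page_views = [
    [{"country_code": "US", "hits": 3}, {"country_code": "DE", "hits": 1}],
    [{"country_code": "US", "hits": 2}, {"country_code": "FR", "hits": 9}],
]
countries = [
    {"country_code": "US", "name": "United States"},
    {"country_code": "DE", "name": "Germany"},
]

joined = broadcast_join(page_views, countries, "country_code", small_table_bytes=64)
print([row["name"] for row in joined])  # ['United States', 'Germany', 'United States']
```

No rows from the large table ever move between nodes, which is why the approach only works while one side of the join stays small.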

Dremel has been used at Google since 2006 and has spawned a couple of open source imitators: Apache Drill, which is supported by Hadoop vendor MapR, and OpenDremel, though the latter appears to be inactive.

Google has also made Dremel available publicly in its BigQuery service.  BigQuery allows users to load data into Google’s Dremel servers and then query it interactively using Dremel SQL syntax.

Dremel, Caffeine and Percolator represent significant steps beyond the simpler Google technologies that formed the basis for Hadoop, and they might well mark the next stages in the evolution of enterprise big data solutions.  I would predict, however, that these will be evolutionary influences on Hadoop, rather than competing solutions.