The “data lake” concept became popular following the explosion of interest in big data and Hadoop. The data lake was seen as a modern and more efficient alternative to the enterprise data warehouse (EDW).
Dissatisfaction with EDWs was common—particularly because of the delay involved in getting data into the EDW. Data in an EDW needs to conform to a strict data model—typically a “star” or “snowflake” schema. Transforming new data into this format required a lengthy manual data modeling process, followed by the establishment of an ETL (extract, transform, load) pipeline to ensure that newly created data found its way into the EDW.
The delay in establishing the ETL pipelines often became extreme, and the EDW—which was supposed to facilitate rapid decision making—often became a bottleneck.
One of the key advantages of Hadoop was that it was not necessary to determine the data model before loading the data. Hadoop could accept any data—structured or not. Of course, it remained necessary to determine the structure of the data before it could be utilized, but it was at least possible to capture data immediately. Furthermore, in a big data world, it was highly desirable for the original, raw, untransformed data to be kept available for future analyses. This was something the EDW could not support.
The new paradigm inspired by Hadoop was called “schema on read” as opposed to the traditional “schema on write” approach. Enterprises were encouraged to abandon the EDW in favor of a “data lake.” The data lake was a vast repository of structured and unstructured data that could be mined to achieve competitive advantage.
Unfortunately, the promise of the data lake—and, to some extent, Hadoop itself—was not realized. Although Hadoop succeeded in providing an economical mechanism for storing vast amounts of data, it did not provide a means for turning that data into knowledge. Additionally, the data itself was often difficult to decipher: schema on read sounds great if you have a clear understanding of the structure of the data, and often this understanding was missing.
Despite the failed promises of the data lake, the concept retains some resonance in larger enterprises, and so MongoDB has chosen to leverage the term for one of its latest offerings. MongoDB's Atlas Data Lake bears only superficial similarity to Hadoop-powered data lakes. Nevertheless, it’s a useful feature that stands to see significant uptake.
MongoDB's data lake feature is perhaps better compared to the “external table” feature that has long been available in Oracle and other relational database management systems. An external table is a table whose data resides outside the database. Likewise, in the Atlas Data Lake, a collection’s data resides not within a MongoDB database but as a file within a cloud object store—initially Amazon’s S3 service.
To create an Atlas Data Lake, we create a new special-purpose MongoDB server and supply that server with credentials allowing it to connect to one or more S3 buckets. Inside those S3 buckets can be files in Parquet, Avro, JSON, or CSV formats—optionally in compressed form. Collections are mapped to one or more of these files and can be queried using standard MongoDB find() and aggregate() commands. The files are, of course, read-only.
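The mapping between collections and files is expressed in a storage configuration document. A sketch of what such a configuration looks like appears below; the store, bucket, database, and collection names are invented for illustration, and the exact field names may vary between Atlas releases:

```json
{
  "stores": [
    {
      "name": "s3store",
      "provider": "s3",
      "bucket": "my-archive-bucket",
      "region": "us-east-1"
    }
  ],
  "databases": [
    {
      "name": "sample",
      "collections": [
        {
          "name": "readings",
          "dataSources": [
            { "storeName": "s3store", "path": "/readings/*.json" }
          ]
        }
      ]
    }
  ]
}
```

Here every JSON file matching the path pattern in the bucket is presented to clients as a single collection, sample.readings.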
The Atlas Data Lake provides at least two significant advantages. First, it provides a way for data in object files to be queried using the familiar MongoDB syntax, without first loading those files into MongoDB itself. Second, it potentially provides a very low-cost storage mechanism for “cold” MongoDB data. The downside is that indexing and query optimization features are not available.
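The “schema on read” model at work here is easy to sketch in miniature. The snippet below is a pure-Python simulation—not the pymongo API—with invented file and field names: it writes a JSON-lines file to stand in for an object in S3, then applies a find()-style equality/$gt filter at read time, deserializing each document only when it is queried.

```python
import json
import os
import tempfile

# Invented sample data standing in for "cold" MongoDB documents
# exported to an object store as JSON lines.
records = [
    {"_id": 1, "sensor": "t1", "temp": 21.5},
    {"_id": 2, "sensor": "t1", "temp": 35.0},
    {"_id": 3, "sensor": "t2", "temp": 19.2},
]

path = os.path.join(tempfile.mkdtemp(), "readings.json")
with open(path, "w") as f:
    for doc in records:
        f.write(json.dumps(doc) + "\n")

def find(file_path, query):
    """Minimal find()-style filter: supports equality and {"$gt": n}."""
    results = []
    with open(file_path) as f:
        for line in f:
            doc = json.loads(line)  # schema is applied on read, per document
            if all(
                doc.get(k) > v["$gt"] if isinstance(v, dict) and "$gt" in v
                else doc.get(k) == v
                for k, v in query.items()
            ):
                results.append(doc)
    return results

hot = find(path, {"temp": {"$gt": 30}})  # only _id 2 exceeds 30
```

No index is consulted: every query scans and parses the whole file, which is exactly the trade-off noted above.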
Currently, MongoDB's data lake facility is available only in the Atlas cloud offering, and only for data held in Amazon S3 buckets. Support for other cloud platforms, and potentially even for on-premise deployments, is expected to follow soon.