Two Key Questions When Planning a Staging Area for a Data Lake

In establishing a staging or landing area for a data lake, a data hub, or a traditional data warehouse environment, structures need to be established that mimic source data in support of two very basic questions. First, “What does the current source dataset look like?” And second, “What change activity has occurred against the source since the last time it was interrogated?”

As simple as these questions sound, supporting the ability to answer them clearly, quickly, and consistently can be quite involved. The incoming data is essentially one of two kinds: transactional or snapshot.

A transactional source may inherently consist of adds, changes, and deletes, or it may consist of transactions for unique events that, once they occur, cannot change, such as clicks on a website. Transactional sources naturally provide the answer to the what-has-changed question. Transactional sources of unique events quite naturally answer both questions, as nothing that has happened can ever change.
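The unique-event case can be sketched briefly. Below is a minimal Python illustration, with hypothetical click records (the field names are assumptions for the example): because events are immutable, the current state is simply everything ever received, and the change activity since the last load is just a filter on the event timestamp.

```python
from datetime import datetime

# Hypothetical click events; field names are illustrative, not from any
# specific system.
clicks = [
    {"click_id": 1, "url": "/home", "event_ts": datetime(2024, 1, 1, 9, 0)},
    {"click_id": 2, "url": "/cart", "event_ts": datetime(2024, 1, 1, 9, 5)},
    {"click_id": 3, "url": "/checkout", "event_ts": datetime(2024, 1, 2, 8, 0)},
]

def current_state(events):
    # For immutable events, the current state is every event ever received.
    return events

def changes_since(events, last_load_ts):
    # What has changed since the last interrogation: any event newer than
    # the previous load point.
    return [e for e in events if e["event_ts"] > last_load_ts]
```

One structure, two answers: `current_state(clicks)` returns all three events, while `changes_since(clicks, datetime(2024, 1, 1, 12, 0))` returns only the checkout click from the next day.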

Snapshots are associated with batch processing approaches, and usually contain a full set of current data at the point in time the snapshot is made. Snapshots therefore excel at answering what the source system’s current state is. As one can imagine, a snapshot cannot address what has recently changed, unless the source system itself contains change dates at the desired level of granularity, and even then without old/new value tracking.

In general, answering the two basic staging questions calls for two differing logical structures: one an exact copy of the source, the other a detailed listing of every change. Except for the unique-event transaction structure, wherein one set of data serves both purposes, our sources directly provide one piece of the puzzle, and the other piece must be derived.

Transactions answer what-has-changed, but those same transactions must either be applied somewhere, or the record representing the latest version of each natural key must be constantly identified. To identify changes for snapshots, the current version must be compared to the previous version and the differences tracked and quantified. Note that the point-in-time nature of snapshot versions allows for only one detected change per cycle, even when multiple changes may have occurred within that cycle. Natural keys that appear in the current version but not the previous are “Adds”; natural keys present in the previous version but missing from the current are “Deletes.”
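The snapshot comparison just described can be sketched as a short Python routine. This is a minimal illustration, assuming each snapshot is held as a dictionary keyed by the natural key; the customer keys and attributes below are invented for the example.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by natural key.

    Classifies each difference as an Add ('A'), Change ('C'), or
    Delete ('D'). A snapshot compare can detect at most one change
    per key per cycle, no matter how many actually occurred.
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append((key, "A", row))      # Add: key appeared
        elif row != previous[key]:
            changes.append((key, "C", row))      # Change: attributes differ
    for key, row in previous.items():
        if key not in current:
            changes.append((key, "D", row))      # Delete: key disappeared
    return changes

prev = {"cust-1": {"name": "Ann"}, "cust-2": {"name": "Bob"}}
curr = {"cust-1": {"name": "Anne"}, "cust-3": {"name": "Cay"}}
deltas = diff_snapshots(prev, curr)
```

Here `deltas` reports one Change (cust-1), one Add (cust-3), and one Delete (cust-2), which is exactly the derived piece of the puzzle a snapshot source cannot supply on its own.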

Physically, a single structure may be created that contains the answers for both questions. This structure would comprise a part emulating the original source data, plus a designation of the row as added, changed from previous, or deleted, and lastly a datetime for the change activity. The primary key for such a structure could be a completely generated surrogate, or a composite of the source’s natural key plus the change datetime. The potential issues with such a single structure concern performance. Those wanting a current view of the data must weed through all the changes to find the latest row for each key; if implemented as a relational table, an awkward self-join against the maximum change datetime per key might result. The clumsiness of the query can be diminished somewhat by adding an attribute that flags which instance is current. But then not only the initial processing is impacted: rather than just inserting new records, with every change an update must also be made against the previous record to clear its current flag.
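The trade-off in the single-structure design can be made concrete with a small Python sketch. The row layout and names here are assumptions for illustration; the point is that every incoming change requires an update to the prior row for the same key, not just an insert.

```python
from datetime import datetime

# Single staging structure: source columns plus change metadata.
# Assumed row layout for illustration:
#   natural_key, attributes, change_type ('A'/'C'/'D'), change_dt, is_current
staging = []

def record_change(natural_key, attributes, change_type, change_dt):
    # The cost of the single-structure design: each new change must also
    # UPDATE the previous row for the same key to clear its current flag.
    for row in staging:
        if row["natural_key"] == natural_key and row["is_current"]:
            row["is_current"] = False
    staging.append({
        "natural_key": natural_key,
        "attributes": attributes,
        "change_type": change_type,
        "change_dt": change_dt,
        "is_current": True,
    })

def current_view():
    # Current state: the latest, non-deleted row per key.
    return [r for r in staging if r["is_current"] and r["change_type"] != "D"]

record_change("cust-1", {"name": "Ann"}, "A", datetime(2024, 1, 1))
record_change("cust-1", {"name": "Anne"}, "C", datetime(2024, 1, 2))
```

Without the `is_current` flag, `current_view` would instead have to scan for the maximum `change_dt` per key; with it, the history table keeps both rows for cust-1 but only the Jan 2 version is current.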

The alternative is to embrace two separate structures, i.e., one a copy of the source, the other a list of changes. In this fashion, each structure can be optimized for the one question it supports. Notably, the downside to this second approach is the necessity of maintaining two distinct structures in sync. One needs to pick one’s poison and balance performance limits and needs.
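The two-structure alternative can be sketched in the same illustrative style, again with assumed names. Each structure answers one question directly: the source mirror answers current state, the append-only log answers change activity, and neither ever requires updating old log rows.

```python
from datetime import datetime

# Two optimized structures:
current_copy = {}   # natural key -> latest attributes ("current state?")
change_log = []     # append-only history ("what has changed?")

def apply_change(natural_key, attributes, change_type, change_dt):
    # Append to the change log; old log rows are never touched.
    change_log.append({"natural_key": natural_key, "attributes": attributes,
                       "change_type": change_type, "change_dt": change_dt})
    # Keep the mirror of source in step: the maintenance burden of this design.
    if change_type == "D":
        current_copy.pop(natural_key, None)
    else:
        current_copy[natural_key] = attributes

apply_change("cust-1", {"name": "Ann"}, "A", datetime(2024, 1, 1))
apply_change("cust-1", {"name": "Anne"}, "C", datetime(2024, 1, 2))
apply_change("cust-1", None, "D", datetime(2024, 1, 3))
```

After the delete, `current_copy` is empty while `change_log` retains all three rows of history; the price of fast answers to both questions is writing to, and keeping consistent, two structures per change.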