One of the many ways to move data quickly and efficiently across an analytics environment has been to process only changed data. This approach is generically referred to as change data capture (CDC) and incremental processing. The reason for adopting such tactics is that truly new or altered data is a small percentage of the whole. Therefore, processing only changed data can cut down greatly on the amount of data to be processed in a given cycle, or at least reduce processing needs at various points of the full cycle. For the final target processing, there will be fewer inserts, fewer updates, less work to be done, shorter cycle times, and lower resource needs.
But not everything will be simple. When a source has clearly identified change activity, processing can still become intricate. One source structure into one target structure is not how everything flows. Instead of starting and ending with a single structure, when one has, for example, three source structures that result in one target structure, how does that all work? Assuming a relational set of tables, one needs to have six source structures to make this work. Each source table needs one logical structure listing all change activity, and a second logical structure that shows the complete set of current data for each of those structures. The need for these two versions of the source are obvious. If we have three tables named A, B, and C, and three tables of just the associated change activity named A_delta, B_delta, and C_delta, a complete process only using changed data would need to pull A_delta joined to B and C UNION’ed with B_delta joined to A and C UNION’ed with C_delta joined to A and B. Simple. Of course, these two logical structures could physically be combined into one, as long as that structure can answer two questions: First, what has changed since the last time a process executed? And then, what does everything look like right now?
Should the source not provide a reliable means to know exactly what has been added, changed, or deleted, then those downstream folks desiring to obtain changed-data-only must create the data themselves. Generically, this means one must keep a copy of the data obtained from the previous cycle; then pull all the data in the current cycle; then compare value-by-value current rows to previous rows. The result of these comparisons will drive understanding of what is new, changed, and has disappeared. The comparison process is often made via a two-step approach. First, natural primary keys between current and previous are compared. Natural key values only existing inside the current version are add-activity. Natural key values only existing inside the previous version are delete-activity. For things that match between current and previous versions, then a second level of comparison must be made. Have any of the non-key values changed? Often this step can be simplified by hashing all these non-key values for a single compare versus an attribute-by-attribute compare. Differences found become the identified change-activity.
In this fashion, the CDC processing can become fairly complex. A truly functional data lake, data hub, or any other defined landing/staging area, if i it is supporting CDC, will need to support the capability for users to access any changes since a point-in-time, as well as a version of the whole current data store (representing the same point in time as the CDC content). CDC can greatly minimize the amount of data processed; but the cost is that the processes themselves become more complicated and overall storage may be higher. Costs are moved around, the final level of processing becomes focused on the minimal changes, and this minimization is the efficiency to be gained. Moving forward, using the data becomes standardized and ultimately straightforward.