In Data Management, the More You Do, the More There is to be Done

It is surprising how often IT management wishes to view projects as events that occur once and are then forever over. The need to maintain each new code module and every new data stream should be obvious; yet that very obviousness seems to render the work invisible, at least when resources are allocated. Certainly, when a new source is ingested, as much of the process as practical should be automated so that it largely maintains itself. In practice, however, very little automation is implemented on that self-maintaining front. Ingestion tools are built to avoid failure under many circumstances; a source may add columns, and the ingestion process will keep running without raising an error.

Depending on circumstances, items such as new columns will likely be ignored, but not always. New tables, or their equivalents, may be overlooked entirely unless a specific process was created to proactively find such differences. Errors are most likely raised only when a source makes a destructive change, such as removing columns or dropping tables. Fortunately, sources should rarely, if ever, make destructive changes. Most applications design their releases and upgrades to make change easier to handle, accomplishing this by adding columns or adding tables rather than altering existing ones. Because objects are only added, most tools can simply ignore the changes and keep going. And that ignoring of change is precisely the problem.
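One way to build the proactive detection described above is to keep a snapshot of each source's schema and diff it against the current schema on a schedule, flagging additive changes (which tools silently absorb) separately from destructive ones (which break processing). The sketch below is a minimal illustration; the table and column names are hypothetical, and in practice the schemas would come from the source's catalog (for example, `information_schema.columns`) rather than hard-coded dicts.

```python
# Hypothetical schema-drift check: compare a previously saved snapshot of a
# source's schema against the current one. Each schema is modeled as a dict
# mapping table name -> set of column names.

def diff_schemas(previous, current):
    """Return additive and destructive differences between two schema snapshots."""
    report = {
        "added_tables": sorted(current.keys() - previous.keys()),
        "dropped_tables": sorted(previous.keys() - current.keys()),
        "added_columns": {},
        "dropped_columns": {},
    }
    # For tables present in both snapshots, compare column sets.
    for table in previous.keys() & current.keys():
        added = current[table] - previous[table]
        dropped = previous[table] - current[table]
        if added:
            report["added_columns"][table] = sorted(added)
        if dropped:
            report["dropped_columns"][table] = sorted(dropped)
    return report


if __name__ == "__main__":
    # Illustrative snapshots: the source added a column and a table,
    # and dropped a table -- the latter is the destructive change.
    previous = {"orders": {"id", "total"}, "legacy": {"id"}}
    current = {"orders": {"id", "total", "discount"}, "customers": {"id", "name"}}

    report = diff_schemas(previous, current)
    print("added tables:", report["added_tables"])      # silently absorbed by most tools
    print("added columns:", report["added_columns"])    # ingestion keeps running
    print("dropped tables:", report["dropped_tables"])  # would raise process errors
```

Running such a diff after each load, and alerting on any non-empty report, surfaces the additive changes that ingestion tooling would otherwise absorb without notice.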

Over time, the result is that the source system and the target system drift out of sync. In some circumstances, that drift is harmless: new data being available is fine, and until the solution has a specific need for a new item, it is inconsequential. For such systems, maintenance can stay focused on watching for destructive changes. However, if the target is a data lake intended to serve as the source for all downstream analytics, staying in sync with each source is of the utmost importance.

With each source ingested, there is more work to be done: reviewing or looking for changes, updating target structures, updating code, attending meetings on possible future changes. These tasks are all a normal part of ongoing maintenance for any ingested source. For a single source they do not take much time; but with five, ten, twenty, or more sources, even these few tasks can consume a significant amount of it. Multiple full-time head counts may need to be devoted to maintenance work. If management treats maintenance as a minor task that deserves no allocated time, something staff simply fit in between other work, problems can easily arise.

Each data feed coming into a data lake, and every process created within a solution, is a virtual child that may demand attention at any moment. Some children can safely be left alone more than others, but all of them require some amount of attention. The more components you build, of virtually anything, the more time and energy must be spent on maintenance and support. Thinking otherwise is like believing it is unnecessary to dust, vacuum, or wash dishes and clothes unless and until one decides to build a new room onto one's house. To protect your organization, it is critical to watch over the elements that have been built, keep processes running, and stay on top of change. Spend the time and resources necessary to properly maintain the solutions for which you are responsible; that will cost less than playing catch-up on too many changes after bad things have already resulted.