
The Importance of Data for Applications and AI


Within and across the variety of downstream use cases, a poor enterprise architecture makes it inevitable that data copies will run rampant. Copying data doesn’t seem that bad at the beginning. It is easy, and it will probably only be done once or twice. Then the next person comes along and makes a copy too. You can see where this is going: The larger the dataset, the bigger the problem, because copies waste space and time. After enough people have copied enough datasets, it becomes difficult to track back to the true owner of the data, or even to its original source. Did this copy come from the original, or from Billy, who made a copy and changed it for his use case? Lineage tracking rarely happens when individual users copy data, not because it isn’t possible, but because they don’t use the tools that plug into lineage tracking software. They make copies on an ad hoc, need-it-now basis.
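To make that lineage point concrete, here is a minimal, hypothetical sketch of the bookkeeping an ad hoc copy skips: recording where a dataset came from, who derived it, and why. The function and field names are illustrative only, not the API of any particular lineage tool, which would capture far more (job IDs, schemas, column-level lineage) and do so automatically.

```python
import getpass
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def copy_with_lineage(src: str, dst: str, note: str = "") -> None:
    """Copy a dataset file and write a small lineage record beside the copy."""
    shutil.copy2(src, dst)
    lineage = {
        "source": str(Path(src).resolve()),
        "destination": str(Path(dst).resolve()),
        "copied_by": getpass.getuser(),
        "copied_at": datetime.now(timezone.utc).isoformat(),
        "note": note,  # e.g., why this copy exists and how it differs
    }
    Path(dst + ".lineage.json").write_text(json.dumps(lineage, indent=2))

# Hypothetical usage:
# copy_with_lineage("orders.parquet", "orders_billy.parquet",
#                   note="renamed columns for the churn model")
```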

We haven’t even talked about copies between on-premise systems and cloud systems. This happens all the time, and those copies also have no lineage tracking; don’t get me started on the separate security model, either. Every time data is copied, there is a risk of creating data dependencies without any notion of who the consumer is, or even that a consumer exists. What most people either don’t know or haven’t taken the time to think about is the cost of those data dependencies. They are considerably more expensive than code dependencies, which by now are well understood and easy to track, making change easy to manage. The problem with data dependencies is that when upstream applications, tools, or jobs change the data structure, downstream software tends to break or behave unexpectedly. These data dependencies are seldom known to the upstream application owners, leading to vast amounts of time wasted tracking down the root cause of any problems.
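As a hypothetical illustration of why a hidden data dependency is costlier than a code dependency, consider a downstream job that reads a column by name. The column, data, and function names below are made up for the example; the point is that the breakage only surfaces at run time, long after the upstream change.

```python
import csv
import io

# Upstream team's original export: a CSV with a "customer_id" column.
upstream_v1 = "customer_id,total\n42,99.50\n"

# Upstream later renames the column to "cust_id" without knowing who reads it.
upstream_v2 = "cust_id,total\n42,99.50\n"

def monthly_totals(csv_text: str) -> dict:
    """Downstream job: sum totals per customer, keyed on 'customer_id'."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Nothing warns the downstream owner; the job simply fails here.
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + float(row["total"])
    return totals

print(monthly_totals(upstream_v1))  # works: {'42': 99.5}
print(monthly_totals(upstream_v2))  # breaks at run time: KeyError: 'customer_id'
```

A code dependency on a renamed function would have failed loudly at build or import time; the data dependency fails silently until the job runs against the new export.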

The final area to consider here is that downstream workloads are becoming more complex and varied. Batch-style workloads such as large-scale aggregation have been around for a while and, when combined with real-time workloads, can lead to a lot of data movement. On top of that, data science teams building models to feed more intelligent decisions back into the operational systems may depend on GPU hardware to solve problems in a reasonable amount of time. The data from each of these three types of workloads must be available to each of the other environments. And don’t forget about security again, or yet more data copies.

What’s Ahead

A focus on the underlying enterprise architecture is necessary for organizations to achieve a data-first approach and reduce the mounting cost of technical debt. The only way to support the vast array of downstream use cases in the future is to use fewer point solutions. A point solution may be fine for the short term, but it can easily cost far more in the time spent jockeying data around and managing the systems than it contributes to the core capabilities required to deliver the desired business outcomes.

In terms of enterprise architecture, here is what this all means: a reduction in delays for all downstream systems; reduction or prevention of data copying; reduction or removal of unnecessary ETL (using it only when required); support for real-time processes; support for versioning of everything (not just files); support for edge, on-premise, and cloud deployments with a single architecture; and, most of all, support for AI and analytics to solve problems and enable new revenue streams. The strategy needs to be this: Focus on the desired business outcomes, put data first, and pay off that high-interest technical debt.
