
The Importance of Data for Applications and AI


Approach two lets the engineers find a technology that works for the solution and run with it. This seems like a good option if you are ready to move forward and away from existing legacy technologies; after all, these new technologies have real benefits. Many of these new point solutions are purpose-built for certain types of use cases and can deliver phenomenal performance characteristics. Of course, it takes a little extra time to put these solutions in place, as backup and recovery practices should be documented, and even practiced, to ensure they work and are well understood.

Either of these two approaches is likely to be employed within any organization creating new software solutions. The problem is that they lack foresight: they start at the wrong end of the hierarchy of business needs. Both approaches focus first on the software (a single use case) instead of on the outcome the organization hopes to achieve. It is easy for a software developer to create an arbitrary data structure for getting data in and out of a database, but thinking further down the pipe is critical. How quickly data can be accessed by downstream systems matters far more than the initial use case and its requirement for persistence, which can be optimized with the correct database choice.

Approach three requires thinking about the data that will be generated and how the business will use it, and then building a technology solution around those requirements. The first two approaches follow the standard path to building a new solution, where data and the respective schemas are a byproduct of the software that has been created. This approach puts the focus on the data first. Concentrating on the data means concentrating on the ways the data will be used, which helps ensure that it is structured so downstream applications and use cases can consume it with minimal latency.

Starting at the wrong end of the list of business requirements can lead to massive inefficiencies in downstream processes. These downstream applications could be search or recommendation engines, deep learning or machine learning models, analytics environments, or even other business applications. Without first considering the downstream use cases, each one adds that much more latency between the data and the action taken on it. The complexity is enormous when you consider that any useful dataset could have numerous downstream use cases. This can lead to a separate extract-transform-load (ETL) process for each use case, as well as a copy of the data for each one. It doesn't have to be this way, but in many environments, it is.
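The duplication described above can be sketched in a few lines of Python. This is a toy illustration, not a production pattern, and all of the names here (etl_for_search, etl_for_ml, view, and the canonical schema) are hypothetical: the point is only that per-use-case pipelines each maintain their own transform and their own copy, while a data-first design agrees on one canonical dataset and derives lightweight per-consumer views from it.

```python
# Hypothetical raw events, as a software team's initial use case might emit them.
raw_events = [
    {"user": "a", "item": "x", "ts": 1},
    {"user": "b", "item": "y", "ts": 2},
]

# Pattern 1: per-use-case ETL. Each downstream consumer gets its own
# bespoke transform AND its own copy of the data to track and secure.
def etl_for_search(events):
    # One pipeline (and data copy) shaped for a search engine...
    return [{"doc_id": e["item"], "user": e["user"]} for e in events]

def etl_for_ml(events):
    # ...and a second pipeline (and second copy) shaped for model training.
    return [{"features": (e["user"], e["item"]), "label": e["ts"]} for e in events]

# Pattern 2: data-first. Agree on one canonical schema up front, then let
# each consumer project a view from the shared dataset instead of
# maintaining a separate pipeline and copy.
canonical = [{"user": e["user"], "item": e["item"], "ts": e["ts"]} for e in raw_events]

def view(dataset, fields):
    """Project a consumer-specific view from the shared canonical dataset."""
    return [{f: row[f] for f in fields} for row in dataset]

search_view = view(canonical, ["item", "user"])
ml_view = view(canonical, ["user", "item", "ts"])
```

In the first pattern, every new consumer means another transform to maintain and another copy to govern; in the second, new consumers only add a projection over data that is already structured for downstream use.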

By not taking a data-first approach, operations and data science team members are effectively ignored. Left to their own devices, they will find ways to solve problems that commonly run counter to the rest of the enterprise architecture. This, of course, leads to more complex systems with more potential points of breakage, more undeclared consumers, and more data dependencies. All too often, software architecture stops at the core use case, but the complexities introduced downstream are part of the whole system and must be considered when contemplating the solution space.

It’s All About the Data

Constantly remind yourself that it is all about the data. Anything else will lead to a false sense of security. Speaking of security: with the EU's General Data Protection Regulation (GDPR) now in effect, security is a critical issue that cannot be overlooked. While every organization has a different set of rules and regulations to deal with, security is universally important, especially when moving into a production environment. It is often an afterthought in early-stage designs of software systems, and ignoring it can lead to substantial redesign efforts.

Most businesses have some focus on real-time interactions or reactions. Real time can no longer be treated as a second-class citizen; it must be promoted to first-class citizenship to support the rest of the architectural software stack. Systems must support access to data from a variety of tools for a variety of uses and provide full lineage support. Versioning, security, and at-scale access to all data are critical, without endless copies being made. After all, every copy that is made must be tracked and secured, which is a fundamentally unrealistic goal as well as a time-consuming and morale-busting endeavor.
