The Importance of DataOps

Data, data, data. With the exponential growth of data over the last 15 years, businesses have needed ways to sift through that data to find answers to their questions.

Big data technologies, which scale easily and handle vast amounts of data, began emerging around 2009 and haven’t stopped since. These technologies are maturing at a rapid pace, quickly dislodging many legacy technologies such as the traditional relational database management system. Accompanying them has been the emergence of “DevOps.” DevOps combines the operational aspect of the business with the developer’s knowledge to streamline the management of frequently changing systems that need to run at scale. New tools and techniques were devised to build, deploy, and manage these technologies, which has led to a revolution in how businesses manage their software pipeline.

There is now another shift occurring in this area called “DataOps.” While DevOps covers the software that runs the operational side of the house, DataOps applies to the data side: the management and versioning of data, data models, and the queries used to generate business intelligence, all of which depend on data being available at a specific point in time. Consider all the SQL queries written to drive dashboards and the key performance indicators used to make decisions within a business; when the underlying data model changes, those queries and dashboards can break. The data models created to generate new knowledge for a company depend on a variety of data sources that grow rapidly, are rarely well-structured, and are cleaned and normalized only after they land. Combine all these details and the fragility of the system quickly becomes apparent.
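To make that fragility concrete, here is a minimal, hypothetical sketch using Python’s built-in sqlite3 module. The table, column names, and values are all invented for illustration; it simply shows a “dashboard” query that works until the underlying model changes out from under it:

```python
import sqlite3

# A toy data model backing a revenue dashboard (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.execute("INSERT INTO sales VALUES ('EMEA', 1200.0), ('APAC', 950.0)")

# The query a dashboard might run, versioned in source control.
dashboard_query = "SELECT region, SUM(revenue) FROM sales GROUP BY region"
print(conn.execute(dashboard_query).fetchall())  # works against the original model

# The data model changes: the column is renamed.
# (RENAME COLUMN requires SQLite 3.25 or later.)
conn.execute("ALTER TABLE sales RENAME COLUMN revenue TO gross")

try:
    conn.execute(dashboard_query)
except sqlite3.OperationalError as e:
    # The versioned query is now broken -- nothing in source control changed.
    print("dashboard broken:", e)
```

The code never changed, yet the dashboard broke, which is exactly why versioning the code alone is not enough.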

This new twist on DevOps is bringing together software engineers, systems administrators, and data scientists. Creation of a successful DataOps strategy is going to be critical to the long-term success of every business with a dependence on data. Whether the data is in files, event streams, or a database, the models being built to leverage the knowledge hidden deep within that data must be controlled in a manner that is commensurate with the rest of the software stack that runs the business.

An important word of caution: Do not be fooled when researching DataOps into thinking that source control alone is the answer. This is so far from the truth that it is scary. A data scientist may write code in any of a variety of languages, such as R, Python, Scala, or even SQL, but that is just code. The critical detail is that all of that code is tightly coupled to the data, the locations of the data, and the model of the data. If only the code is controlled and versioned, what happens when another dataset is added to the mix? What happens when the data model changes? These models don’t traditionally operate over a couple hundred megabytes, or even a few gigabytes; they operate on tens or hundreds of gigabytes, often terabytes, of data. That data is typically stored across a variety of systems, which ramps up the complexity of versioning or snapshotting it at a given point in time.
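One way to see what tying data to code could look like is a minimal sketch of recording a manifest of dataset fingerprints alongside a code commit. This assumes local files and SHA-256 content hashing; real data-versioning systems work incrementally and at far larger scale, and every file name here is hypothetical:

```python
import hashlib
import json
from pathlib import Path

def snapshot_datasets(paths):
    """Record a content hash for each dataset file, so a model run can be
    tied to the exact data it saw, not just the code that processed it.
    Illustrative only -- not suited to terabyte-scale data as written."""
    manifest = {}
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        manifest[str(p)] = digest
    return manifest

# Hypothetical usage: snapshot the data and store the manifest with the code.
Path("customers.csv").write_text("id,name\n1,Ada\n")
manifest = snapshot_datasets(["customers.csv"])
Path("data_manifest.json").write_text(json.dumps(manifest, indent=2))
```

If the dataset changes, the hash changes, and the mismatch with the committed manifest makes the drift visible instead of silent.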

To be as clear as possible: DataOps is not just about managing the data science work created to deliver business value. It is the combination of all of the data-related elements and all of the software that runs the operations of the business. DevOps plus data (files, tables, streams, and the models using them) equals DataOps. DataOps is not some myopic little thing; it has broad implications for bringing continuity and agility to the business.

This is not easy. DevOps is not easy to begin with, and DataOps makes it that much more difficult. Beyond the general complexity of the systems and processes involved, we can’t forget about the people. Software engineers, system administrators, and data science teams usually operate very differently from one another, so there will likely be hurdles that are both operational and cultural in nature. Organizations looking to better deal with fast-growing data and big data technologies must understand and adopt a DataOps approach to make these technologies successful and to maximize the impact of data on their business.
