DataOps: Modernizing BI With DevOps for Data Analytics


Over the past decade, the push for digital transformation has touched nearly every industry and has changed the game for BI. Now, every system and device has a digital trail, with data varying in structure and delivery speed. Fit-for-purpose data platforms allow for virtually unlimited raw data storage and can live at the edge, in data centers, or in the cloud as a hosted or managed service. Stream compute platforms enable processing of real-time interaction data. Lastly, data prep and advanced analytic techniques are available as self-service tools to make data scientists more effective, and also as automated processes to enable AI.

This technology revolution has made possible a true power-up of data analytics beyond BI, whereby any person, application, or machine can analyze and act upon data at the “speed of need.” It democratizes the use of data, expands the range of available data, and offers new techniques for triggering data-driven action.

Achieving this as data architectures become more complex and change more frequently requires that a new operational mindset be applied to data management. In particular, automating the building and maintenance of data pipelines is needed, as is instrumenting and continuously monitoring pipeline performance to ensure reliability and quality for data consumers. We call this practice “DataOps.”

The Challenge: Complexity, Blindness, and Change

Modernizing analytics is challenging for three reasons:

  1. Architectural complexity is growing geometrically as enterprises evolve from a one-stop-shop data warehouse for BI to a diversity of fit-for-purpose systems that are not centrally controlled and need to share data with one another.
  2. Traditional data integration tooling, hand-coding, or system-specific frameworks are poorly instrumented and cannot provide fine-grained runtime metrics, creating operational blindness—an inability to track dataflows and detect problems in data flow logic, infrastructure systems, or the data itself.
  3. The rate of change—“data drift”—is accelerating due to more systems, more data diversity, more users, and fragmented change management. Data drift is defined as unexpected, unannounced, and unending change to data structure, infrastructure, and semantics. Left unaddressed, it pollutes data and stalls analytic workflows (a minimal drift check is sketched after this list).
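
To make the data drift problem concrete, the sketch below shows one minimal way a pipeline might flag structural drift by comparing each incoming record against an expected schema. The schema, field names, and `check_drift` helper are illustrative assumptions, not features of any particular product.

```python
# Minimal sketch of structural data-drift detection (illustrative only).
# The expected schema and field names are hypothetical.

EXPECTED_SCHEMA = {"order_id": int, "amount": float, "region": str}

def check_drift(record: dict) -> list[str]:
    """Return a list of drift warnings for one incoming record."""
    warnings = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            warnings.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            warnings.append(
                f"type drift on {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in EXPECTED_SCHEMA:
            warnings.append(f"unannounced new field: {field}")
    return warnings

# A record whose upstream source silently changed 'amount' to a string
print(check_drift({"order_id": 42, "amount": "19.99", "region": "EMEA"}))
# -> ['type drift on amount: expected float, got str']
```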

The implications of these three challenges are runaway costs, lost opportunities, and increased operational and data privacy risk. Project overruns, delays, and failures result from a shortage of data engineers who can bridge multiple data movement technologies, not to mention the constant maintenance required to address data drift across those technologies. Lost business opportunities are the cost of insufficient capacity caused by inefficiency and fire-fighting. Operational efficiency suffers as data drift breaks pipelines, undermines end-to-end monitoring across tools, and lengthens root-cause determination and time-to-resolution. Data privacy compliance risk results from the lack of real-time metadata about what data is moving where, combined with frequent changes to data sources, infrastructure, and data movement logic.

Modern Analytics Demands DataOps

To overcome these obstacles, a new approach is required, one that’s focused on improving engineering productivity, operational efficiency, and architectural agility—all of which improve business confidence in the completeness and quality of the data.

The challenges listed earlier—complexity, blindness, and change—are less attributes of any specific system or data source than characteristics of the modern data architecture as a whole: it is the combination of complexity, rate of change, and increased speed of execution that makes blindness deadly. For this reason, these are fundamentally data integration issues, tied to the movement of data rather than to the systems themselves.

A decade ago, the software engineering space went through a similar crisis and transition. The pressures of developing and maintaining modern applications broke the traditional waterfall method and ushered in the DevOps revolution, enabling an agile process with automation and monitoring. Automation enabled smaller, more frequent, and higher-quality application updates, as well as code standardization and reuse. Monitoring helped accelerate root cause determination and allowed enforcement of application SLAs.

DataOps is the application of DevOps principles—specifically, automation and monitoring—to modern data analytics with the goal of reducing the analytics delivery cycle time in support of key business objectives. In the same way that DevOps helps companies deliver better applications more quickly, DataOps brings these benefits to data integration in the service of analytics, enabling companies to integrate data flow design and operations into a continuous and agile process.

Why Automation and Monitoring Are Key to Data Integration

Data integration requires two distinct capabilities: automation and monitoring. We automate tasks to reduce manual labor, accelerate delivery, reduce delivery cost, and improve quality. Modern analytics depends on building data pipelines quickly using standardized best practices and on adapting those pipelines with agility as data sources, user requirements, and storage/compute platforms change. Without automation, data movement is labor-intensive, requires specialized skills, and is prone to defects at every point in the iterative build-deploy-operate cycle.
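
As a minimal illustration of what pipeline automation can look like, the sketch below composes a pipeline from small, reusable stage functions so the same standardized logic can be redeployed against new sources. The `Pipeline` class and stage names are hypothetical, not a reference to any specific tool.

```python
# Minimal sketch of composing a data pipeline from reusable stages
# (illustrative; the Pipeline class and stages are hypothetical).
from typing import Callable, Iterable

Stage = Callable[[dict], dict]

class Pipeline:
    def __init__(self, *stages: Stage):
        self.stages = stages

    def run(self, records: Iterable[dict]):
        for record in records:
            for stage in self.stages:
                record = stage(record)
            yield record

# Reusable, standardized stages
def normalize_keys(rec: dict) -> dict:
    return {k.lower().strip(): v for k, v in rec.items()}

def add_ingest_source(source: str) -> Stage:
    return lambda rec: {**rec, "ingest_source": source}

# The same standardized stages are reused across two different sources
crm_pipeline = Pipeline(normalize_keys, add_ingest_source("crm"))
web_pipeline = Pipeline(normalize_keys, add_ingest_source("weblogs"))

print(list(crm_pipeline.run([{" Customer_ID ": 7}])))
# -> [{'customer_id': 7, 'ingest_source': 'crm'}]
```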

Monitoring matters most when adequate performance is critical but is hard to characterize beforehand and when unexpected performance problems arise. For modern analytics, performance encompasses the speed of data delivery, the health of the data, and whether it is consumption-ready. The urgency with which real-time data is consumed, as well as the impact data drift has on data health, makes continuous monitoring of every point in a pipeline critical to the performance of the application or process relying on the data.
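
One way to picture this kind of continuous monitoring is a thin wrapper that instruments every pipeline stage with throughput and error counters, so a drop in records processed or a spike in failures surfaces immediately. The metric names and the `instrument` decorator below are assumptions for illustration.

```python
# Minimal sketch of instrumenting a pipeline stage with runtime metrics
# (illustrative; metric names and the decorator are hypothetical).
import time
from collections import Counter
from functools import wraps

METRICS = Counter()

def instrument(stage_name: str):
    """Wrap a stage so every call updates throughput and error counters."""
    def decorator(func):
        @wraps(func)
        def wrapper(record):
            start = time.perf_counter()
            try:
                result = func(record)
                METRICS[f"{stage_name}.records_out"] += 1
                return result
            except Exception:
                METRICS[f"{stage_name}.errors"] += 1
                raise
            finally:
                METRICS[f"{stage_name}.seconds"] += time.perf_counter() - start
        return wrapper
    return decorator

@instrument("parse")
def parse(record: str) -> dict:
    key, value = record.split("=")
    return {key: value}

for raw in ["a=1", "b=2", "broken-record"]:
    try:
        parse(raw)
    except ValueError:
        pass  # a real pipeline would route this to an error queue

print(dict(METRICS))  # e.g. {'parse.records_out': 2, ..., 'parse.errors': 1}
```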

Automation and monitoring overcome the three key challenges—complexity, blindness, and change—raised earlier.

  • Complexity—Automating the build-deploy iterative cycle for data pipelines helps address the many-to-many diversity of modern data architectures more quickly and cost-effectively.
  • Blindness—Monitoring at each point of a pipeline or multi-pipeline data movement topology, as well as across every segment of data movement, shines a light on baseline performance and incipient performance issues.
  • Change—Automating the steps to implementing planned changes, coupled with monitoring to detect unexpected changes, allows for a smoother operation despite a more fluid environment.

Data Integration With a DataOps Mindset

What does it mean to apply the core DevOps principles of automation and monitoring to data integration?

First, automation can be applied to the iterative build/operate lifecycle of a dataflow architecture to improve productivity, enable reuse of pipeline logic, and adapt to changes quickly. This lifecycle includes pipeline development, deployment, and scaling; continuous integration/continuous deployment (CI/CD); and SLA enforcement.
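
To suggest how continuous integration might apply to pipeline logic, the sketch below shows a small automated test that validates a transformation against sample records, the kind of check a CI job could run before each deployment. The `to_metric_units` function and the sample data are hypothetical.

```python
# Minimal sketch of a CI check for pipeline logic (illustrative).
# A CI job could run these checks with pytest before each deployment;
# to_metric_units and the sample records are hypothetical.

def to_metric_units(record: dict) -> dict:
    """Transformation under test: convert a temperature field to Celsius."""
    out = dict(record)
    out["temp_c"] = round((out.pop("temp_f") - 32) * 5 / 9, 2)
    return out

def test_to_metric_units():
    sample = {"sensor": "s1", "temp_f": 212.0}
    assert to_metric_units(sample) == {"sensor": "s1", "temp_c": 100.0}

def test_original_record_not_mutated():
    sample = {"sensor": "s1", "temp_f": 32.0}
    to_metric_units(sample)
    assert "temp_f" in sample  # the stage must not mutate its input

if __name__ == "__main__":
    test_to_metric_units()
    test_original_record_not_mutated()
    print("pipeline logic checks passed")
```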

Second, continually monitoring all aspects of data movement across multiple performance dimensions, such as data delivery, quality, and privacy, creates the feedback loop required for agility and high reliability. “All aspects” of data movement means data flow performance, data characteristics, data use, data flow logic, and data flow use. Together, automation and monitoring enable faster project delivery, higher operational efficiency, and lower governance risk.
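
As a final sketch, the snippet below shows how monitored metrics could feed back into SLA enforcement: delivery lag is compared against a threshold, and a breach triggers an alert. The threshold, metric source, and `alert` hook are illustrative assumptions.

```python
# Minimal sketch of an SLA feedback loop over monitored metrics
# (illustrative; the threshold and alert hook are hypothetical).
from datetime import datetime, timedelta, timezone

MAX_DELIVERY_LAG = timedelta(minutes=5)  # hypothetical data-delivery SLA

def alert(message: str) -> None:
    # In practice this would page an operator or open a ticket.
    print(f"SLA ALERT: {message}")

def enforce_delivery_sla(last_delivered_at: datetime) -> bool:
    """Return True if the data-delivery SLA is currently met."""
    lag = datetime.now(timezone.utc) - last_delivered_at
    if lag > MAX_DELIVERY_LAG:
        alert(f"delivery lag {lag} exceeds SLA of {MAX_DELIVERY_LAG}")
        return False
    return True

# Simulate a pipeline whose last successful delivery was 12 minutes ago
stale = datetime.now(timezone.utc) - timedelta(minutes=12)
enforce_delivery_sla(stale)
```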

Self-service analytics tools, combined with the digitization of everything, open up exciting new vistas for enterprises to make better decisions and take impactful action more quickly. But this combination also puts pressure on the data supply chain by increasing the number of moving parts in the system, the rate of change of data and analytics requirements, and the urgency with which consumable data must be provided to users and applications. DataOps leverages the concepts successfully pioneered by the DevOps movement to give data practitioners a framework for becoming more operationally focused and agile.


