DataOps has emerged as an agile methodology to improve the speed and accuracy of analytics through new data management practices and processes—from data quality and integration to model deployment and management.
Traditional methodologies for handling data projects are too slow to keep pace with the teams working with the technology. The DataOps Manifesto was created as a response, borrowing from the Agile Manifesto.
Eric Schiller, senior data engineer, Excella, expanded upon this concept during his Data Summit Connect presentation, “Data Engineering at the Speed of Software Development: Introducing the DataOps Manifesto.”
To watch the video of Eric Schiller's presentation at Data Summit Connect 2020, go here.
Data management evolved separately from web development but is hitting similar walls, Schiller explained. Data projects usually require knowing the whole picture up front, and moving or changing data is a huge investment.
“Data can be messy,” Schiller said.
This is where the DataOps Manifesto comes in, Schiller explained. Companies can apply what works from the past, define a starting point for discussion, and adjust as necessary.
The manifesto includes what big data professionals have come to value in analytics, including:
- Individuals and interactions over processes and tools
- Working analytics over comprehensive documentation
- Customer collaboration over contract negotiation
- Experimentation, iteration, and feedback over extensive upfront design
- Cross-functional ownership of operations over siloed responsibilities
The principles include continually satisfying your customer, valuing working analytics, and embracing change.
“This is really a balancing act in the data world,” Schiller said.
Additional principles of the manifesto include working on a team, daily interactions, the ability to self-organize, reducing heroism, and reflection.
“The DataOps mindset is really focused on getting the team involved, getting the organization involved with it,” Schiller said.
The principles go on to stress that analytics is code and that it is important to orchestrate, make work reproducible, use disposable environments, and embrace simplicity.
Analytics is manufacturing and quality is paramount, Schiller said. Companies should monitor quality and performance, improve cycle times, and not be afraid to reuse tools or data.
“Keep a human in the loop,” Schiller said. “When dealing with machine learning, have someone take a look at every step of the process.”
To enact DataOps, companies should test inputs (data) and outputs (logic), use version control for data and artifacts, and branch and merge, according to Schiller.
Other good practices include branching your environments just like your code, utilizing containers, and “fake it til you make it.”
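As a rough illustration of testing inputs (data) and outputs (logic), the sketch below wraps a small transformation with a check on each side. The order records, the field names `order_id` and `amount`, and the functions `check_inputs`, `total_revenue`, and `check_output` are all hypothetical, invented for this example; they do not come from Schiller's presentation.

```python
def check_inputs(rows):
    """Input (data) test: return a list of problems found in the raw rows."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        amount = row.get("amount")
        if order_id is None:
            problems.append(f"row {i}: missing order_id")
        elif order_id in seen_ids:
            problems.append(f"row {i}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    return problems


def total_revenue(rows):
    """The transformation under test (hypothetical business logic)."""
    return sum(row["amount"] for row in rows)


def check_output(rows, total):
    """Output (logic) test: the result must be plausible given the input."""
    problems = []
    if total < 0:
        problems.append("total revenue is negative")
    # With non-negative amounts, the total can never be below the largest order.
    if rows and total < max(row["amount"] for row in rows):
        problems.append("total revenue is smaller than the largest order")
    return problems
```

A pipeline step might call `check_inputs` before running `total_revenue` and `check_output` afterward, refusing to publish results when either list is non-empty.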
“Start working on a culture where you have highly performing data teams,” Schiller said.
Chris Bergh, CEO and head chef, DataKitchen, expanded upon these concepts with his presentation, “How to Get Started with DataOps Today.”
To watch the video of Chris Bergh's presentation at Data Summit Connect 2020, go here.
What you do is much less important than how you do it, Bergh said. What makes the machine is more important than the machine itself.
Companies implement DataOps because of business problems with slow development, too many errors, poor coordination, and no measurement.
To improve with DataOps, companies should create automated tests, count the errors, and put all the ETL, model, and visualization code in Git, among other steps, Bergh said.
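One way to picture "create automated tests" and "count the errors" together is a small runner that executes a set of named checks and tallies the results, so the error count can be tracked from run to run. This is a minimal sketch under assumed names: `run_checks` and the shape of the checks dictionary are hypothetical and are not taken from Bergh's talk or from any DataKitchen product.

```python
from collections import Counter


def run_checks(checks, data):
    """Run each named check against the data and tally passes and failures.

    `checks` maps a check name to a function that returns a truthy value
    when the data passes. A check that raises is counted as a failure.
    """
    tally = Counter()
    failures = []
    for name, check in checks.items():
        try:
            ok = bool(check(data))
        except Exception:
            ok = False  # a crashing check is still an error worth counting
        tally["passed" if ok else "failed"] += 1
        if not ok:
            failures.append(name)
    return tally, failures
```

Logging `tally["failed"]` after every pipeline run gives a simple measurement where previously there was none.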
The annual Data Summit conference is going digital this year with Data Summit Connect, from June 8 to June 11, due to the ongoing COVID-19 pandemic.
Webcast replays of Data Summit Connect presentations are available on the DBTA website at www.dbta.com/DBTA-Downloads/WhitePapers.