At Data Summit Connect Fall 2020, Ryan Gross, vice president of emerging technology at Pariveda Solutions, a management and technology consulting firm, presented a session titled "Embracing The Data Analytics Revolution," in which he explained how innovations are poised to drive process and cultural changes to data governance and lead to order-of-magnitude improvements.
Videos of presentations from Data Summit Connect Fall 2020, a free series of data management and analytics webinars presented by DBTA and Big Data Quarterly, are available for viewing on the DBTA YouTube channel.
Data governance is generally regarded as mandatory, something that must be done for reasons such as compliance and security. However, according to research, only about a quarter of organizations say they feel competent at it. Moreover, it is often seen as a hindrance to capturing value from analytics efforts rather than as a help, said Gross. This is because governance is driven from a place of fear, and fear-driven governance does not work.
The result is that data governance is the weakest pillar in a modern data enterprise, holding back the realization of value, said Gross. The problem is that most organizations' governance efforts are stuck in a loop, bouncing from reactive (after events such as well-publicized data breaches and hefty fines) to uncontrolled (when their focus and attention on governance wanes). These organizations need to move out of the continuous reactive/uncontrolled loop and into a proactive approach where governance is part of everyday activities.
The goal of data governance is to deliver on guideposts that have not changed over the years, explained Gross. These guideposts include data quality, data accessibility, compliance, availability, and security. But data governance today is still heavily dependent on manual intervention, with increasing numbers of data stewards and business-unit leads, which creates more bottlenecks in business decisions. This needs to change in order for companies to start managing and governing data consistently. Given that the problem is systemic, what is needed to fix it is systemic change, said Gross.
To improve data governance, said Gross, we need to change our "mental model" and reframe the whole problem. We need to start applying up front the rigor that we currently apply retroactively, just as software engineering changed its processes under DevOps, where continuous software deployment is now an accepted practice.
Gross looked at the origins of the DataOps movement and how they relate to the practices used for data governance.
DataOps, said Gross, combines Agile development, DevOps, and statistical process controls and applies them to data.
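As a rough illustration of what statistical process control can look like when applied to data, the sketch below derives control limits from past measurements of a pipeline metric and flags a new batch that falls outside them. The metric (daily row counts), the sample values, the function names, and the three-sigma threshold are all assumptions chosen for illustration; the session summary does not specify how Gross would implement this.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from past measurements."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(baseline, new_value, sigmas=3.0):
    """Return True when new_value falls outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return not (lower <= new_value <= upper)


# Hypothetical row counts from seven earlier pipeline runs.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

if __name__ == "__main__":
    for todays_count in (1015, 400):
        status = "ALERT" if out_of_control(baseline_row_counts, todays_count) else "ok"
        print(todays_count, status)
```

The same check could be run on null rates, distinct-value counts, or any other metric that is expected to stay stable from one run to the next.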
According to Gross, if we think of machine learning data pipelines as compilers that convert data into executable functions, and make use of data version control, data governance and engineering teams can engineer the data together. They can file bugs against data versions, apply quality control checks to the data compilers, and carry out other such activities.
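One simple way to make "a data version" concrete enough to file a bug against is to derive an identifier from the data's content. The sketch below is only an assumption about how that might be done, not a description of any tool Gross named: the `data_version` function, the record layout, and the 12-character identifier length are hypothetical.

```python
import hashlib
import json


def data_version(records):
    """Return a short, deterministic version id for a list of dict records.

    Each record is serialized to canonical JSON and the serialized rows are
    sorted, so the id does not depend on key order or row order.
    """
    lines = sorted(
        json.dumps(record, sort_keys=True, separators=(",", ":"))
        for record in records
    )
    digest = hashlib.sha256("\n".join(lines).encode("utf-8"))
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    # Hypothetical dataset with a suspicious value in the second row.
    customers = [
        {"id": 1, "country": "US"},
        {"id": 2, "country": "??"},
    ]
    # A bug report can then point at the exact data version it was seen in.
    bug = {
        "data_version": data_version(customers),
        "summary": "unexpected country code in customer table",
    }
    print(bug)
```

Because the identifier changes whenever the content changes, a fix to the data produces a new version, and the bug stays attached to the version in which it was observed.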
"We haven’t started to build on top of the capabilities that we have around versioning data and compiling it into intermediate forms," said Gross. "We don't have that concept of multiple environments with which to test, or the concept of ensuring that governance concerns like access management and privacy are baked into each stage of our data pipeline definition similar to what we do with ensuring that security best practices are baked into the way that we deploy our infrastructure and deploy our applications on that infrastructure."
Everything about your data platforms and pipelines should be re-deployable into any environment at any point, said Gross. When every aspect of developing analytics solutions is captured and tracked as code, it becomes much clearer which change introduced a failure, he added.
Define everything as code to reduce risk, increase quality, and build trust, advised Gross, who identified the key components of a data pipeline as including the following:
- Pipeline Code: This code should be separated out so that pure business logic lives in a library and platform-specific code calls the library.
- Logic Tests: Test code to ensure proper function of the business logic. These capture edge cases that may not be in the real data.
- Cloud Environments: CloudFormation Templates define the infrastructure that will be created to deploy this component. Data cloning or test data management provides the datasets that enable testing.
- Deploy Pipeline: Jenkinsfile definitions include the pipeline or build steps required to successfully get this component into production.
- Data Tests: Test code that ensures input data is correct and outputs are properly configured.
- Access & Privacy: Defines the requirements to access the data outputs produced by each pipeline stage.
- Dependencies: Defines all libraries that this component depends on to execute/test without actually including the libraries in SCM.
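To make a few of these components more concrete, the sketch below shows one possible arrangement of pipeline code, a data test, and a logic test in a single file. It is an illustration under stated assumptions, not Gross's implementation: the business rule (`net_revenue`), the order fields, and every function name here are hypothetical, and in practice the library, the platform wrapper, and the tests would live in separate modules.

```python
# --- Library: pure business logic with no platform-specific imports ---
def net_revenue(order):
    """Hypothetical business rule: amount minus refund, never below zero."""
    return max(order["amount"] - order.get("refund", 0.0), 0.0)


# --- Data test: checks the input batch before any logic runs ---
REQUIRED_FIELDS = {"order_id", "amount"}


def validate_orders(orders):
    """Return a list of problems found; an empty list means the batch passes."""
    problems = []
    for index, order in enumerate(orders):
        missing = REQUIRED_FIELDS - set(order)
        if missing:
            problems.append(f"row {index}: missing {sorted(missing)}")
        elif order["amount"] < 0:
            problems.append(f"row {index}: negative amount")
    return problems


# --- Platform-specific code: a thin wrapper that calls the library ---
def run_stage(orders):
    """Validate the batch, then apply the business logic to each order."""
    problems = validate_orders(orders)
    if problems:
        raise ValueError("; ".join(problems))
    return [
        {"order_id": o["order_id"], "net_revenue": net_revenue(o)}
        for o in orders
    ]


# --- Logic test: covers an edge case that may never appear in real data ---
def test_refund_larger_than_amount():
    order = {"order_id": 1, "amount": 10.0, "refund": 25.0}
    assert net_revenue(order) == 0.0


if __name__ == "__main__":
    test_refund_larger_than_amount()
    print(run_stage([{"order_id": 7, "amount": 10.0, "refund": 4.0}]))
```

Keeping `net_revenue` free of platform imports means the logic test can run anywhere, while only the thin `run_stage` wrapper would need to change if the pipeline moved to a different platform.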
Gross also cited key vendors that provide technologies to facilitate building out a data pipeline, and where specifically they fit.