Data Observability and DataOps
It is not always capabilities or features that distinguish one solution from another; often it is the purpose and the problem each was designed to solve.
The issue with big data is that more data sources, more complex transformations, and faster data flows mean more surface area to maintain and less time to apply fixes without visible impact. As maintenance overhead increases, data issues become more frequent and more severe. Consequently, when a data pipeline breaks down, the use cases that depend on its data begin to falter.
In most cases, this erodes trust in data products. Creating more data without increasing the capacity to manage it therefore only leads to greater chaos.
Gaining visibility into an ever-expanding, accelerating data landscape and keeping it manageable has never been the remit of the data quality discipline; it is the remit of DataOps.
Although data observability capabilities resemble those of data quality management, they address a more complex DataOps problem: visibility over an ever-growing number of data workflows and an expanding data landscape. The fundamental challenge now is to get timely insight into the state of the data landscape, especially when complex cloud-native ecosystems built on massive data volumes make a system’s behavior even harder to predict.
Data Observability Platform Capabilities
Data observability provides data teams with the ability to measure the health and usage of data within their pipelines, as well as health indicators of the overall ecosystem.
It uses automated logging and tracing information to interpret the health of datasets and pipelines, so that data engineers and data analysts can identify, and be alerted to, data-related issues across the data ecosystem. This makes it much faster and easier to resolve data-induced issues at their root rather than just patching them as they arise.
Data observability platforms are built on several main components and capabilities:
- Data pipeline tracking includes the time of completion or failure, the time it took to process data, and the amount of data added, deleted, or changed within a given time frame. Any deviations detected against manually set thresholds or by an ML engine should be reported via issue tracking systems, with the responsible people alerted through the respective channel. This type of tracking does not cover the infrastructure and applications that support data pipelines. Infrastructure and application observability and reliability have their own methodology and are out of scope for data observability. For data observability, it is important to know when a pipeline has failed and whether the reason is data related.
- Data quality assessment ranges from high-level metrics, such as data volume and update frequency, down to field-level metrics. Data quality metrics are always context-dependent, but the following are a necessary starting point:
- Freshness, or how often each table is updated, with an alert if an update is delayed
- Data volume changes for a table, with an alert if the table grows or shrinks unexpectedly
- Schema changes, which can include added, removed, or updated fields and deleted tables
- Field metrics derived from profiling, such as null or empty values, min, max, median, etc.
Basic metrics can be extended with more complex checks:
- Field patterns, which in a simple case can be expressed as a regular expression, e.g., for a phone number, account number, or email address
- Custom business rules, which can apply to a particular field, across fields, or across tables, and require complex SQL or other means of verification
Automatic ML-driven adaptive rules and thresholds can be introduced over time, once the engine has had enough data and iterations to learn what “normal” means for a given field.
- Data lineage of the data entities being monitored is necessary to show data engineers where exactly the issue occurred, the potential impact on data products, and why the issue might have happened (potential areas of failure upstream). Overlaying data quality assessment results on the lineage graph at the field level provides an extremely effective and easy-to-use tool for understanding pipeline health and visually detecting problem zones.
- A machine learning engine provides automated anomaly detection, data quality rule suggestions, and self-adaptation over time. It should be backed by the option of manual intervention and a way to flag falsely triggered alerts and incorrectly created business rules, which the ML engine will use to learn and improve.
- Data issue tracking is an essential part of the platform. It accumulates a history of incidents, gives teams statistics about the accuracy of the detection engine, and enables them to analyze the effectiveness of the resolution methods applied. This history also serves as an input to the continuously adapting ML engine. The issue resolution workflow, however, can be either part of the platform or an integration with a third-party tool.
- Incident alerting via multiple channels is necessary to notify data teams and users when conditions are breached or anomalies are detected.
- Connectivity makes it possible to collect metadata and monitor data across the storage solutions of the modern data stack. It should be possible to extend connectivity to any storage system via an SDK and APIs.
- APIs are a must-have for a platform to provide flexible interaction and create input/output integrations with third-party tools.
- Security mechanisms should allow administrators to manage permissions for platform access and setup, and to authorize metadata visibility for different groups of users and roles in the system. Integration with user management tools should be possible.
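As a rough illustration of the basic data quality checks listed above (freshness, volume changes, schema changes, field profiling, and field patterns), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the TableSnapshot structure, the function names, the thresholds, and the deliberately simplified email pattern do not come from any particular platform, and a real deployment would collect these values from warehouse metadata rather than from in-memory objects.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class TableSnapshot:
    """Hypothetical metadata captured for one table at one point in time."""
    last_updated: datetime
    row_count: int
    columns: Dict[str, str]  # column name -> declared type


def check_freshness(snap: TableSnapshot, now: datetime,
                    max_delay: timedelta) -> Optional[str]:
    """Return an alert message if the table was not updated within max_delay."""
    delay = now - snap.last_updated
    if delay > max_delay:
        return f"freshness: last update was {delay} ago (limit {max_delay})"
    return None


def check_volume(prev: TableSnapshot, curr: TableSnapshot,
                 max_change_ratio: float) -> Optional[str]:
    """Return an alert if the row count grew or shrank by more than the ratio."""
    if prev.row_count == 0:
        return None  # no baseline to compare against
    change = abs(curr.row_count - prev.row_count) / prev.row_count
    if change > max_change_ratio:
        return f"volume: row count changed by {change:.0%}"
    return None


def schema_changes(prev: TableSnapshot, curr: TableSnapshot) -> List[str]:
    """List added, removed, and retyped columns between two snapshots."""
    changes = [f"added: {c}" for c in sorted(curr.columns.keys() - prev.columns.keys())]
    changes += [f"removed: {c}" for c in sorted(prev.columns.keys() - curr.columns.keys())]
    changes += [f"updated: {c}"
                for c in sorted(prev.columns.keys() & curr.columns.keys())
                if prev.columns[c] != curr.columns[c]]
    return changes


def null_rate(values: List[Optional[str]]) -> float:
    """Share of values that are None or empty strings (a simple profile metric)."""
    if not values:
        return 0.0
    return sum(v is None or v == "" for v in values) / len(values)


# Deliberately simple pattern for illustration; real email validation is stricter.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def pattern_violations(values: List[Optional[str]],
                       pattern: "re.Pattern[str]") -> List[str]:
    """Return non-empty values that do not fully match the expected pattern."""
    return [v for v in values if v and not pattern.fullmatch(v)]


# Example: compare yesterday's snapshot with today's and collect alerts.
now = datetime(2024, 1, 2, 12, 0)
prev = TableSnapshot(datetime(2024, 1, 1, 0, 0), 1000,
                     {"id": "int", "email": "text"})
curr = TableSnapshot(datetime(2024, 1, 1, 6, 0), 400,
                     {"id": "bigint", "email": "text", "phone": "text"})

alerts = [a for a in (check_freshness(curr, now, timedelta(hours=24)),
                      check_volume(prev, curr, max_change_ratio=0.5)) if a]
alerts += schema_changes(prev, curr)
```

The fixed thresholds here stand in for the manually set rules mentioned in the list; an ML-driven engine would replace them with limits learned from the history of each metric.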
Since data observability is a relatively new area in the data space, these capabilities will evolve over time, but for now they are enough to enhance the trustworthiness of data pipelines. Combined with proven DataOps practices, data observability can help create more reliable workflows and deliver timely, accurate data at a larger scale, providing the safety net needed to spur greater data-driven innovation.