How to Solve Data Downtime Through Data Observability

At Data Summit Connect 2021, Monte Carlo CEO and co-founder Barr Moses outlined the five pillars and key advantages of data observability in practice.   

According to Moses, solving data downtime, in other words, not flying blind, is easier than you might think. "All it takes is a holistic, automated approach to data observability," she explained. "So, what is data observability? Well, if you hadn't guessed it already, data observability is an organization's ability to fully understand the health of the data in their system. Data observability eliminates data downtime by applying best practices of DevOps and observability to data pipelines. Like its DevOps counterpart, data observability uses automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues. This, in turn, leads to more reliable data, healthier pipelines, more productive teams, and customers that can trust the data that your service is providing."

Every data team has pillars of observability for data reliability and data trust, said Moses, who broke this down into five major areas of data health that are strong indicators of whether or not something has broken or gone wrong: freshness, distribution, volume, schema, and lineage.

"The first one, freshness, seeks to understand how up-to-date your data tables are as well as the cadence at which your tables are updated. Freshness is particularly important when it comes to decision making, as we all know that stale data is basically synonymous with time and money. How often have you been looking at a dashboard in Looker or Tableau or Mode and realized that the data is missing or outdated or does not represent the reality of the data flowing in your system? It has happened to the best of us. And so that's why freshness is one of the core pillars that you need to monitor and alert for when something veers from the norm." The next one is distribution. "Distribution tells you if your data is within an accepted range. Data distribution gives you insight into whether or not your data tables can be trusted based on what you can expect from your data.

"The third one is volume and it refers to the completeness of your data tables and offers insights into the health of your data sources. So, if 200 million rows suddenly turns into five million, you should probably know, and that could indicate that something is off. That's not to say that every time that happens, that means that something is off. But if you are aware of what's going on your system, you can adjust and alert the teams that need to know to fix things accordingly."

The fourth pillar is schema. "Schema refers to changes in the organization of your data. In other words, a schema change often indicates broken data, and monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem, not to mention preventing issues from occurring again. And so for data engineers, actually, we did a quick survey among some of our customers, and data engineers say that the number-one cause of data issues for them is schema changes, because schema changes can throw everything off in a pipeline, particularly one that kind of runs automatically. So that's something to keep in mind. And I'm sure you've dealt with them if you are a data engineer."
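
One simple way to picture schema monitoring is to compare two snapshots of a table's columns and report what changed. The sketch below assumes each snapshot is a plain mapping of column name to type; the function name and the sample columns are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and report added,
    removed, and retyped columns."""
    added = {c: new[c] for c in new if c not in old}
    removed = {c: old[c] for c in old if c not in new}
    retyped = {
        c: (old[c], new[c])
        for c in old
        if c in new and old[c] != new[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


if __name__ == "__main__":
    yesterday = {"order_id": "INTEGER", "amount": "FLOAT", "region": "STRING"}
    today = {"order_id": "STRING", "amount": "FLOAT", "country": "STRING"}
    for kind, columns in diff_schema(yesterday, today).items():
        print(kind, columns)
```

A real monitor would also need to record who made the change and when, which this snapshot comparison alone does not capture.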

The final pillar is lineage. "And, in my opinion, my favorite and the most important one. When data breaks, the first question that teams ask is always where, so data lineage provides the answer by telling you which upstream sources and downstream ingesters are impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects metadata that speaks to governance, business, and technical guidelines associated with specific data tables. The lineage serves as sort of a single source of truth for all data consumers and, in our opinion, either lineage or metadata without the other is useless. Teams are collecting a lot of metadata these days, but if you're not able to apply it to a specific business use case, then what's the point? Lineage helps you with that."
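
To make the "where" question concrete, lineage can be modeled as a directed graph and walked in either direction from the table that broke. The sketch below is a minimal breadth-first traversal; the edge map and table names are invented for this example, and real lineage would be derived from query logs or pipeline metadata.

```python
from collections import deque


def reachable(edges, start):
    """Return every node reachable from `start` by following
    edges, excluding `start` itself (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip edge direction so the same walk finds upstream sources."""
    flipped = {}
    for parent, children in edges.items():
        for child in children:
            flipped.setdefault(child, []).append(parent)
    return flipped


# Hypothetical lineage: each table maps to its direct consumers.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.revenue", "analytics.churn"],
    "analytics.revenue": ["dashboard.exec_kpis"],
}

if __name__ == "__main__":
    broken = "staging.orders"
    print("downstream:", sorted(reachable(LINEAGE, broken)))
    print("upstream:", sorted(reachable(invert(LINEAGE), broken)))
```

Attaching owners and consumers to each node as metadata is what would turn a graph like this into the single source of truth described in the quote.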