Improving Data Quality in Data Lakes at Data Summit 2022

A common pattern in data lake and lakehouse design is structuring data into zones, with bronze, silver, and gold being typical labels. Each zone is suitable for different workloads and different consumers.

For instance, machine learning algorithms typically process against bronze or silver, while analytic dashboards often query gold. This prompts the question: Which layer is best suited for applying data quality rules and actions? The answer: All of them.

At Data Summit 2022, Stewart Bryson, chief customer officer at Qualytics, presented “Mapping Data Quality Concerns to Data Lake Zones.”

“There isn’t a lot of agreement about building zones and what we should call them,” Bryson said.

The data community has not reached universal agreement on the structure and quality of data in the different zones; the distinction between a data lake and a data lakehouse is not relevant to this discussion.

These zones consist of:

  • Bronze: Raw data, no transformations, matches source, append-only
  • Silver: Business entities, simplified, denormalized, standardized
  • Gold: Integrated, aggregated, elevated, “secret sauce”
  • Diamond: Published, products, applications, feature stores
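As an illustrative sketch (not from the talk), the bronze-to-silver-to-gold progression might look like the following with pandas; the column names and transformations are hypothetical:

```python
import pandas as pd

# Bronze: raw records exactly as received from the source, append-only
bronze = pd.DataFrame([
    {"order_id": "A1", "amount": "19.99", "ts": "2022-05-17T09:00:00"},
    {"order_id": "A1", "amount": "19.99", "ts": "2022-05-17T09:00:00"},  # duplicate
    {"order_id": "B2", "amount": "5.00",  "ts": "2022-05-17T10:30:00"},
])

# Silver: standardized business entities -- deduplicated and typed
silver = (
    bronze.drop_duplicates()
          .assign(amount=lambda d: d["amount"].astype(float),
                  ts=lambda d: pd.to_datetime(d["ts"]))
)

# Gold: aggregated, analytics-ready ("secret sauce")
gold = (
    silver.groupby(silver["ts"].dt.date)["amount"]
          .sum()
          .reset_index(name="daily_revenue")
)
print(gold)
```

Note that each zone is derived from the one before it, which is what makes replaying changes downstream possible.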

The Qualytics 8 is a guide for distinguishing the data lake zones, he explained. It includes:

  • Completeness: Required fields are fully populated
  • Coverage: Availability and uniqueness of expected records
  • Conformity: Alignment of the content to the required standards, schemas, and formats
  • Consistency: The value is the same across all data stores within the organization
  • Accuracy: Your data represents the real-world values it is expected to model
  • Precision: Your data has the resolution that is expected
  • Timeliness: Data is available when expected
  • Volumetrics: Data has the same size and shape across similar cycles
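To make two of these dimensions concrete, here is a minimal sketch of completeness and conformity checks. The record layout, field names, and scoring rules are hypothetical illustrations, not Qualytics' implementation:

```python
import re

records = [
    {"order_id": "A1", "email": "a@example.com", "amount": 19.99},
    {"order_id": "B2", "email": None,            "amount": 5.00},
]

def completeness(rows, field):
    """Completeness: share of rows where a required field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def conformity(rows, field, pattern):
    """Conformity: share of populated values matching a required format."""
    vals = [r[field] for r in rows if r[field] is not None]
    return sum(bool(re.fullmatch(pattern, v)) for v in vals) / len(vals)

print(completeness(records, "email"))                       # 0.5
print(conformity(records, "email", r"[^@]+@[^@]+\.[^@]+"))  # 1.0
```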

Remediation of data quality issues starts with enrichment: exposing anomalies and the context around them so teams can take corrective action in data pipelines. The fail-fast principle applies equally to data quality checks, because the cost of correcting bad data increases the further it moves downstream.

“You need to be able to replay these changes in these downstream zones,” Bryson said.
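A fail-fast quality gate can be sketched as follows: run checks at the zone boundary and stop the pipeline before bad records propagate downstream, where correction is costlier. The function names, checks, and threshold here are hypothetical:

```python
class DataQualityError(Exception):
    """Raised when a dataset fails a quality gate."""

def promote_to_silver(rows, checks, threshold=1.0):
    """Fail fast: refuse promotion if any check scores below the threshold."""
    for name, check in checks.items():
        score = check(rows)
        if score < threshold:
            raise DataQualityError(f"{name} scored {score:.2f}, below {threshold}")
    return rows  # in a real pipeline: write to the silver zone

# Hypothetical check: every record must carry an amount
checks = {
    "amount_present": lambda rows: sum(r.get("amount") is not None
                                       for r in rows) / len(rows),
}

good = [{"amount": 19.99}, {"amount": 5.00}]
promote_to_silver(good, checks)  # passes, data moves on

bad = [{"amount": 19.99}, {"amount": None}]
try:
    promote_to_silver(bad, checks)
except DataQualityError as e:
    print(e)
```

Stopping at the gate keeps the bad records out of silver and gold, so only the corrected bronze data needs to be replayed.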

The annual Data Summit conference returned in-person to Boston, May 17-18, 2022, with pre-conference workshops on May 16.

Many Data Summit 2022 presentations are available for review at