How ETL Powers the Data Fabric at Data Summit 2024

Using a data fabric architecture to solve a slew of an organization's operational problems is a popular—and powerful—avenue to pursue. But while the data fabric is widely acknowledged as a formidable enabler of enterprise data success, how do businesses effectively power these architectures?

At the annual Data Summit conference, the session “Data Fabric Key Enablers,” led by John Bagnall, senior product manager, Matillion, illustrated how ETL plays an integral role in data fabric implementation, enabling enterprises to easily and efficiently manage, integrate, and analyze data from diverse sources.

The annual Data Summit conference returned to Boston, May 8-9, 2024, with pre-conference workshops on May 7.

Bagnall offered a central thesis, explaining that “the ETL pipeline is the vehicle that drives successful data fabric strategy.”

As to why one might choose a data fabric, according to Bagnall, it eases the adoption of a new data platform, fitting it within an organization’s data context. It also separates data infrastructure from data usage, reduces the time to achieve insights, and enables the benefits of scalable data infrastructures that are located anywhere.

Returning to his thesis, Bagnall explained that ETL's role in a data fabric boils down to these four pillars:

  1. Integration
  2. Quality
  3. Security
  4. Governance

Beginning with integration: in a data fabric architecture, data streaming and batch execution are integrated into the ETL pipeline. This allows real-time processing and periodic batch analysis to coexist, enhancing agility and responsiveness across the organization. It also lends itself to scalability and flexibility, which is essential for data fabric integration. Automation and orchestration capabilities further streamline the integration process and minimize manual intervention.
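
In code, that coexistence can be as simple as routing both paths through one shared transformation. The sketch below is purely illustrative—the function and field names are invented, not Matillion's API:

```python
# Illustrative sketch: the same transform serves both the streaming and
# batch paths, so real-time and periodic processing coexist in one pipeline.
def transform(record):
    """Shared business logic applied to every record, whichever path it took."""
    return {**record, "amount_cents": int(round(record["amount"] * 100))}

def process_batch(records):
    """Periodic batch execution over an accumulated set of records."""
    return [transform(r) for r in records]

def process_stream(record_iter):
    """Real-time path: yield transformed records as they arrive."""
    for record in record_iter:
        yield transform(record)

batch_out = process_batch([{"id": 1, "amount": 9.99}])
stream_out = list(process_stream(iter([{"id": 2, "amount": 1.50}])))
```

Because both paths call the same `transform`, logic changes propagate to real-time and batch consumers alike—one reason this pattern scales well in a fabric.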

As far as quality goes, data cleansing ensures that errors are corrected, data formats are standardized, and missing data is addressed by filling in the blanks or flagging records for review. ETL tools further incorporate data validation, verifying the accuracy, completeness, and consistency of data before business use.
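A minimal sketch of that cleansing-and-validation step—record shapes, field names, and date formats are invented for illustration:

```python
from datetime import datetime

# Hypothetical raw extract showing the defects described above:
# inconsistent formats, missing values, and unparseable fields.
raw_records = [
    {"customer_id": 101, "signup_date": "2024-05-08", "region": "us-east"},
    {"customer_id": 102, "signup_date": "05/09/2024", "region": "US-EAST"},
    {"customer_id": None, "signup_date": "2024-05-09", "region": "eu-west"},
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_date(value):
    """Standardize dates to ISO format; return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except (TypeError, ValueError):
            continue
    return None

def cleanse(record):
    """Standardize formats and flag records needing manual review."""
    cleaned = {
        "customer_id": record["customer_id"],
        "signup_date": parse_date(record["signup_date"]),
        "region": (record["region"] or "").lower(),
    }
    # Validation: flag incomplete rows instead of silently dropping them.
    cleaned["needs_review"] = (
        cleaned["customer_id"] is None or cleaned["signup_date"] is None
    )
    return cleaned

cleaned_records = [cleanse(r) for r in raw_records]
valid = [r for r in cleaned_records if not r["needs_review"]]
```

Note that flagging (rather than deleting) bad records preserves them for the human review Bagnall mentioned.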

Regarding the third pillar, security, ETL processes must ensure that data is encrypted when stored and securely transferred to end-user apps. Additionally, masking and tokenization aid in dynamically modifying the visibility of sensitive data based on user access level.
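A sketch of role-based masking and tokenization, assuming a simple two-role model invented for illustration; a hash stands in for what would be a real tokenization service in practice:

```python
import hashlib

def tokenize(value: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token.
    (A truncated SHA-256 stands in for a real token vault here.)"""
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:12]

def mask_email(email: str) -> str:
    """Hide most of the local part of an email address."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def apply_security(record: dict, role: str) -> dict:
    """Dynamically adjust visibility of sensitive fields by user role."""
    if role == "admin":
        return dict(record)  # full visibility for privileged users
    return {
        "email": mask_email(record["email"]),
        "ssn": tokenize(record["ssn"]),
    }

row = {"email": "jane.doe@example.com", "ssn": "123-45-6789"}
analyst_view = apply_security(row, role="analyst")
```

The same record yields different views per role, which is what "dynamically modifying visibility" amounts to in practice.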

Finally, for governance, ETL tools offer data lineage functionality that systematically records the transformation and movement of data throughout its lifecycle.
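Lineage capture can be sketched as a log that every pipeline step appends to; the step and dataset names below are illustrative:

```python
from datetime import datetime, timezone

# Minimal lineage sketch: each transformation records what it read,
# what it wrote, and when—so any dataset can be traced to its sources.
lineage_log = []

def record_lineage(step, inputs, outputs):
    lineage_log.append({
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

record_lineage("extract", inputs=["crm.customers"],
               outputs=["staging.customers_raw"])
record_lineage("cleanse", inputs=["staging.customers_raw"],
               outputs=["staging.customers_clean"])
```

Chaining the `outputs` of one entry to the `inputs` of the next is what lets a governance tool walk a dataset's full history.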

Each of these pillars is powered by one central thing, Bagnall explained: metadata.

Types of metadata captured in ETL include:

  • Passive metadata, i.e., dataset names, source systems, creation dates
  • Active metadata, i.e., usage statistics, performance metrics, data quality scores

Active metadata, Bagnall emphasized, is a core focus for the intersection of ETL and data fabric architectures. It reports back to the ETL tool for optimization purposes, such as monitoring growing data volumes and processing times, creating a circular flow and usage of critical data.
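The distinction can be made concrete with two illustrative records, plus a toy feedback rule showing how active metadata might steer the pipeline—all field names, figures, and thresholds here are invented:

```python
# Passive metadata: static facts about a dataset.
passive_metadata = {
    "dataset_name": "customers_clean",
    "source_system": "crm",
    "created_at": "2024-05-08T09:30:00Z",
}

# Active metadata: operational signals that flow back to the ETL tool.
active_metadata = {
    "dataset_name": "customers_clean",
    "rows_processed": 1_250_000,
    "avg_runtime_seconds": 42.5,
    "quality_score": 0.97,
}

def should_scale_up(active, row_threshold=1_000_000):
    """Toy feedback rule: once volumes cross a threshold, the pipeline
    could switch to a larger compute tier (the circular flow in action)."""
    return active["rows_processed"] > row_threshold
```

Passive metadata mostly serves discovery and cataloging; active metadata closes the loop, which is why Bagnall called it the crux of ETL and data fabric.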

“[Active metadata is] about being able to optimize the pipeline itself,” said Bagnall. “It is the crux of ETL and data fabric.”

Once the pipelines are established, any new data should be seamlessly absorbed into the fabric. Ideally, it would resemble the following flow of steps:

  1. New data source introduced
  2. Orchestration pipeline extracts and loads
  3. Rules are applied
  4. Transformations completed
  5. Pipeline executed, metadata is created and pushed
  6. Semantic layer populated via cataloging and lineage
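The six steps above can be sketched end to end; the helper names and the sample rules are hypothetical, kept deliberately small so the orchestration stays visible:

```python
def extract_and_load(source):
    """Step 2: the orchestration pipeline extracts and loads raw data."""
    return {"source": source, "rows": [{"id": 1, "value": " Acme "}]}

def apply_rules(batch):
    """Step 3: apply rules (here, a stand-in: trim stray whitespace)."""
    for row in batch["rows"]:
        row["value"] = row["value"].strip()
    return batch

def transform(batch):
    """Step 4: complete transformations (here, normalize to uppercase)."""
    for row in batch["rows"]:
        row["value"] = row["value"].upper()
    return batch

def run_pipeline(source, catalog):
    """Steps 1-6: execute the pipeline, emit metadata, populate the catalog."""
    batch = transform(apply_rules(extract_and_load(source)))
    # Step 5: pipeline executed; metadata is created and pushed.
    metadata = {"dataset": source, "row_count": len(batch["rows"])}
    # Step 6: semantic layer populated via cataloging and lineage.
    catalog[source] = metadata
    return batch

catalog = {}
result = run_pipeline("new_source", catalog)
```

A new source only needs to enter at step 1; everything downstream—rules, transforms, metadata, cataloging—absorbs it automatically, which is the "seamless" quality the flow describes.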

As in most areas, AI presents unique opportunities for an ETL-powered data fabric: incorporating unstructured data, detecting anomalies, and using natural language to create pipelines are a few examples.

Bagnall offered the following key takeaways when considering data fabric adoption:

  • Adapt existing data pipelines
  • Standardize ingestion protocols
  • Implement data profiling and quality checks
  • Enhance metadata management
  • Adhere to metadata standardization

Many Data Summit 2024 presentations are available for review at