Building Modern Data Pipelines Using dbt—Tips and Tricks


Earlier this year, I had the privilege of serving on the organizing committee for the DataTune conference in my hometown of Nashville, Tenn. Unlike many database-specific or platform-specific conferences, which tend to focus on analytics and data science, DataTune covers a wide variety of topics such as database engineering, development, administration, and analytics. I was delighted to be joined on the committee by two friends, Dustin Dorsey and Cameron Cyr, authors of Unlocking dbt: Design and Deploy Transformations in Your Cloud Data Warehouse.

Dustin, Cameron, and I spent more than a few moments discussing the rising popularity of dbt (Data Build Tool). It was a conversation well worth having. dbt (notice that the lowercase letters are intentional) has gone from relative obscurity in 2021 to today’s de facto industry standard for data transformation and analytics engineering. dbt Labs has more than 4,000 customers for its commercial dbt Cloud product, and more than 25,000 companies use its open source dbt product to create and manage data pipelines.

Why Use dbt?

According to Dustin and Cameron, dbt is a data transformation tool that provides only the “T” for users of ETL products (ETL means extract, transform, and load). Data professionals are drawn to it because it yields several advantages, such as these:

  • Making existing data-dependent teams much faster. It’s easy to learn too: if you can write a basic SELECT statement in SQL, then you can use dbt (see the sketch after this list).
  • Bringing software development best practices to writing SQL. Although originally developed for data analysts, it is a strong data engineering tool.
  • Providing strong, centralized control, such as source control, modular code reusability/abstraction, built-in testing, built-in documentation, standardization, continuous integration and delivery (CI/CD), GitHub integration, environment isolation, and data lineage.
  • Being open source. For most data engineers, the open source, free version (dbt Core) has everything you need.
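
To make that “if you can write a SELECT statement, you can use dbt” point concrete, a dbt model is nothing more than a SELECT statement saved as a .sql file in the project; dbt takes care of materializing it as a view or table in the warehouse. A minimal sketch with illustrative names (stg_orders and the shop source are my own examples, and the source() call assumes the source is declared in a YAML file):

    -- models/staging/stg_orders.sql
    -- A dbt model is just a SELECT statement; dbt materializes the result
    -- as a view or table in the target warehouse.
    select
        id          as order_id,
        customer_id,
        order_date,
        status
    from {{ source('shop', 'orders') }}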

Coding in dbt

dbt is a framework to run SQL anywhere in your cloud or on-prem data warehouse infrastructure. It provides reusable and modular coding using macros in Jinja, a popular templating language used in the Python stack. As you change and evolve your data pipelines over time, you can track the data documentation and definitions using the data lineage graph feature. dbt Cloud also supports features such as triggers and scheduling to refresh data on whatever cadence is best for the business. This also means you won’t need a separate orchestration tool when using dbt Cloud.
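
To give a feel for that modular, Jinja-in-SQL style, here is a small sketch. The macro name cents_to_dollars and the model names are illustrative rather than taken from the article, and the ::numeric cast assumes a warehouse such as Postgres or Snowflake:

    -- macros/cents_to_dollars.sql: a reusable macro that any model can call
    {% macro cents_to_dollars(column_name) %}
        ({{ column_name }} / 100.0)::numeric(16, 2)
    {% endmacro %}

    -- models/marts/fct_payments.sql: a model that reuses the macro;
    -- the ref() call is what wires this model into the data lineage graph
    select
        payment_id,
        order_id,
        {{ cents_to_dollars('amount_cents') }} as amount_usd
    from {{ ref('stg_payments') }}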

Automated Testing

dbt comes prebuilt with automated testing for transformation tasks, including unique keys, NOT NULL columns, tests for referential integrity, and accepted value testing. It’s easy to write bespoke tests using a combination of Jinja and SQL. To run those tests, all you need is a reference to the core YAML configuration file that is also used to define your documentation for a given table or schema.
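
For example, the built-in tests mentioned above (unique, not_null, relationships for referential integrity, and accepted_values) are declared in that same YAML file alongside the documentation, and running dbt test executes everything declared there. A minimal sketch, with illustrative model and column names:

    # models/staging/schema.yml -- the same file carries documentation and tests
    version: 2

    models:
      - name: stg_orders
        description: "One row per order, cleaned up from the raw source."
        columns:
          - name: order_id
            description: "Primary key for orders."
            tests:
              - unique
              - not_null
          - name: customer_id
            tests:
              - not_null
              - relationships:          # referential integrity check
                  to: ref('stg_customers')
                  field: customer_id
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']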

The Importance of Modeling

The smart approach to data transformation boils down to having a strategy derived from data modeling. Your data model is crucial for code reuse, drilling down to granular data, and performance optimization. But dbt models are not simply models of your data sources. Instead, they should be a more semantic version of existing data models, focusing on the language and structure of the business. Good modeling is essential to the structure of your data pipeline project and to finding lasting success.

A key aspect of dbt semantic modeling rests on the use of incremental models. This feature allows you to efficiently update your data warehouse schemas by transforming and loading only the data that is new and/or changed since the last dbt run. This way, you don’t have to process the entire dataset with each and every run; instead, you can incrementally append new rows or alter only changed rows using the is_incremental() macro. The bottom line is that data loading is much faster than before, and the computational load of any given data pipeline is much lower.
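
As a concrete illustration, here is a minimal sketch of an incremental model. The model and column names (fct_events, stg_events, event_timestamp) are illustrative, not from the article:

    -- models/marts/fct_events.sql
    {{
        config(
            materialized='incremental',
            unique_key='event_id'
        )
    }}

    select
        event_id,
        user_id,
        event_type,
        event_timestamp
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
      -- On incremental runs, pull only rows newer than what is already in
      -- the target table; {{ this }} refers to that existing table.
      where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}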

Using the appropriate model, you can effectively support batch, near-real-time, and lambda data loading patterns. Having said that, dbt is, first and foremost, a batch-oriented tool and excels in those scenarios. However, when you want near real time and don’t want to learn something new such as Apache Kafka, you can apply lambda views to create a dbt project that combines batch and stream processing methods. These views provide the best of both worlds: incremental data processing plus strong historical context. You can also use materialized views for near-real-time data modeling.
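
As a rough sketch of the lambda-view pattern (model names are illustrative, and the exact cutoff logic varies by project), a lambda view is typically just a view that unions the batch-built table with the freshest rows read straight from the staging layer:

    -- models/marts/events_lambda.sql
    {{ config(materialized='view') }}

    -- Historical rows already processed by the batch/incremental model
    select event_id, user_id, event_type, event_timestamp
    from {{ ref('fct_events') }}

    union all

    -- Rows that have landed since the last batch run, read from the staging
    -- model so they are visible in near real time
    select event_id, user_id, event_type, event_timestamp
    from {{ ref('stg_events') }}
    where event_timestamp > (select max(event_timestamp) from {{ ref('fct_events') }})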

Alternatives to dbt

Since dbt is primarily for batch-oriented transformations, you might want to explore alternatives for real-time transformations. Here are two popular tools in this category:

  • Apache Kafka is a distributed streaming platform that allows you to publish and subscribe to data streams in real time. Commonly, it serves as the foundation of many real-time data pipelines and event-driven architectures.
  • Apache Flink is an open source stream processing framework that provides low-latency, high-throughput processing of real-time data streams, with support for event-time processing, state management, and windowing operations in both batch and streaming scenarios.

In addition, hyperscale cloud vendors each offer powerful, real-time streaming products that provide even more features, such as improved scalability, ease of use, and integration capabilities.

Common Pitfalls When Using dbt

As with many tools and strategies in the data and analytics space, the most common pitfalls in using dbt effectively usually involve a lack of deliberate planning or a failure to follow a well-defined strategy and workflow. For example, dbt users might face these types of issues:

  • Overlooking data quality
  • Ignoring downstream dependencies
  • Lack of documentation
  • Complexity in SQL logic
  • Not leveraging version control
  • Inefficient queries
  • Inadequate testing
  • Ignoring performance considerations
  • Not monitoring dbt runs
  • Relying solely on incremental models, since they are not a one-size-fits-all approach

These issues are not unique to dbt pipeline builders; most data analysis work faces them too. However, they are easily solved by applying an extra bit of rigor and planning to the process, which pays off in long-lasting and resilient data pipelines.

Evaluating dbt

Is dbt a perfect data transformation product? No. Many experts using dbt have their own opinions about where dbt needs the most improvement. Having said that, it is one of the most promising new tools in the data engineering profession. Anyone resisting dbt is probably stuck in the older ways of building data pipelines or strongly attached to “old” ETL tools such as Talend, Oracle Data Integrator, Microsoft SSIS, and the like. When data professionals have such a strong affiliation with more traditional tools and approaches, they often fail to see the big picture, which shows the acceleration and power dbt can bring to their organization.

If you don’t know anything more about dbt than what I’ve written here, take a look at the vendor documentation at www.getdbt.com/product/what-is-dbt. Even better, talk to the dbt community. The dbt official forum is located at https://discourse.getdbt.com. And you can register to join the dbt data engineering community at www.getdbt.com/community/join-the-community. Enjoy.

