
Run DBT 2x faster

Published on
22 Jul 24

Big DBT jobs involve a lot of queries. Here's what a typical job might look like:

A DAG of a DBT job

This representation of the queries in a job is called a DAG, or a directed acyclic graph. A DAG is a good way to represent dependencies between different queries. We can use it to understand how much work can be done in parallel, which in turn lets us speed up the job. In a typical DBT job, most queries don't depend on each other, so we can run many queries in parallel to get big jobs done quickly.
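To make this concrete, here's a minimal Python sketch of the idea (the model names and structure are made up for illustration, not taken from a real job): a DAG as a mapping from each query to its dependencies, plus a helper that finds every query that's ready to run.

```python
# A toy DAG: each model maps to the set of models it depends on.
# (Model names are made up; real DBT jobs have hundreds of these.)
dag = {
    "stg_orders":    set(),
    "stg_customers": set(),
    "stg_payments":  set(),
    "orders":        {"stg_orders", "stg_payments"},
    "customers":     {"stg_customers", "orders"},
}

def runnable(dag, completed):
    """Models whose dependencies have all finished and which haven't run yet."""
    return {m for m, deps in dag.items() if m not in completed and deps <= completed}

print(runnable(dag, set()))
# {'stg_orders', 'stg_customers', 'stg_payments'} -- all three can run in parallel
```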

That said, if you're a DBT user you'll know that big jobs are not exactly fast. With so much parallelizable work, why do you still have to wait hours for things to run?

It’s not DBT's fault - it's Snowflake's. While DBT knows how your job is structured, Snowflake doesn't, which means that when you schedule your jobs on a standard Snowflake warehouse you get a fixed amount of parallelism. You can, of course, speed things up by using a larger warehouse, but you’ll pay for the privilege.

You might think this isn't a problem if you pay for Snowflake’s multi-cluster support, but that's not always the case. While multi-cluster warehouses provide some scheduling flexibility, they still use a one-size-fits-all algorithm that doesn't take the structure of your DBT job into account. It turns out that you can do much better by analyzing your job.

Let’s take this simple job as an example:

A 6-query DAG

With this job, DBT will analyze the DAG, notice that A, B, D, and E have no dependencies, and schedule all four immediately.

When run on a standard warehouse, these queries split the resources of a single cluster, and the whole job takes 12 minutes. You should instead split A, B, D, and E across two separate warehouses, getting all four done in the first 4 minutes. This reduces the total runtime of the job by 33%, to 8 minutes.

A multi-cluster warehouse likely wouldn’t help here. Because Snowflake’s algorithm is unaware of your overall DBT job, it may just put all these queries on the same cluster, performing no better than the standard warehouse. Or it may put 3 queries on one cluster and the fourth on its own cluster. This is faster than using one cluster (the job would run in 10 minutes instead of 12), but it's also more expensive, because the second cluster will be underutilized with just one query. Even worse, it could put all 4 queries on one cluster and then spin up a second cluster without even using it, costing you money with no improvement in runtime.
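To see how much placement matters, here's a rough Python simulation of the example. The model is deliberately simplistic: queries running on the same warehouse split its capacity evenly, and the per-query work amounts are hypothetical, picked so the toy numbers line up with the runtimes quoted above.

```python
# Event-driven simulation of the 6-query example. Queries on the same
# warehouse split its capacity equally; "work" is minutes at full-cluster
# speed. These work values are hypothetical, chosen to reproduce the
# runtimes quoted in the text.
work = {q: 2 for q in "ABCDEF"}
deps = {"A": set(), "B": set(), "C": {"A", "B"},
        "D": set(), "E": set(), "F": {"C", "D", "E"}}

def runtime(deps, work, placement):
    """Total job runtime given a mapping from query to warehouse."""
    remaining, done, t = dict(work), set(), 0.0
    while remaining:
        # Queries whose dependencies have finished can make progress.
        active = [q for q in remaining if deps[q] <= done]
        # n concurrent queries on one warehouse each run at speed 1/n.
        load = {w: sum(placement[q] == w for q in active) for w in placement.values()}
        rate = {q: 1.0 / load[placement[q]] for q in active}
        # Jump ahead to the next query completion.
        dt = min(remaining[q] / rate[q] for q in active)
        t += dt
        for q in active:
            remaining[q] -= dt * rate[q]
            if remaining[q] < 1e-9:
                del remaining[q]
                done.add(q)
    return t

one_wh = {q: 0 for q in deps}
two_wh = {"A": 0, "B": 0, "C": 0, "F": 0, "D": 1, "E": 1}
skewed = {"A": 0, "B": 0, "C": 0, "D": 0, "F": 0, "E": 1}   # 3 + 1 split
print(runtime(deps, work, one_wh))  # 12.0 -- one standard warehouse
print(runtime(deps, work, two_wh))  # 8.0  -- independent queries split evenly
print(runtime(deps, work, skewed))  # 10.0 -- the 3-and-1 multi-cluster split
```

Playing with `placement` shows why DAG-aware scheduling wins: the same queries on the same hardware can take anywhere from 8 to 12 minutes depending purely on where each query lands.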

(This example assumes that your queries parallelize well on the cluster size you're using. Figuring out when that's true, and what size cluster to use, are topics for a different post.)

We’ve found that extending this analysis to optimize throughput on real DBT DAGs lets us improve runtimes significantly on both standard and multi-cluster warehouses. Our customers are using this in production today to run DBT jobs up to 50% faster with no increase in costs.

Want to try it out? Contact us: dbt-accelerator@espresso.ai

Frequently Asked Questions

What is DBT?

DBT is an open-source framework for managing data transformations within cloud data warehouses. It allows data teams to write modular SQL queries, automatically handle dependencies between datasets, and execute jobs in a structured pipeline. DBT uses a directed acyclic graph (DAG) to organize transformations, making it easier to manage complex data workflows and ensure reproducibility.

What is a DAG?

A directed acyclic graph (DAG) is a graph-based structure used in DBT to represent the dependencies between SQL queries, or models. Each node is a query, and each directed edge indicates a dependency. Since the graph has no cycles, it clearly shows which queries can run in parallel and which must wait for upstream tasks to complete. DAGs are critical for optimizing data pipeline execution.

What is a multi-cluster warehouse?

A multi-cluster warehouse in Snowflake automatically scales by adding or removing compute clusters based on query load. However, it does not take the structure of DBT DAGs into account when scheduling queries. This can result in suboptimal resource allocation, where clusters are underutilized or spun up unnecessarily, increasing costs without proportionate performance gains.

What is query parallelization?

Query parallelization is the practice of running multiple SQL queries simultaneously to reduce the overall execution time of a DBT job. Since many DBT queries have no dependencies on each other, parallelizing them can significantly speed up data transformation pipelines. Effective parallelization requires understanding the DAG structure to ensure independent queries are executed concurrently without resource bottlenecks.
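For a concrete (if simplified) picture of the idea, here's a sketch that walks a DAG level by level, running each level's independent queries concurrently. `run_query` is a hypothetical stand-in for submitting SQL to the warehouse, and a real scheduler would launch each query the moment its dependencies finish rather than waiting for whole levels:

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(name):
    # Hypothetical stand-in for submitting a query to the warehouse.
    print(f"running {name}")

def run_dag(dag, max_workers=4):
    """Execute a DAG of queries, running each batch of independent queries in parallel."""
    completed = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(completed) < len(dag):
            ready = [q for q, deps in dag.items()
                     if q not in completed and deps <= completed]
            # Everything in `ready` is mutually independent, so run it all at once.
            list(pool.map(run_query, ready))
            completed.update(ready)

run_dag({"A": set(), "B": set(), "C": {"A", "B"}})
```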

Alex Kouzemtchenko
Co-Founder and CTO
