Getting started with Data Build Tool (dbt) on Google Cloud

Data Build Tool (dbt)

dbt (Data Build Tool) is an open-source Python application that lets data engineers and analysts transform data in their warehouses using modular SQL queries. It takes care of the ‘T’ in ELT and handles the data engineering that comes before analytics. It does not handle the extraction or loading of data and assumes that the raw data is already loaded into the data warehouse.

dbt (also called dbt Core) has a CLI (Command Line Interface) which can be run from a terminal program and provides most of the functionality: it compiles SQL select statements and runs them against the user’s data warehouse to transform the data. Because a dbt project is made up of plain SQL and YAML files, the SQL modules can be placed under version control, and dbt supports testing of both schema and data, introducing software engineering practices to the ELT space. dbt also has a UI, but it is for documentation purposes only.
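
For instance, a templated reference in one model resolves at compile time to a fully qualified table name before the query runs against the warehouse (a sketch, using the project and dataset from the demo below):

-- in a model file, another relation is referenced through a template:
select * from {{ ref('payments') }}

-- dbt compiles this to a fully qualified BigQuery reference, roughly:
select * from `prismatic-cider-279015`.`springml_demo_dataset`.`payments`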

dbt Cloud is a web-based application with an IDE that helps data teams of all sizes develop, run, and version control dbt projects. This hosted service offers additional functionality at a cost, such as scheduling dbt jobs, viewing logs, sharing documentation, and CI/CD, all in an easily operable IDE. It is the paid offering of dbt, which is otherwise a free, open-source tool.

Example

Here is a quick demo of how dbt works on the CLI, with data extracted and loaded into a data warehouse in BigQuery. We have created a sample project with an arbitrary sales dataset, which can be found in the data folder. We load this data into BigQuery under the dataset springml_demo_dataset, into the table sales.


Setup

Next, we set up dbt from a terminal program with an easy install via pip or conda, and a new project can be initialized with dbt init. dbt creates a configuration folder at ~/.dbt/ where we set up the profiles.yml file, which tells dbt how to connect to the user’s data warehouse. Here, we use a BigQuery configuration, but dbt offers connections to Snowflake, Redshift, and more. BigQuery targets can be specified using one of three authentication methods; we have used the OAuth method.
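
For example, installing the BigQuery adapter with pip and initializing a project might look like this (dbt-bigquery is the adapter package; the project name is arbitrary):

pip install dbt-bigquery
dbt init springml_demo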

The created project houses a number of folders or modules that make up the dbt project, like models, tests, macros, analysis, docs, and logs. We configure ~/.dbt/profiles.yml in the following way to create the profile springml.

springml:
    target: dev
    outputs:
        dev:
          type: bigquery
          method: oauth
          project: prismatic-cider-279015
          dataset: springml_demo_dataset
          location: US

The dbt_project.yml file provides further run-time configuration for the project, such as the profile to use and the source (model) and test paths.
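
A minimal dbt_project.yml along these lines would cover those settings (the project name is illustrative; dbt releases before 1.0 use source-paths instead of model-paths):

name: springml_demo
version: '1.0'
config-version: 2
profile: springml
model-paths: ["models"]
test-paths: ["tests"]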

Models

Entering the models/base folder in our sample project, we have model files named payments.sql and creditcard.sql that hold SQL select statements, which at run time are applied to our data in BigQuery to transform it and create new views and tables. Here, payments.sql creates a simple view of the payment types in the United States, and creditcard.sql arranges the data by credit card payments per country.

select payment_type,
   count(*) as number_of_payments,
   sum(price) as total_price
from {{ source('springml_demo_dataset','sales') }}
where country like "United States"
group by payment_type

creditcard.sql uses Jinja, a templating language whose integration with dbt makes it one of dbt’s most powerful features. Jinja lets users apply programming constructs that SQL does not normally offer, such as control structures (if statements and for loops) and environment variables for production deployment.

{% set creditcards = ["Amex", "Mastercard", "Visa"] %}

select country,
{% for creditcard in creditcards %}
{# comma after every column except the last keeps the compiled SQL valid #}
sum(case when payment_type = '{{ creditcard }}' then 1 else 0 end) as {{ creditcard }}_payments{{ "," if not loop.last }}
{% endfor %}
from {{ source('springml_demo_dataset','sales') }}
group by country
having not (Amex_payments = 0 and Mastercard_payments = 0 and Visa_payments = 0)
order by (Amex_payments + Mastercard_payments + Visa_payments) desc

The source('springml_demo_dataset', 'sales') function gives access to our BigQuery dataset, making springml_demo_dataset.sales our source table. The schema.yml file for this model folder defines the schema of our source tables under sources and the schema of the views to be created under models.
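
A schema.yml along these lines declares both (descriptions are illustrative; test declarations are covered in the next section):

version: 2

sources:
  - name: springml_demo_dataset
    tables:
      - name: sales

models:
  - name: payments
    description: Payment counts and totals by payment type for the United States
  - name: creditcard
    description: Credit card payments by country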

We now run dbt run --profile springml, which applies the SQL modules above to the source table, transforming the data to create (or replace) the views payments and creditcard in BigQuery. The command-line tool assesses the models and tests and creates views with respect to the target (here, only one target, dev, is defined in profiles.yml). These views, or data models, are the result of transforming our original sales data, and they can now be used to serve a specific analytics purpose.

Running on the command line: dbt run --profile springml

Tests

dbt allows automated regression testing so that we can make sure our SQL modules are carrying out transformations correctly, and that the source data and the resulting data are in the desired format. dbt offers two types of testing, both of which can be run with the command dbt test --profile springml:

  • Schema tests: assertions like unique, not_null, accepted_values, and relationships that are added to .yml files under the model to be tested. Running these tests returns the number of records that do not pass a given assertion. Here, we included some schema tests in our schema.yml file: we declared the unique test on the column country in the creditcard model, which behind the scenes counts how many times a value in that column is repeated. Zero records returned means the test passed (a sketch of this declaration follows the data test example below).
  • Data tests: custom SQL queries that can be found in the test directory, whose path is defined in the dbt_project.yml file. In this demo, we created a data test in the creditcardtest.sql file, which ensures that the creditcard model we created has at least one payment by some card in every country. If zero records are returned when this test runs, the test passed.
select country
from {{ ref('creditcard') }}
where Amex_payments = 0 and Mastercard_payments = 0 and Visa_payments = 0
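
The unique assertion from the first bullet could be declared in schema.yml with a fragment like this (a sketch using our demo’s model and column names):

models:
  - name: creditcard
    columns:
      - name: country
        tests:
          - unique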

Running on the command line: dbt test --profile springml

An individual model (or test) can be executed by including its name in the run command with the --select flag (older dbt releases call it --models). We can troubleshoot by looking in the target/compiled folder, which holds the actual SQL queries executed for the runs and tests.
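
For example, to run and test only the creditcard model:

dbt run --profile springml --select creditcard
dbt test --profile springml --select creditcard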

Documentation

We can generate documentation for the entire project and the associated data warehouse using dbt’s documentation tool by running dbt docs generate --profile springml. Furthermore, running dbt docs serve --port 8001 --profile springml opens up a web interface with a clear overview of all our source tables and transformations, as well as the data models and the way they are layered in the form of a DAG.

dbt documentation pages for the springml_demo_dataset source and the sales table

Conclusion

dbt provides a unique experience for data engineers and analysts, as it uses simple SQL select statements to carry out heavy transformations in a clean and cost-effective way. Its models execute quickly and are easy to modify and test. It allows layering of data models using the ref() function (as seen in creditcardtest.sql), which references previously created models and establishes dependencies between them that can be monitored in a clear lineage graph in dbt’s documentation UI.

dbt’s integration with Jinja, along with its support for version control and automated testing, makes the process of data transformation more robust through the inclusion of software engineering principles. Overall, the data build tool does a great job of helping anyone comfortable with SQL manage analytics engineering workflows.

Please check out our video blog, where we walk through a demonstration of the workflow above, or watch it directly on YouTube.

For more information about how SpringML can support your team with data modernization projects, feel free to contact us at [email protected] or find out more about our services at: https://springml.com/cloud-migration/