In this lesson we will:
- Consider the problems with the existing approach to ETL;
- Explain how dbt solves them and the benefits and value that dbt brings to data teams;
- Consider how dbt supports Software Engineering best practices.
Before discussing dbt in more detail, it is worth considering some of the problems with the approach and tooling that existed before dbt's arrival.
Some of the most common challenges that we see include:
- ETL processes are often fragile and unreliable;
- ETL tools are often GUI based and proprietary, require special skills and have high license fees;
- ETL processes are usually managed by centralised data teams which become a bottleneck and source of delay;
- ETL changes required by the business often end up being slow to implement which delays downstream data consumers;
- The traditional ETL approach does not support technical best practices such as testability, traceability and modularity.
In short, existing approaches to ETL lack speed, quality, agility and flexibility. At a time when businesses are looking to achieve more with their data, better approaches are certainly needed. Fortunately, dbt is one such tool to enable this.
Historically, ETL was managed outside of the Data Warehouse using a third-party proprietary tool. The main architectural change that dbt makes is to move the transformation step inside the Data Warehouse, where it can be executed after the data has been loaded. The process and the acronym therefore change from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform).
This two-character change sounds simple, but it has many positive implications:
- Carrying out the transformations within the database is simpler as there is less external technology to implement, manage and learn;
- Transformations can be defined in open standard SQL rather than some proprietary or GUI based tool;
- SQL based transformations can be checked into source control, tested, and turned into reusable components making the process more rigorous and repeatable;
- The transformations will likely be faster to run because they benefit from the full power of the warehouse;
- We can make use of the full capability of the database including the security model, user defined functions and cloud elasticity when building and running the transformation code;
- Both source data and transformed data can be retained in the database for different use cases.
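As a sketch of what this looks like in practice, a dbt transformation is simply a SQL SELECT statement stored as a file in the project; the schema, table and column names below are illustrative, not from any real project:

```sql
-- models/staging/stg_orders.sql
-- A hypothetical dbt model: a plain SQL SELECT that dbt will
-- materialise inside the warehouse as a view or table.
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount
from raw.orders          -- data already loaded into the warehouse
where amount is not null -- light cleansing happens in the T step
```

Because this is standard SQL in a text file, it can be reviewed, versioned and tested like any other source code.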
This shift from ETL to ELT has been made viable by industry developments such as cheaper cloud storage, as provided by AWS S3, and the separation of storage and compute offered by databases such as Snowflake. Indeed, though dbt can be run against many legacy data warehouses, it is in these modern data stacks that ELT and dbt are finding their natural home.
In addition to being a step change in how we do ETL, dbt is said to bring the practices of software developers to the data realm, implementing a controlled software development lifecycle and set of patterns. For example:
- The scripts which describe the dbt transformations are based on SQL and stored as simple text files and configuration files. These can therefore be stored in a source code management system, included in branching strategies and code reviews, and versioned properly so we have repeatable builds and deployments;
- dbt scripts can be run using a lightweight command line interface. This could happen on the developer's desktop, the analyst's desktop, or be incorporated into CI/CD deployment pipelines and run with something like Jenkins. They are not trapped in some proprietary tool;
- dbt incorporates testing and assertion frameworks to ensure that the transformation runs successfully and produces correct and consistent idempotent results on each run;
- dbt models can be chained together into pipelines with dependencies, meaning that one step can only run if the preceding one is successful. Again, this can be incorporated into the CI/CD process for increased reliability and robustness as we promote changes through environments;
- dbt is aware of environments and profiles, meaning that we can easily execute our scripts against development, test and production environments in a controlled way. (Integrating with something like Snowflake zero copy clones really makes production realistic testing of data transformations a reality);
- In addition to the automation and SDLC benefits, dbt also supports more collaborative ways of working and breaks the dependency and bottleneck on central data teams. For instance, because dbt is based around SQL and simple configuration files, we can break out of the central data team, and allow different teams to have access to and potentially take ownership of their dbt scripts.
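To illustrate the testing point, dbt lets us declare assertions in YAML alongside the models, using its built-in tests such as `unique`, `not_null` and `relationships`; the model and column names here are hypothetical:

```yaml
# models/staging/schema.yml
# Hypothetical test declarations: `dbt test` will check these
# properties against the data after each run.
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

On the command line, `dbt run` builds the models and `dbt test` executes these assertions, so both steps fit naturally into a CI/CD pipeline.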
All in all, dbt is a great tool for teams that want to manage their data using high-quality engineering practices, following a similar journey to the one software engineers have taken over recent years. In summary, with dbt:
- The model is defined in a plain text script, which can be source controlled and brought into the development process;
- Models can be chained together and executed in order where we have dependencies;
- The data in the model can be tested in a repeatable way after each dbt run;
- We can use Jinja directives to add logic and intelligence into how our models are built, for instance building models incrementally or choosing whether to materialise them as tables or views;
- All of the above is standardised according to the dbt way. As dbt becomes a de facto standard, any analytics engineer can immediately begin working with the project.
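The Jinja point above can be sketched with an incremental model, using dbt's `config`, `ref`, `is_incremental` and `this` constructs; the model and column names are again illustrative:

```sql
-- models/marts/fct_orders.sql
-- A hypothetical incremental model: the config block chooses the
-- materialisation, and the Jinja if-block makes subsequent runs
-- append only new rows rather than rebuilding the whole table.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than those
  -- already present in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

The `ref()` call is also what lets dbt infer the dependency graph between models and run them in the correct order.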