In this lesson we will:
- Explain how dbt supports testing of your models and transformation code;
Software Engineers usually make use of automated testing techniques to verify the correctness of their code. This includes unit testing to exercise individual classes and functions, and integration testing to verify that modules interact correctly end to end.
Historically, data teams haven't done as much automated testing, relying instead on manual testing phases before pushing changes to their data management logic into production.
Our experience shows that this hasn't been good enough to maintain quality. All too often, poor data is delivered into production environments. Rather than the data team identifying issues proactively, it is end users who report them, and they lose confidence in what they are seeing.
Fortunately, with the rise of Data Engineering, automated testing of data pipelines is experiencing renewed interest. This involves not only testing the finished dataset, but also testing that source data meets our expectations and that intermediate steps of our data transformation are built as we expect. And this isn't carried out as a one-off; instead, it is increasingly incorporated into the pipeline to run on an ongoing basis with each new incremental batch of data.
This increase in testing is in no small part due to dbt. dbt makes testing easy and incorporates it as a first-class concept in the workflow. If a team has adopted dbt to build their models, there really is no excuse not to extend this into the testing phase.
Tests in dbt involve querying our materialized models for records which break or violate our business rules. These tests are written in one of two ways.
Firstly, we can define them as properties, where they are defined in YAML and attached to our model definitions:
```yaml
tables:
  - name: trips
    description: All of the trips taken in New York
    columns:
      - name: trip_id
        description: ID of the trip
        tests:
          - unique
          - not_null
```
These tests are very fast and easy to write.
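Once defined, these property tests are executed via dbt's command line. A typical invocation looks like the following; note that the `--select` flag assumes a reasonably recent dbt version (older releases used `--models` instead):

```shell
# Run every test in the project
dbt test

# Run only the tests attached to the trips model
dbt test --select trips
```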
Secondly, we can define our tests in SQL, writing a query that returns any records violating a business rule. For instance, if we should never have an order with a negative value, we could write a test:
```sql
select * from orders where value < 0
```
When we define a test as a property within our YAML files, dbt builds the SQL for us during its compile phase. This means that all tests are effectively SQL tests - we just define them in a different way.
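As a sketch, the `not_null` property test on `trip_id` compiles to something like the query below. The schema name `analytics` is an assumption for the example, and the exact SQL dbt generates varies by version; the test passes when the query returns zero rows:

```sql
-- Records that fail the not_null test on trips.trip_id
select trip_id
from analytics.trips
where trip_id is null
```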
There are two types of tests, referred to as singular and generic tests.
Singular tests are the simplest kind: we simply write a query that identifies any failing records, place it in a file, and save it in the project's tests directory.
```sql
select * from orders where value < 0
```
Generic tests are written to test certain characteristics of your data. Rather than being placed in one test file, they can be parameterised and referenced by any model in your codebase.
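To illustrate, here is a sketch of a generic test; the test name `not_negative` and the file path are assumptions for this example. The test block takes the model and column as parameters, so any model in the codebase can reference it:

```sql
-- tests/generic/not_negative.sql (illustrative path)
{% test not_negative(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}
```

A model could then apply it from its YAML properties just like the built-in tests:

```yaml
columns:
  - name: value
    tests:
      - not_negative
```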
Generic and singular tests will be explained in more detail in subsequent lessons.
A good Data Engineering workflow should incorporate testing as part of your deployments.
Ideally, these tests would be run in multiple places in order to get the most value from them:
- On the developer's desktop as they are developing the code for their transformations;
- In a shared development or testing environment to ensure that we aren't introducing integration errors;
- In a test environment, with production-like data;
- In production, after the model has been deployed.
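As a sketch of how the shared-environment stage might be automated, a CI step could run the models and their tests on every proposed change. The workflow below uses GitHub Actions purely as an illustration; the adapter package, target name, and file path are all assumptions:

```yaml
# .github/workflows/dbt-ci.yml (illustrative)
name: dbt-ci
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-core dbt-postgres
      # dbt build runs the models and their tests together
      - run: dbt build --target ci
```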
At each stage, we increase our confidence that the transformations we are implementing are operating as expected. This will ultimately drive up quality in production and decrease the risk of providing bad or broken data to our stakeholders.