In this lesson we will:
- Create our first dbt project and show how it can be administered and validated at the command line.
We can begin by creating a new project with the dbt init command.
dbt init pizzastore_analytics
The CLI will ask which database the project will connect to. The list presented will depend on which specific connectors you have installed.
Which database would you like to use?
[1] clickhouse
[2] postgres
For now, enter 2 to select postgres.
Creating the project should produce a successful output such as:
Your new dbt project "pizzastore_analytics" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

https://community.getdbt.com/

Happy modeling!
As the output points out, one of our important next steps is to configure the dbt profile to describe which data warehouse to connect to and how.
In this instance, because we are using Postgres, we would need to set the connection details for our Postgres instance such as the host, port, username and password.
However, before doing that, let's first explore the project structure which has just been created:
cd pizzastore_analytics
ls -la
Which should give us something like this:
drwxr-sr-x 8 root root 4096 Nov 24 15:06 .
drwxr-xr-x 1 root root 4096 Dec  6 14:02 ..
-rw-r--r-- 1 root root   28 Nov 24 15:06 .gitignore
-rw-r--r-- 1 root root  571 Nov 24 15:06 README.md
drwxr-sr-x 2 root root 4096 Nov 24 15:06 analysis
drwxr-sr-x 2 root root 4096 Nov 24 15:06 data
-rw-r--r-- 1 root root 1339 Nov 24 15:06 dbt_project.yml
drwxr-sr-x 2 root root 4096 Nov 24 15:06 macros
drwxr-sr-x 3 root root 4096 Nov 24 15:06 models
drwxr-sr-x 2 root root 4096 Nov 24 15:06 snapshots
drwxr-sr-x 2 root root 4096 Nov 24 15:06 tests
The folders created here contain the following:
- Models - Models are your core transformations, which take source data and output destination tables or views;
- Analysis - Analyses are temporary models which we need as intermediate steps, but which are not persisted to the database;
- Data - Sometimes you will need static and seed data for your transformations;
- Macros - Macros are reusable code blocks which can be shared by multiple transformations;
- Snapshots - Snapshots capture the state of the database at points in time;
- Tests - Tests verify the correctness of your transformations.
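To make the models folder concrete: a model is simply a SELECT statement saved as a .sql file under models/. A minimal sketch is below; the file name and source table are hypothetical, not part of the generated project:

```sql
-- models/orders_per_day.sql (hypothetical file name)
-- dbt wraps this SELECT in a CREATE TABLE/VIEW ... AS statement when the model runs.
select
    order_date,
    count(*) as total_orders
from raw_orders
group by order_date
```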
We will learn more about all of these elements of dbt throughout the remainder of the course, but it is worth familiarising yourself with the project structure at this stage.
The dbt_project.yml file sits in the root of the project, and allows you to specify project-level details such as the paths where various components are found.
Open the dbt_project.yml file in your preferred text editor, and you will be presented with the following:
name: 'my_new_project'
version: '1.0.0'
config-version: 2

profile: 'default'

source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_modules"

models:
  my_new_project:
    example:
      +materialized: view
It is worth replacing the project name my_new_project with our project name pizzastore_analytics, both in the name field and in the key under the models: block. For some reason, dbt does not set this automatically.
The main reason we would usually touch this file is to set model-specific configuration in the models: section. However, there are other, often better, ways to configure your models, meaning that this file is not always used.
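For instance, one alternative we will cover later in the course is setting the configuration inside the model file itself using dbt's config macro. A sketch, with a hypothetical model and source table:

```sql
-- models/example/orders_summary.sql (hypothetical model name)
-- Overrides the materialization from dbt_project.yml for this model only.
{{ config(materialized='table') }}

select
    order_id,
    count(*) as item_count
from raw_orders
group by order_id
```

Keeping configuration next to the model it affects tends to be easier to maintain than a growing models: block in dbt_project.yml.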
For now, we can exit the file.
dbt debug is a useful command to run at this stage, to check that your project is configured correctly and that you have connectivity to the databases defined in the linked profile.
dbt debug
Running with dbt=0.21.0
dbt version: 0.21.0
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.10.47-linuxkit-x86_64-with-glibc2.29
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /pizzastore_analytics/dbt_project.yml

Configuration:
  profiles.yml file [ERROR invalid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [ERROR]

2 checks failed:
Profile loading failed for the following reason:
Runtime Error
  Credentials in profile "default", target "dev" invalid: ['dbname'] is not of type 'string'
As you can see, this provided a useful summary of the project and attempted to connect to the database to validate the connectivity details. Because we haven't yet configured our database or profile, we see an error.
So that we can progress, let's enter some database credentials into the profiles.yml file, which the dbt debug output located at /root/.dbt/profiles.yml.
Delete the existing content and paste these details into the file:
default:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: postgres
      pass: postgres
      dbname: pizzastore
      schema: dev_pizzastore
    prod:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: postgres
      pass: postgres
      dbname: pizzastore
      schema: prod_pizzastore
  target: dev
We will explain the contents of this file in more detail in the next lesson. For now, this step allows us to proceed with the lesson by having a well-formed profiles file.
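Note that the profile defines two targets, dev and prod, with the target: dev line making dev the default. Though we cover profiles properly in the next lesson, dbt commands accept a --target flag to switch between them; a sketch:

```shell
# Runs against the dev target (the default declared in profiles.yml)
dbt run

# Explicitly runs against the prod target instead
dbt run --target prod
```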
dbt parse is a useful command to use during your development workflow.
It will verify that all of your source code is correctly structured, including your profiles.yml file that we just populated.
Integrating frequent calls to dbt parse into your workflow can speed up the development cycle: rather than waiting for a long run to complete only to find that you accidentally introduced a typo into one of your transformation files, you catch the error immediately.
dbt parse
Running with dbt=0.21.0
14:28:13 | Start parsing.
14:28:13 | Dependencies loaded
14:28:13 | ManifestLoader created
14:28:14 | Manifest loaded
14:28:14 | Manifest checked
14:28:14 | Flat graph built
14:28:14 | Manifest loaded
14:28:14 | Performance info: target/perf_info.json
14:28:14 | Done.
As you can see, some timing information is also included, which may be useful if you have very large projects and wish to optimise the build time.
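A simple way to apply the parse-first habit is to chain the two commands with the shell's && operator, so that the slower run only starts if parsing succeeds:

```shell
# Only invoke dbt run if dbt parse exits successfully
dbt parse && dbt run
```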
dbt ls lists all of the "resources" defined in your project, where resources are things like models, tests, analyses and seed data. This can be a good way to get an overview of the entire project and to ensure that your expected resources are correctly identified by dbt.
dbt ls
my_new_project.example.my_first_dbt_model
my_new_project.example.my_second_dbt_model
my_new_project.schema_test.not_null_my_first_dbt_model_id
my_new_project.schema_test.not_null_my_second_dbt_model_id
my_new_project.schema_test.unique_my_first_dbt_model_id
my_new_project.schema_test.unique_my_second_dbt_model_id
These resources are example transformations and tests which were created when we ran dbt init. They are useful to copy from as you begin to build out your own code, but will likely be deleted early in the process.
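As the project grows, the full listing becomes noisy; dbt ls accepts flags to narrow it down, such as --resource-type and --select (treat the exact flags as an assumption to verify against your installed dbt version):

```shell
# List only models, omitting tests and other resource types
dbt ls --resource-type model

# List only the resources under the example folder
dbt ls --select example
```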