In this lesson we will:
- Learn about dbts profiles;
- Demonstrate how to configure them using the profiles.yml file.
The main purpose of dbt profiles is to describe and configure the connection details for the databases that you work with, including the database type, hostname, username and password for instance.
In some cases, you may only work with one database. In others, you might have multiple projects that you are working on, multiple development, test and production environments, and even different global instances of these environments. In these situations, profiles become a more important piece of your dbt deployment.
In addition, the dbt profile allows you to configure minor local settings such as your log file paths and how dbt creates it's artifacts which you may need to vary on a case by case basis.
Every dbt project has an associated profile, which is specified in the dbt_config.yml file at the root of your dbt project.
# This setting configures which "profile" dbt uses for this project. profile: 'ecommerce_analytics'
This entry refers to a profile that is configured in a file in the developers home directory:
If we were to create a new project named ecommerce_analytics, and choose Postgres as the destination connector, a dbt profile would be created in the profiles.yml folder with placeholder for development and test databases.
ecommerce_analytics: outputs: dev: type: postgres threads: [1 or more] host: [host] port: [port] user: [dev_username] pass: [dev_password] dbname: [dbname] schema: [dev_schema] prod: type: postgres threads: [1 or more] host: [host] port: [port] user: [prod_username] pass: [prod_password] dbname: [dbname] schema: [prod_schema] target: dev
If we were to create a second dbt project on the local machine, a second entry would be created and appended to the profile file. In the example below we created a project named ecommerce_finance and requested a Snowflake connection.
ecommerce_finance: outputs: dev: # User-Password config type: snowflake account: [account id + region (if applicable)] user: [username] password: [password] role: [user role] database: [database name] warehouse: [warehouse name] schema: [dbt schema] threads: [1 or more] client_session_keep_alive: False prod: # Keypair config type: snowflake account: [account id + region (if applicable)] user: [username] role: [user role] private_key_path: [path/to/private.key] private_key_passphrase: [passphrase for the private key, if key is encrypted] database: [database name] warehouse: [warehouse name] schema: [dbt schema] threads: [1 or more] client_session_keep_alive: False
As you can see, placeholders are created in square brackets which the developer would use to specify the connection details.
Many people new to dbt would expect to include the profile details with the project. However, there are a number of good reasons for managing them outside of the model code:
Profiles may be specific to an individual user. You may for instance have the common workflow where seperate developers work in their own databases during development, then progress to shared databases during test and production. In this case, developers need the ability to run their dbt models against their own database instance;
Profiles may contain sensisitive information such as usernames and passwords. It therefore makes sense to maintain them outside of the main source control repository and store in the developers home folder or another secure location.
The connection details are a seperate concern to the model code which you write. You may for instance wish to run the same set of models in seperate development, test and production accounts and in seperate database instances around the business;
Profiles are just one example as to how dbt helps data engineers work more like developers, supporting practices such as the adoption of development environments and seperation between configuration and code.
At the top of the dbt_profiles.yml file, we can also specify some minor global configuration setting which specify things such as how dbt logs it's results and whether or not it uses colours at the terminal.
config: send_anonymous_usage_stats: <true | false> use_colors: <true | false> partial_parse: <true | false> printer_width: <integer> write_json: <true | false> warn_error: <true | false> warn_error_options: <include: all | include: [<error-name>] | include: all, exclude: [<error-name>]> log_format: <text | json | default> debug: <true | false> version_check: <true | false> fail_fast: <true | false> use_experimental_parser: <true | false> static_parser: <true | false>
Again, these are user specific concerns rather than being tied the specific project, so it makes sense to manage these outside of the model code.
These settings are described in more detail here.
In the examples below we have created new projects and created the profile placedholders as a side effect of that.
However, it is also possible to checkout an existing project and use dbt init to build the profile.
rm -rf ~/.dbt/profiles.yml cd ecommerce_analytics dbt init
This will give us a correctly formatted profiles.yml script.
Out of the box, this would be the standard template which still needs fully populating. However, the developer of the model project can specify defaults and prompt for information using the profile_template.yml file in order to ensure that the profile is correctly configured.
After changing setting in the profiles.yml file, it can be useful to re-validate these connection details with dbt debug. As well as checking the file is correctly formatted, dbt debug will also attempt to connect to the databases to confirm we have correct credentials in place.
Which should output that all three database connections are healthy:
Running with dbt=0.21.0 dbt version: 0.21.0 python version: 3.8.10 python path: /usr/bin/python3 os info: Linux-5.10.47-linuxkit-x86_64-with-glibc2.29 Using profiles.yml file at /root/.dbt/profiles.yml Using dbt_project.yml file at /ecommerce_analytics/dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: localhost port: 5432 user: postgres database: pizzastore schema: dev_pizzastore search_path: None keepalives_idle: 0 sslmode: None Connection test: [OK connection ok] All checks passed!
We can also validate other targets like so:
dbt debug -t prod
Running with dbt=0.21.0 dbt version: 0.21.0 python version: 3.8.10 python path: /usr/bin/python3 os info: Linux-5.10.47-linuxkit-x86_64-with-glibc2.29 Using profiles.yml file at /root/.dbt/profiles.yml Using dbt_project.yml file at /ecommerce_analytics/dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: localhost port: 5432 user: postgres database: pizzastore schema: prod_pizzastore search_path: None keepalives_idle: 0 sslmode: None Connection test: [OK connection ok] All checks passed!
In the dbt profiles described above, note that we can specify a number of threads for each database target. This is the maximum number of parallel activities that dbt will execute at any one time.
Though executing more work in parallel is generally a good thing, increasing this to an arbitrarily high number isn't necessarily the best thing to do. It will add extra load to the machine running the dbt client, the data warehouse, and perhaps even result in a net slowdown due to too much parallel work. In environments such as Snowflake where we have conumption based billing, this may also have a cost implication if we need to create more higher powered virtual warehouses.
This said, increasing the number of threads is a valid option if you are looking to improve the speed and latency of your data transformations.
Note that your dbt transformation may consist of a set of dependent steps. Threads will respect these dependencies meaning that you may also be limited in the amount of parralelism that can be achieved if you have a long chain of dependent transformations to resolve.