In this lesson we will:
- Demonstrate how Dagster can be used to orchestrate ClickHouse.
Dagster is an orchestration and automation platform that makes this kind of automation easier to write, run, and maintain. Dagster takes care of the following:
- Running tasks in an automated fashion;
- Combining jobs into end-to-end pipelines and graphs;
- Running graphs of tasks in the most efficient way;
- Providing monitoring and alerting capabilities;
- Automatically retrying failed jobs;
- Providing a visual interface for monitoring the data orchestration processes across the business.
Data orchestration is a very common requirement within businesses that have any kind of data and analytics capability.
Traditionally, they have delivered this using a combination of proprietary ETL tooling and bespoke scripting. These solutions are often difficult and risky to change, require significant maintenance, and can lack robustness and reliability. They also tend to lack important features such as error handling, retry logic, monitoring and secure role-based access control.
Using a platform like Dagster for orchestration avoids all of the above issues, giving your team a much more powerful and reliable platform. This means that data teams can focus purely on their own domain rather than building bespoke automation tools.
Other benefits of Dagster include:
- Python logic - The automation tasks are defined using plain Python. This means that they are open, portable and easier to understand than, say, the logic embedded within a legacy ETL tool;
- Software Development Lifecycle - By moving from a proprietary tool to code, we can benefit from developer-like practices including version control and unit testing;
- Clean and reusable code - Dagster encourages clean code, where each task is separately defined and abstracted;
- Improved reliability - Dagster takes over responsibility for running our execution graphs, improving the reliability of your data delivery;
- Better audit and logging - Dagster will provide better visibility of what actually happened to help with debugging and audit requirements;
- Single pane of glass - Dagster provides a very powerful administration GUI for full oversight of what is happening across all of the data orchestration workflows.
Many data teams are implementing Modern Data Stacks to provide their data and analytics capabilities.
Dagster can be thought of as the glue code, orchestrating data between the different phases of its journey.
Dagster itself fits our definition of a Modern Data Stack tool. It is open source, lightweight to run, very scalable, and can be used purely as a cloud-hosted SaaS solution if you do not wish to manage your own instance.
Dagster is not the only tool in the data orchestration space.
Apache Airflow is the most commonly deployed tool. Dagster has many similarities to Airflow, including the Python-based DAG, but Airflow is much older, which sometimes shows in its architecture.
The key differentiators of Dagster as we see them:
- Dagster introduces the concept of a data asset as a first class citizen, whereas Airflow is more of a job runner;
- Dagster makes it easier to develop and test your pipelines before moving them into production. This reduces the risk of production deploys.
Prefect also has market share in this area. Again, it is a Python-based platform which is modernising after the lessons of Airflow.
For SQL engineers or software developers moving into the data engineering field, workflow automation software such as Airflow and Dagster tends to be a new class of software.
For this reason, it is worth introducing the core concepts and terminology before moving forward.
Dagster introduces the concept of a data asset. An asset can be thought of as a piece of data that is stored on disk. This could represent source data, data which is in the middle of a transformation process, or data that is ready to be served to some consumer.
A Dagster graph will therefore create a series of assets as it moves through the graph. Some of these assets could be discarded when the pipeline completes successfully.
When producing an asset, it may be useful to capture metadata such as the number of rows or average price of an order. This can be used to validate that the asset was created correctly, and can also be exposed to users through the Dagster GUI.
An operation is one discrete step which you want to carry out against your software assets. Example operations might be to download a file, to transform it, to anonymise it or to copy it to it's eventual destination.
Operations can be chained together into a pipeline of steps with dependencies between each step.
In reality, operations are combined in more complex ways than simple serial pipelines. A better analogy is the graph. A graph can have parallel phases and dependencies on multiple steps, such that we have a complex network of dependent operations.
Data Orchestration is what Dagster is doing at a high level as described in the previous lesson.
We define a job. A job will have an associated DAG, as well as things like a schedule on which the job should run.
A job can be defined using a graph, a collection of software assets or a collection of ops.
Schedules are used to schedule automatic runs of our Dagster pipelines on a periodic basis.
Sensors tell us when source data has changed so we know when to trigger a run. A sensor might watch for a file which has been updated, or for an asset which has been created by another job.