In this lesson we will:
- Explain how to integrate ClickHouse with Delta Lake.
Delta Lake is an open source table format for storing data in a format suitable for analytical purposes. It is based around Apache Parquet, but adds features such as schemas, the ability to read and write to files, transactions and time travel.
Delta Lake is sponsored by Databricks and is well integrated into Spark, and as such has gained considerable adoption as one of the main table formats used within cloud data platforms. It does have competitors such as Apache Iceberg and Apache Hudi, but Delta Lake is arguably the most popular table format in use today.
As more data is being stored in Delta Lake format, there is an increasing chance that we will need to integrate this and work with it within ClickHouse. In addition, using ClickHouse over a a Data Lake to deliver a Data Lakehouse is in many ways a desirable end state architecture.
As a first step, ClickHouse now provides a read only view over Delta files, though over time this is expected to become bi-directional.
We will now walk through the process of creating a file in Delta Lake format, uploading it to S3, and accessing it from ClickHouse in order to demonstrate the process.
Rather than using AWS for real, we will use localstack to simulate S3 locally.
We will begin by creating a file in Delta format which we will later load into ClickHouse.
The easiest way to do this is using the open source version of Apache Spark. Begin by downloading Apache Spark from the Spark website.
Having downloaded and unpacked Spark, the next step is to start a spark-sql session to create our Delta Lake file using SQL. In order to do this, we need to specify the Delta library as a package as this is not installed by default in open source Spark.
./bin/spark-sql --packages io.delta:delta-spark_2.12:3.0.0rc1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Once the Spark SQL shell has loaded, we can create a table as normal, adding the using delta clause to indicate that we would like to serialize the table in Delta format:
create table my_delta_table( val integer ) using delta; insert into my_delta_table values( 123 ); insert into my_delta_table values( 456 );
If you then close the shell and check the spark-warehouse folder within your Spark installation, you should see a folder named my_delta_table which contains a structure such as the following. This is our Delta Lake table which will ultimately want to access from ClickHouse.
-rw-r--r--@ 1 benjaminwootton staff 12 10 Oct 20:30 .part-00000-1a252b70-dd78-4a47-b2fe-46479e8b711d-c000.snappy.parquet.crc -rw-r--r--@ 1 benjaminwootton staff 12 10 Oct 20:29 .part-00000-95b7b4e6-dcae-4047-b942-9202db4dd2a1-c000.snappy.parquet.crc drwxr-xr-x@ 8 benjaminwootton staff 256 10 Oct 20:30 _delta_log -rw-r--r--@ 1 benjaminwootton staff 455 10 Oct 20:30 part-00000-1a252b70-dd78-4a47-b2fe-46479e8b711d-c000.snappy.parquet -rw-r--r--@ 1 benjaminwootton staff 455 10 Oct 20:29 part-00000-95b7b4e6-dcae-4047-b942-9202db4dd2a1-c000.snappy.parquet
ClickHouse currently only provides read only support for Delta Files, and only those that are stored on AWS S3. This situation should however evolve rapidly.
Rather than using S3 for this demo, it is easy to use localstack which simulates AWS on your local machine.
After downloading and installing localstack, it can be ran in the following way. This instantiates localstack within a Docker container, and starts an S3 service which is listening and ready to have buckets created and files uploaded.
DEBUG=1 localstack start
Next, we need to create a bucket on the localstack S3 to hold our Delta table:
aws s3api create-bucket --bucket my-delta-table --endpoint-url=http://localhost:4566 --region local --create-bucket-configuration LocationConstraint=local
We can then push our local Delta files from the Spark folder to the S3 bucket:
cd spark-warehouse aws s3 cp --recursive ./my_delta_table s3://my-delta-table --endpoint-url=http://localhost:4566
Finally, we can test that this was succesful by checking the following URL:
curl http://localhost:4566/my-delta-table/ <?xml version='1.0' encoding='utf-8'?>
So now we have created a Delta table and uploaded it to the mocked S3 running on localstack on our local machine.
The next and final step is to connect to it from ClickHouse by creating a table using the DeltaLake table engine.
create table my_delta_table ENGINE=DeltaLake('http://localhost:4566/my-delta-table/')
Once the table is created, we can select from it as normal:
select * from my_delta_table;
Which returns the following results:
Query id: d0d19eac-c1ec-4c2c-a0f0-23f06213d914 ┌─val─┐ │ 345 │ └─────┘ ┌─val─┐ │ 456 │ └─────┘ 2 rows in set. Elapsed: 5.888 sec.
We now have connectivity between ClickHouse and the Delta Lake file.
As mentioned, we may need to do this for integration purposes as Delta Lake becomes more popular, or perhaps it is a desirable end state whereby we use ClickHouse to query Delta Lake files stored in S3.