In this lesson we will:
- Querying S3 As A URL;
- S3 Functions;
- S3 Table Engine.
Cloud object stores such as AWS S3 are becoming an increasingly common place for data teams to store their physical data. These services are relatively cheap, reliable, globally replicated and secure, making them suitable and attractive for this use case.
In addition, as more teams choose to run their ClickHouse instances in AWS, or migrate to AWS hosted deployments of ClickHouse Cloud, it also makes sense to store data co-located alongside the database cluster which makes S3 the natural choice.
In this lesson we wil learn about what is involved in connecting to S3, initially to query data "in place" without copying it into ClickHouse. We may wish to do this as part of a Data Lakehouse approach where we are making use of data that is stored persistently in a data lake.
Then, we will look at the process and considerations when copying data into ClickHouse, which is a more typical use case which would enable the best possible performance.
We suggest you complete our lesson on table functions as a pre-requisite to this one, as much of the interaction between ClickHouse and S3 occurs through the table function abstraction.
If you have data stored on an S3 bucket with it's contents publically exposed, the simplest way is simply to query it using the URL table function.
Of course, it would be very unlikely and irresponsible to store your private data on a publically accessible bucket, but we often find that publically accessible test datasets are distributed in this way.
The URL table function simply takes a URL and a format. In the example below we use the CSVWithNames format to indicate that the URL contains CSV files with headers.
SELECT id FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.csv.gz', 'CSVWithNames') LIMIT 5
┌─────id─┐ │ 344065 │ │ 344066 │ │ 344067 │ │ 344068 │ └────────┘ ┌─────id─┐ │ 344069 │ └────────┘
Note that we are also querying compressed .gz files, and ClickHouse knows how to handle this.
Next, we can combine this technique with a CREATE TABLE command to infer and build a schema from the remote file and then ingest the data.
CREATE TABLE hackernews ENGINE = MergeTree ORDER BY tuple() AS SELECT * FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.csv.gz', 'CSVWithNames') LIMIT 10
SELECT count(*) FROM hackernews
┌─count()─┐ │ 10 │ └─────────┘
This is the simplest way to ingest data from S3 if your data is accessible to your ClickHouse server without authentication.
Ordinarily, we will need to provide authentication details to access the data stored on S3. This means that we need to provide an access key and a secret which are referred to as aws_access_key_id and aws_secret_access_key respectively.
To specify these, we will need to use the S3 table function instead of the generic URL one shown above, and specify the parameters at the point that we use it.
The call is very similar, and we have similar features such as the ability to query compressed .gz files and the ability to specify a format, in this case tab seperated as opposed to CSV.
CREATE TABLE trips ENGINE = MergeTree ORDER BY tuple ( ) AS SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_0.gz', 'SOME_ACCESS_KEY', 'SOME_SECRET_ACCESS_KEY', 'TabSeparatedWithNames')
It is also possible to write into tables abstracted by the S3 table function:
The S3 table functions are useful for ad-hoc interactions with S3, but it means exposing the AWS keys to users and including them repeatedly in our scripts which is not a particularly scalable solution.
For a more scalable solution, we can make use of the S3 table engine. This involves creating a ClickHouse table as a first class object, which interacts with S3 transparently in the background.
From a security perspective, this method allows you to encode the access keys into the table definition. After that, we can interact with the table as we would any other ClickHouse table without any credentials.
The table is created in the following way:
CREATE TABLE s3_engine_table (name String, value UInt32) ENGINE=S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-test-bucket-768/test-data.csv.gz', 'CSV', 'gzip') SETTINGS input_format_with_names_use_header = 0;
We can then select from the table:
At this point, we are accessing the data which remains stored in S3. This is a valid place to stop if we are aiming for a Data Lakehouse approach, whereby we use ClickHouse to query data stored outside of ClickHouse. However, a more typical approach, and one which will deliver the best possible performance is to ingest the data into the ClickHouse instance like so:
create table clickhouse_table as select * from s3_engine_table;