In this lesson we will:
- Explain the benefits of clustering your ClickHouse deployment;
- Briefly summarise the process for configuring a cluster;
- Introduce the concept of replicated tables.
It is possible to run ClickHouse in a clustered mode, where we have two or more server instances working together as a co-ordinated unit.
There are two reasons why we would want to do this:
- Scalability - One of the primary reasons for clustering ClickHouse is to improve scalability. In a clustered environment, ClickHouse can handle more data and more queries than a single-node setup. This is particularly useful for large-scale data analytics, where the volume of data is continuously growing. Clustering allows you to distribute data across multiple nodes, enabling the database to manage larger datasets efficiently and maintain high performance even under heavy load.
- High Availability and Fault Tolerance - Clustering enhances the reliability of your database. By distributing data across multiple nodes, you ensure that there is no single point of failure. If one node goes down, the others can continue to serve queries, ensuring continuous availability of your data. This redundancy is crucial for critical applications where data availability and uptime are paramount. Additionally, clustering can facilitate easier and more efficient backup processes, further securing your data against loss.
If you are not using ClickHouse Cloud, you will be responsible for configuring the cluster yourself through configuration files. We will now walk through the key steps at a high level:
First, you need to install ClickHouse on each node that will be part of the cluster. Ensure that all nodes have the same version of ClickHouse for compatibility.
Ensure that the nodes in your cluster can communicate with each other. This usually involves configuring firewalls and network settings to allow traffic on the ports used by ClickHouse (default is 9000 for TCP).
On each server node, you will need to edit the ClickHouse configuration file, typically located at /etc/clickhouse-server/config.xml. You will define your cluster in the <remote_servers> section.
Here's an example configuration for a simple two-node cluster:
<remote_servers>
    <my_cluster>
        <shard>
            <internal_replication>false</internal_replication>
            <replica>
                <host>node1-hostname</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>false</internal_replication>
            <replica>
                <host>node2-hostname</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>
This configuration defines a cluster named my_cluster with two shards, each containing one replica.
This configuration file would be duplicated on both machines in the cluster.
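Once the configuration is in place on both machines, you can confirm that each ClickHouse server has registered the cluster by querying the system.clusters table (the name my_cluster here matches the example configuration above):

```sql
-- Should return one row per replica: two rows for the two-shard example above
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'my_cluster';
```

If the cluster does not appear, check that the configuration file was reloaded or the server restarted.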
If you plan to use replicated tables, you need a ZooKeeper cluster. Install and configure ZooKeeper on separate servers (or use an existing ZooKeeper cluster if one is available). Then, add the ZooKeeper configuration to the ClickHouse config.xml file:
<zookeeper>
    <node>
        <host>zookeeper1-hostname</host>
        <port>2181</port>
    </node>
    <!-- Additional ZooKeeper nodes if available -->
</zookeeper>
ClickHouse also provides ClickHouse Keeper, a ZooKeeper-compatible coordination service optimised for the ClickHouse use case. Today, ClickHouse Keeper is the recommended option over ZooKeeper, though either will work.
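If the coordination service is configured correctly, ClickHouse exposes its contents through the system.zookeeper table, which works with both ZooKeeper and ClickHouse Keeper. A quick connectivity check from any node:

```sql
-- Lists the top-level znodes; fails if ClickHouse cannot reach the coordination service
SELECT name, path
FROM system.zookeeper
WHERE path = '/';
```

Note that queries against system.zookeeper must always include a condition on path.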
After updating the configuration, restart ClickHouse on each node to apply the changes:
sudo systemctl restart clickhouse-server
With the cluster set up, you can now create distributed and replicated tables. Here is an example SQL statement to create a replicated table:
CREATE TABLE mydb.my_replicated_table ON CLUSTER my_cluster
(
    id UInt32,
    data String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_replicated_table', '{replica}')
ORDER BY id;

The {shard} and {replica} placeholders are macros that must be defined on each node (in the <macros> section of the server configuration), so that every server substitutes its own identity into the coordination path.
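Once the replicated table exists on every node, you can inspect its replication state through the system.replicas table, for example:

```sql
-- One row per local replica of the table; active_replicas should equal total_replicas in a healthy cluster
SELECT database, table, replica_name, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE table = 'my_replicated_table';
```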
You can also create tables with the distributed table engine:
CREATE TABLE distributed_table_name AS local_table_name ENGINE = Distributed(cluster_name, database_name, local_table_name, sharding_key);
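Following that generic form, and reusing the cluster and replicated table from the earlier examples, a concrete distributed table might look like this (the table name my_distributed_table is illustrative):

```sql
CREATE TABLE mydb.my_distributed_table ON CLUSTER my_cluster
AS mydb.my_replicated_table
ENGINE = Distributed(my_cluster, mydb, my_replicated_table, rand());
```

Using rand() as the sharding key spreads inserts evenly across shards; a column-based key such as id instead keeps rows with the same key on the same shard.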
Finally, you should test the cluster by reading and writing to the replicated tables.
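A minimal end-to-end check, assuming the example tables created above: write through the distributed table, then read back and confirm the data is visible from every node.

```sql
-- Insert through the distributed table; rows are routed to shards by the sharding key
INSERT INTO mydb.my_distributed_table (id, data) VALUES (1, 'hello'), (2, 'world');

-- Read through the distributed table: aggregates across all shards
SELECT count() FROM mydb.my_distributed_table;

-- Run on each node individually to see only that shard's local rows
SELECT count() FROM mydb.my_replicated_table;
```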