Combining Cube And ClickHouse For User Facing Analytics

The high performance of ClickHouse makes it particularly suitable for user-facing analytics use cases. This is where we embed analytics directly into a product or user experience which are accessed by a large number of external end users or consumers.

A powerful database engine is not however enough to deliver user facing analytics. In addition, other requirements are likely to emerge, including:

API Access - User facing analytics are often served over websites or single page applications built in frameworks such as React or Next.js. Though ClickHouse has a number of official and third party drivers, web developers often prefer to request data from some backend API through stateless REST or GraphQL APIs rather than querying a database directly;
Caching - Despite the high performance of ClickHouse, it can still be beneficial to implement caching to reduce load on the database, especially when the results haven't changed. Caching will typically happen at multiple tiers, but a cache that sits in front of the database and invalidates when new data is available can have an important role to play;
Semantic Layer - With ClickHouse, there is a tendency to keep data in relatively low level de-normalised formats. For that reason, it can add value to expose a "semantic layer" which models data as concepts that are more convenient for developers to query and expose;
Security - If you have a multi-tenant application, it is important to keep data isolated to specific users. Though this can be done in multiple ways and in multiple layers, again there are arguments for adding a layer of protection into your API layer to ensure that data is never exposed to unauthorised users.

These are type of challenges which are not particularly differentiating, but which developers would need to solve for if they wish to include user facing analytics in their applications.

Introducing Cube

Cube, formerly named cube.js, elegantly solves all of these problems in one convenient package.

It works by acting as a middleware layer between your application and your database and provides all of the above features in a layered architecture.

In this article, we are going to walk through how Cube is used with ClickHouse, and explain how it delivers the four features discussed - API Access, Caching, a Semantic Layer and Security. We will also provide access to a sample repository to demonstrate how this looks in a next.js application.

Cube can be deployed as an open source project, or as a hosted cloud service. In this walkthrough, we will use Cube Cloud to demonstrate it's integration with ClickHouse, but it works equally well using the open source deployment.

Connecting To ClickHouse

The first thing we need to do is to connect Cube Cloud to our ClickHouse instance. ClickHouse is supported out of the box without any additional drivers, and support for ClickHouse is very complete even where we are building complex queries and using joins, cache invalidation and other features.

Schema Generation

After connecting to ClickHouse, Cube will automatically propose schemas which describe the dimensions and metrics in your project. These can be enhanced with accurate data types and readable descriptions in order to more fully develop your semantic model.

Schemas are built and imported into your project in either YAML or JavaScript format. From this point, these can be edited directly to add new fields or build new measures.

cubes:
  - name: nypd_complaint
    sql_table: default.NYPD_Complaint

    joins: []

    dimensions:
      - name: complaint_number
        sql: complaint_number
        type: string
        primary_key: true

      - name: borough
        sql: borough
        type: string

      - name: was_crime_completed
        sql: was_crime_completed
        type: string

    measures:
      - name: count
        type: count

Cube incorporates a Git based development workflow, where you can edit and test your files on a branch before committing to a main branch. When they are committed, they are deployed to your development instance. This workflow is very well implemented.

API Access

We then get onto the most important feature of Cube, which is in how it provides APIs in front of the database, in our case ClickHouse.

The model is that queries are constructed using JSON and GraphQL, and then sent into Cube. In turn, Cube will translate them into the underlying SQL to interact with ClickHouse. This means that we do not have to embed any SQL in the application and have a nice layer of indirection.

Ordinarily, application developers would spend a lot of time developing object-relational mapping (ORM) solutions like this, and it's completely non-differentiating, repetitive and error prone code.

Cube Playground

The best place to begin building queries is in the Cube Playground. This will allow you to visually build queries in the Cube GUI that include aggregations, filters and time limitations. These queries can then be copied into your application wihtout having to hand build JSON or GraphQL queries.

In the example below, we are looking at the NYPD Police Complaint dataset which has been loaded into our ClickHouse Cloud instance. We have requested a count of complaints grouped by borough for the whole time period. A table is returned that shows that most complaints were in Brooklyn.

By clicking on the JSON button, we can see the query which would need to be made to Cube to execute the query through the JSON API:

Or the equivalent can be generated in GraphQL format:

Building queries through the playground and then copying them into the application codebase is much faster and less error prone than building them manually within your code.

Charts Convenience Functions

The Playground will also give boilerplate code for rendering charts and tables inside JavaScript applications. You can choose a framework such as React, Angular and Vue, and libraries such as Chart.js, Recharts and Chartkick and again get the code which can be copied directly into your codebase.

Again, this saves times for developers, avoiding the need to hand craft code to render charts.

Caching

Out of the box, Cube makes use of caching to prevent queries needing to visit the database and speed up responses to the users.

By default, query results are cached in memory, and the same query will return the same results.

In the case of ClickHouse, any query results will be cached for 10 seconds, which makes it most useful when we have highly concurrent and bursty access. It may be appropriate to reduce or increase this number dependent on how fresh you need data to be and the volumes of users that you expect.

To customise the time window, a refresh key can be added to your models, which can be defined like either on a schedule, or by looking for changes in a specific field:

cubes:
  - name: orders
    refresh_key:
      every: 1 minute

cubes:
  - name: orders
    refresh_key:
      sql: SELECT MAX(order_id) FROM orders

When the refesh key expires or changes, the cache is cleared for the particular query and the next query invocation will visit the database.

Pre-Aggregations

The cache described above is an in memory cache.

It is also possible to use Cube to pre-aggregate queries and store the results in the database, avoiding the need to run the underlying query repeatedly. This reduces load on the database and again speeds up queries by pre-computing the results to common aggregations.

In ClickHouse, this would be the equivalent of using a materialised view. As ClickHouse materialised views are very powerful, pre-aggregations have less applicability for us. However, there may be some situations where it makes sense to defer this requirement to Cube.

cubes:
  - name: customers
    sql_table: customers
 
    pre_aggregations:
      - name: customers_by_tier
        refresh_key:
          every: 1 day
        dimensions:
          - customers.tier
        measures:
          - customersBE.count

By default, pre-aggregations are only refreshed every hour. Depending on the requirements of your users, this can be increased or decreased as appropriate again by using refresh keys.

Semantic Layer

A semantic layer involves applying a domain specific model in front of your data that makes it easier to understand, mapping from the raw, technical data infrastructure into the language of the business users who need to analyze and interpret it. This simplifies data access and provides a layer of indirection if the underlying technical representations change.

This is particularly useful in the case of ClickHouse, where we often store data as relatively raw and unprocessed datasets. Using Cube, we can continue to do this whilst exposing a higher level API to users.

By developing the schema shown above, we are already developing a semantic layer. For instance, below we map the field odr_vol_suppl to the more descriptive name supplier_order_volume.

 dimensions:
      - name: supplier_order_volume
        sql: odr_vol_suppl
        type: string

This does go further though. We can join underlying tables to expose a single business concept:

cubes:
  - name: orders
    joins:
      - name: order_lines
        relationship: one_to_one
        sql: "{orders}.id = {order_lines.order_id}"

Or we can use segments to limit an underlying table to just a subset of rows:

cubes:
  - name: gold_tier_customers
    segments:
      - name: customers
        sql: "{CUBE}.tier = 'Gold'"

Security

The final main use of Cube is in adding an additional layer of security to properly isolate data from users and implement role based access control.

Again, this could be solved at multiple layers, for instance using ClickHouse RBAC and in your application, but again it may make sense to implement an additional layer of security in front of the database.

The model is based on JWTs, which are passed into Cube through the Authorization header on the request. The JWT payload is then used to apply additional WHERE clauses into any queries which are sent to the endpoint.

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    
    if (securityContext.user_id) {
      query.filters.push({
        member: "orders_view.users_id",
        operator: "equals",
        values: [securityContext.user_id],
      });  
    }
 
    return query;
  },
};

This query rewriting means that users can never use Cube to access data within ClickHouse without the appropriate level of authorisation.

Should We Use Cube?

We think that using Cube is almost entirely upside when developing web applications that integrate with ClickHouse. JSON and GraphQL APIs out of the box, an in memory cache, pre-aggregation functions and additional security hooks. The open-source platform is very powerful, and the cloud service provides a fully managed option together with the Git based development workflow. Developers developing web applications in front-of ClickHouse should definitely consider this architecture.

Optimising Vehicle Routing Using ClickHouse Cloud And Google OR-Tools