Forecasting Using ClickHouse Machine Learning Functions

Benjamin Wootton

Benjamin Wootton

Follow me on LinkedIn
Forecasting Using ClickHouse Machine Learning Functions

This article is part of a series where we look at doing data science work within ClickHouse. Articles in this series include forecasting, anomaly detection, linear regression and time series classification.

Though this type of analysis would more typically take place outside of ClickHouse in a programming language such as Python or R, our preference is to take things as far as possible using just the database.

By taking this approach, we can rely on the power of ClickHouse to process large datasets with high performance, and reduce or even totally avoid the amount of code that we need to write. This also means that we can work with smaller in-memory datasets on the client side and potentially avoid the need for distributed computation using frameworks such as Spark.

About This Example

In this article we will look at forecasting. ClickHouse implements two machine learning functions - Stochastic Linear Regression (stochasticLinearRegression) which can be used for fitting the model, and a function (evalMLMethod) which can be used for subsequent inference directly within the database.

Of course there are more sophisticated forecasting models and more flexibility once you break out of SQL into a fully-fledged programming language, but this technique certainly has it's uses and performs well in our demonstration scenario here.

Dataset

To demonstrate, we are going to use a simple flight departure dataset which contains a monthly time series of the number of passengers departing from different airports using various airlines.

Our aim will be to take this data and use it forecast the same data into the future.

We will aim to build a model using data from 2008 to 2015, and then test the performance of the model between 2015 and 2018. Finally, we will then forecast beyond the period through till 2021.

Our souce data has the following structure:

SELECT *
FROM flight_data
LIMIT 10

Query id: ac6f2908-62b6-4532-abc8-f5caa4480144

┌─AIRLINE─┬─DEPARTURE_AIRPORT─┬──────MONTH─┬─PASSENGERS─┐
│ Delta   │ DIA               │ 2008-01-01 │        434 │
│ Delta   │ DIA               │ 2008-02-01 │        475 │
│ Delta   │ DIA               │ 2008-03-01 │        531 │
│ Delta   │ DIA               │ 2008-04-01 │        509 │
│ Delta   │ DIA               │ 2008-05-01 │        472 │
│ Delta   │ DIA               │ 2008-06-01 │        562 │
│ Delta   │ DIA               │ 2008-07-01 │        642 │
│ Delta   │ DIA               │ 2008-08-01 │        642 │
│ Delta   │ DIA               │ 2008-09-01 │        596 │
│ Delta   │ DIA               │ 2008-10-01 │        503 │
└─────────┴───────────────────┴────────────┴────────────┘

10 rows in set. Elapsed: 0.002 sec. Processed 4.62 thousand rows, 151.54 KB (2.16 million rows/s., 70.86 MB/s.)
Peak memory usage: 229.15 KiB.

When plotted, the data looks like this, showing how all airlines are carrying an increased number of passengers over time together with a significant seasonality effect.

Data Preparation

Our forecasting model uses 13 deterministic features: a linear time trend and 12 dummy (or one-hot encoded) variables representing the 12 months of the year. We exclude the constant term (or intercept) in order to avoid the "dummy variable trap".

The model predicts the logarithm of the number of passengers. The logarithmic transformation allows us to better capture the time-varying amplitude of the seasonal fluctuations.

create view
    data
as with
    (select toDate(min(MONTH)) from flight_data) as start_date,
    (select toDate(max(MONTH)) from flight_data) as end_date
select
    AIRLINE,
    DEPARTURE_AIRPORT,
    MONTH,
    toFloat64(log(PASSENGERS)) as Target,
    assumeNotNull(dateDiff('month', start_date, MONTH) / dateDiff('month', start_date, end_date)) as Trend,
    if(toMonth(toDate(MONTH)) = 1, 1, 0) as Dummy1,
    if(toMonth(toDate(MONTH)) = 2, 1, 0) as Dummy2,
    if(toMonth(toDate(MONTH)) = 3, 1, 0) as Dummy3,
    if(toMonth(toDate(MONTH)) = 4, 1, 0) as Dummy4,
    if(toMonth(toDate(MONTH)) = 5, 1, 0) as Dummy5,
    if(toMonth(toDate(MONTH)) = 6, 1, 0) as Dummy6,
    if(toMonth(toDate(MONTH)) = 7, 1, 0) as Dummy7,
    if(toMonth(toDate(MONTH)) = 8, 1, 0) as Dummy8,
    if(toMonth(toDate(MONTH)) = 9, 1, 0) as Dummy9,
    if(toMonth(toDate(MONTH)) = 10, 1, 0) as Dummy10,
    if(toMonth(toDate(MONTH)) = 11, 1, 0) as Dummy11,
    if(toMonth(toDate(MONTH)) = 12, 1, 0) as Dummy12
from
    flight_data
order by AIRLINE, DEPARTURE_AIRPORT, MONTH

This creates the following view which summarises our dependent and independent variables:

SELECT *
FROM data
LIMIT 10

Query id: e425f35c-6b49-4b44-9950-628b701845e4

┌─AIRLINE─┬─DEPARTURE_AIRPORT─┬──────MONTH─┬─────────────Target─┬────────────────Trend─┬─Dummy1─┬─Dummy2─┬─Dummy3─┬─Dummy4─┬─Dummy5─┬─Dummy6─┬─Dummy7─┬─Dummy8─┬─Dummy9─┬─Dummy10─┬─Dummy11─┬─Dummy12─┐
│ Delta   │ DIA               │ 2008-01-01 │ 6.0730445333335865 │                    0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-02-01 │  6.163314804336003 │ 0.007633587786259542 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-03-01 │  6.274762021388925 │ 0.015267175572519083 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-04-01 │  6.232448016554782 │ 0.022900763358778626 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-05-01 │  6.156978985873825 │ 0.030534351145038167 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-06-01 │ 6.3315018500618665 │  0.03816793893129771 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-07-01 │  6.464588304624293 │  0.04580152671755725 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-08-01 │  6.464588304624293 │  0.05343511450381679 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-09-01 │  6.390240666362644 │ 0.061068702290076333 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-10-01 │  6.220590170138575 │  0.06870229007633588 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       1 │       0 │       0 │
└─────────┴───────────────────┴────────────┴────────────────────┴──────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴─────────┘

10 rows in set. Elapsed: 0.010 sec. Processed 13.86 thousand rows, 170.02 KB (1.37 million rows/s., 16.81 MB/s.)
Peak memory usage: 420.28 KiB.

Model Training

We use ClickHouse's stochasticLinearRegression algorithm, which trains a linear regression using gradient descent. We build 35 different models at the same time, one for each airline-airport combination.

create view model as select
    AIRLINE,
    DEPARTURE_AIRPORT,
    stochasticLinearRegressionState(0.5, 0.01, 4, 'SGD')(
        Target, Trend, Dummy1, Dummy2, Dummy3, Dummy4, Dummy5, Dummy6, Dummy7, Dummy8, Dummy9, Dummy10, Dummy11, Dummy12
    ) as state
from train_data
group by AIRLINE, DEPARTURE_AIRPORT

As there is a small amount of data, the model is simply defined as a view. For bigger datasets, we may choose to materialize this as a table or a view.

Model Evaluation

We can now use the trained model to generate the forecasts over the test set and compare them to the actual values. At this stage we can also transform the data and the forecasts back to the original scale by taking the exponential.

select
    a.MONTH as MONTH,
    a.AIRLINE as AIRLINE,
    a.DEPARTURE_AIRPORT as DEPARTURE_AIRPORT,
    toInt32(exp(a.Target)) as ACTUAL,
    toInt32(exp(evalMLMethod(b.state, Trend, Dummy1, Dummy2, Dummy3, Dummy4, Dummy5, Dummy6, Dummy7, Dummy8, Dummy9, Dummy10, Dummy11, Dummy12))) as FORECAST
from test_data as a
left join model as b
on a.AIRLINE = b.AIRLINE and a.DEPARTURE_AIRPORT = b.DEPARTURE_AIRPORT

If we compare the forecast and the actuals, we can see that the forecast performed well:

We can validate this by calculating the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of the forecasts for each airline-airport combination.

SELECT
    AIRLINE,
    DEPARTURE_AIRPORT,
    avg(abs(ERROR)) AS MAE,
    sqrt(avg(pow(ERROR, 2))) AS RMSE
FROM
(
    SELECT
        a.AIRLINE AS AIRLINE,
        a.DEPARTURE_AIRPORT AS DEPARTURE_AIRPORT,
        toInt32(exp(a.Target)) - toInt32(exp(evalMLMethod(b.state, Trend, Dummy1, Dummy2, Dummy3, Dummy4, Dummy5, Dummy6, Dummy7, Dummy8, Dummy9, Dummy10, Dummy11, Dummy12))) AS ERROR
    FROM test_data AS a
    LEFT JOIN model AS b ON (a.AIRLINE = b.AIRLINE) AND (a.DEPARTURE_AIRPORT = b.DEPARTURE_AIRPORT)
)
GROUP BY
    AIRLINE,
    DEPARTURE_AIRPORT

Query id: 320cad46-bb31-4248-bd25-19d98d5d2d15

┌─AIRLINE──┬─DEPARTURE_AIRPORT─┬────────────────MAE─┬───────────────RMSE─┐
│ JetBlue  │ SFO               │  86.38888888888889 │ 110.96671172523367 │
│ KLM      │ PDX               │ 167.97222222222223 │  213.4134615143936 │
│ Delta    │ SJC               │ 141.80555555555554 │  180.9452802491528 │
│ United   │ PDX               │ 115.19444444444444 │  147.7711255812703 │
│ JetBlue  │ ORL               │  97.77777777777777 │ 125.28611699271038 │
│ KLM      │ JAX               │ 121.27777777777777 │ 155.41414207064798 │
│ Delta    │ JFK               │              168.5 │  214.1754213515433 │
│ United   │ JAX               │ 153.88888888888889 │  195.9098432102549 │
│ Delta    │ SFO               │ 184.66666666666666 │ 234.34068267280344 │
│ KLM      │ DIA               │ 148.94444444444446 │ 189.77618396416344 │
│ United   │ JFK               │ 178.02777777777777 │   226.086205289536 │
│ Frontier │ ORL               │ 206.38888888888889 │ 261.27720485679146 │
│ United   │ SJC               │ 119.91666666666667 │ 153.72332650288018 │
│ KLM      │ SJC               │ 218.13888888888889 │ 275.90532796595284 │
│ KLM      │ JFK               │  70.30555555555556 │  90.43244869944515 │
│ Delta    │ JAX               │ 186.55555555555554 │ 236.69213477990067 │
│ Delta    │ ORL               │  74.44444444444444 │  95.50887102486577 │
│ Frontier │ SFO               │  63.02777777777778 │  80.91748197323548 │
│ Frontier │ PDX               │                 81 │ 103.99278821149089 │
│ United   │ ORL               │              111.5 │ 142.90031490518138 │
│ Frontier │ JAX               │  98.11111111111111 │ 125.86147588166568 │
│ Frontier │ DIA               │  95.91666666666667 │ 122.96758832219886 │
│ Delta    │ PDX               │  72.41666666666667 │  92.89046715830904 │
│ JetBlue  │ JFK               │ 141.91666666666666 │ 181.17877911057906 │
│ JetBlue  │ SJC               │              209.5 │  265.1057441013973 │
│ JetBlue  │ JAX               │ 107.30555555555556 │ 137.61893845769274 │
│ KLM      │ ORL               │ 156.77777777777777 │ 199.51287900506296 │
│ JetBlue  │ DIA               │  76.83333333333333 │  98.60076628054729 │
│ Frontier │ SJC               │  97.22222222222223 │  124.6602048236191 │
│ Frontier │ JFK               │ 156.33333333333334 │ 199.04550010264265 │
│ Delta    │ DIA               │                114 │  146.3065655092454 │
│ KLM      │ SFO               │ 119.97222222222223 │  153.7722883573847 │
│ United   │ DIA               │  72.63888888888889 │  93.25666493905706 │
│ JetBlue  │ PDX               │ 147.83333333333334 │  188.4872527372725 │
│ United   │ SFO               │ 186.83333333333334 │ 237.06668072740865 │
└──────────┴───────────────────┴────────────────────┴────────────────────┘

35 rows in set. Elapsed: 0.024 sec. Processed 18.48 thousand rows, 321.55 KB (785.99 thousand rows/s., 13.68 MB/s.)
Peak memory usage: 766.46 KiB.

Model Inference

Finally, we can now use the model for generating the forecasts beyond the last date in the dataset. For this purpose, we create a new table containing the dates and their corresponding transformations (time trend and dummy variables) over the subsequent 3 years.

create view
    future_data
as with
    (select toDate(min(MONTH)) from flight_data) as start_date,
    (select toDate(max(MONTH)) from flight_data) as end_date
select 
    AIRLINE,
    DEPARTURE_AIRPORT,
    MONTH + INTERVAL 3 YEAR as MONTH, 
    assumeNotNull(dateDiff('month', start_date, MONTH) / dateDiff('month', start_date, end_date)) as Trend,
    if(toMonth(toDate(MONTH)) = 1, 1, 0) as Dummy1,
    if(toMonth(toDate(MONTH)) = 2, 1, 0) as Dummy2,
    if(toMonth(toDate(MONTH)) = 3, 1, 0) as Dummy3,
    if(toMonth(toDate(MONTH)) = 4, 1, 0) as Dummy4,
    if(toMonth(toDate(MONTH)) = 5, 1, 0) as Dummy5,
    if(toMonth(toDate(MONTH)) = 6, 1, 0) as Dummy6,
    if(toMonth(toDate(MONTH)) = 7, 1, 0) as Dummy7,
    if(toMonth(toDate(MONTH)) = 8, 1, 0) as Dummy8,
    if(toMonth(toDate(MONTH)) = 9, 1, 0) as Dummy9,
    if(toMonth(toDate(MONTH)) = 10, 1, 0) as Dummy10,
    if(toMonth(toDate(MONTH)) = 11, 1, 0) as Dummy11,
    if(toMonth(toDate(MONTH)) = 12, 1, 0) as Dummy12
from
    test_data
order by AIRLINE, DEPARTURE_AIRPORT, MONTH

Giving us an end to end visualisation of this. Visually, we can see that the increase in passenger numbers and the seasonality has been captured by the out of range forecast.

Conclusion

In this article we have demonstrated how we can use the ML functions (stochasticLinearRegression and evalMLMethod) that are available within ClickHouse to implement a simple forecasting technique.

In principle, offloading metrics and analytics work like this to the database is a good thing. An analytical database such as ClickHouse will generally outperform and allow us to work with datasets that are bigger than can be processed on a single machine, whilst also reducing the amount of scripting work that needs to take place.

In ClickHouse, this could also be built into a materialized view, meaning that models are continually updated and retrained as new data is captured opening up real-time possibilities.

We believe this pattern could grow in future, with more data science and machine learning algorithms being implemented directly within the database.

Sample Notebook

A notebook describing the full worked example can be found at this URL.

Join our mailing list for regular insights:

We help enterprise organisations deploy advanced data, analytics and AI enabled systems based on modern cloud-native technology.

© 2024 Ensemble. All Rights Reserved.