This article is part of a series where we look at doing data science work within ClickHouse. Articles in this series include forecasting, anomaly detection, linear regression and time series classification.
Though this type of analysis would more typically take place outside of ClickHouse in a programming language such as Python or R, our preference is to take things as far as possible using just the database.
By taking this approach, we can rely on the power of ClickHouse to process large datasets with high performance, and reduce or even totally avoid the amount of code that we need to write. This also means that we can work with smaller in-memory datasets on the client side and potentially avoid the need for distributed computation using frameworks such as Spark.
In this article, we will demonstrate a simple time series classification technique using Clickhouse machine learning functions.
We will look at time series of electical power demand in Italy, and automatically classify them into two categories - either fall / winter (between October and March) or days in the spring / summer (between April and September). In the dataset, the -1 label is summer, and +1 is winter.
Obviously, power demand is very different in the peak of summer and peak of winter. However, in Spring and Autumn where the seasons cross, the classification task becomes more difficult. There are also anomalies through the year such as cool days in the summer and warm days in the winter. Intuitively we should therefore expect the model to perform fairly well, but it is not an entirely trivial exercise.
Regardless of the problem, it is an interesting demonstration of the technique and how to work with time series data in ClickHouse, so on we go.
Each time series in the dataset represents one day (24 hours) of electrical power demand in Italy. The dataset has the following columns:
- Split - Whether the time series belongs to the training set (train) or to the test set (test).
- Label - The time series binary label, either -1 or 1 representing summer and winter respectively.
- h1-h24 - the time series values over 24 hours.
A sample of the dataset is shown below:
SELECT split, label, h1 FROM power_demand LIMIT 5 Query id: 25dc4b0d-9d0c-4cf8-a638-507540e31499 ┌─split─┬─label─┬─────────h1─┐ │ test │ -1 │ 0.47297301 │ │ test │ -1 │ -1.0987371 │ │ test │ -1 │ -1.0294417 │ │ test │ -1 │ 0.29127426 │ │ test │ -1 │ -0.8146062 │ └───────┴───────┴────────────┘ 5 rows in set. Elapsed: 0.002 sec. Processed 1.10 thousand rows, 31.85 KB (565.67 thousand rows/s., 16.44 MB/s.) Peak memory usage: 1.25 MiB.
If we take the average hourly power demand across the two periods and plot them then we get the following visualisation. The blue line represents summer (-1) and the orange line represents winter (1).
We can see that power demand in the winter is slightly higher than the summer, particularly in the evneing, but the two lines do cross and average out to be relatively close which again demonstrates that this is not a trivial classification problem.
In our previous examples we had to manually partition our dataset between test and train. In this instance though, our dataset is already split.
We can also see that the data is balanced as the two labels (-1 and 1) appear with similar frequencies, both in the training set and in the test set.
SELECT split, label, count(*) FROM power_demand GROUP BY split, label Query id: 148ec1da-8fa8-4ddf-8080-189d52b1ec04 ┌─split─┬─label─┬─count()─┐ │ test │ 1 │ 513 │ │ test │ -1 │ 516 │ │ train │ 1 │ 34 │ │ train │ -1 │ 33 │ └───────┴───────┴─────────┘
The time series are already standardized as they have mean equal to zero and standard deviation equal to one, and therefore no additional preprocessing is required.
We use a logistic regression model to predict the time series labels from the time series values. We fit the model to the training set using Clickhouse's stochasticLogisticRegression function and save the results to a model table:
CREATE OR REPLACE VIEW power_demand_model AS AS SELECT stochasticLogisticRegressionState(0.1, 0.0, 4, 'SGD')( label, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24 ) AS state FROM power_demand WHERE split = 'train'
After fitting the model to the training set, we can use it to make prediction on the test set.
Note that the model returns the predicted probabilities and not the class labels. In order to transform the predicted probabilities into class labels, we compare them with the standard decision threshold of 0.5: if the predicted probability is greater than 0.5 then the predicted class label is 1, otherwise it's -1.
CREATE OR REPLACE VIEW power_demand_results AS WITH ( SELECT state FROM power_demand_model ) AS model SELECT label AS ACTUAL_LABEL, evalMLMethod(model, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24) AS PREDICTED_PROBABILITY, if(PREDICTED_PROBABILITY > 0.5, 1., -1.) AS PREDICTED_LABEL FROM power_demand
This generates the following results view which maps the predicted probability to a label, and then compares it with the actual observed result in our test partition:
SELECT * FROM power_demand_results LIMIT 5 Query id: 5570eca8-dc09-4d1f-a4a6-00e1446553a8 ┌─ACTUAL_LABEL─┬─PREDICTED_PROBABILITY─┬─PREDICTED_LABEL─┐ │ -1 │ 0.23409648001131916 │ -1 │ │ -1 │ 0.3862228028377256 │ -1 │ │ -1 │ 0.3815130710249039 │ -1 │ │ -1 │ 0.4265333624840515 │ -1 │ │ -1 │ 0.39001302830997614 │ -1 │ └──────────────┴───────────────────────┴─────────────────┘
As shown in the plot below, the model classifies most time series correctly, given that the predicted probability is lower than 0.5 for most of the time series in the negative class (-1) and greater than 0.5 for most of the time series in the positive class (1).
We find that the model achieves a 97% accuracy on the test set.
Considering the points made above and how closely the two time series average across the two labels, we find this to be quite remarkable performance.
SELECT avg(if(ACTUAL_LABEL = PREDICTED_LABEL, 1., 0.)) FROM power_demand_results Query id: 88f1e866-ca4b-4746-8b47-5f5b8bbe7b82 ┌─avg(if(equals(ACTUAL_LABEL, PREDICTED_LABEL), 1., 0.))─┐ │ 0.967930029154519 │ └────────────────────────────────────────────────────────┘ 1 row in set. Elapsed: 0.011 sec. Processed 2.19 thousand rows, 467.03 KB (192.64 thousand rows/s., 41.04 MB/s.) Peak memory usage: 13.49 MiB.
In this article we have demonstrated how we could take time series data and classify it against two labels - winter or summer. We trained with half of the dataset, and tested with the other half and acheived an impressive 97% accuracy.
Though the classification task sounded trivial on the surface, we explained how this became harder when we considered spring and autumn, and the fact that the two labelled periods averaged out to have very similar power consumption.
Again, this analysis was completed entirely within ClickHouse, with only a little Python used to render visualisations in the notebook.
A notebook describing the full worked example can be found at this URL.