
Using ClickHouse Cloud To Power ML Feature Stores - A Fraud Detection Example

Benjamin Wootton


In a recent blog post we walked through the process of combining AWS SageMaker with ClickHouse to develop and deploy a machine learning model.

In that article we demonstrated the value that a high-performance OLAP database such as ClickHouse can offer across the MLOps lifecycle, from exploratory analysis and managing training data to maintaining feature stores and supporting observability and monitoring.

In this article we dig deeper into one element of that: the feature store. It builds on a recent post by the ClickHouse team, Powering Feature Stores With ClickHouse, but shows a slightly different approach and technology stack based on SageMaker, the AWS machine learning platform.

We also touch on how we can use this technology stack for explainability, examining the impact of the different features on the model output. We will make use of SageMaker Clarify to support this.

What Is A Feature Store?

A feature store is a place to store the cleaned and prepared input data that we use to train a machine learning model.

Though technically this could simply be a table in a database, the feature store abstraction provides the following benefits:

  • Features can be shared with different teams and team members, avoiding duplication of effort and creating consistency across the organisation;
  • Features can be shared across training and inference. If we train a model, we can then use the same data at inference time;
  • Metadata can be captured, such as the version, lineage information, creation and update dates, and the author. This can be used to improve reproducibility and auditability.

There are feature store products and platforms, such as Featureform, which support and manage this process. However, we view the feature store as more of a pattern or workflow than a piece of software. This means it can be implemented directly in ClickHouse without any dependency on third-party software.
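
To make this concrete, the sketch below shows what a minimal feature store table with basic metadata might look like in ClickHouse, using the same Python ClickHouse client as the rest of this post. This is purely illustrative; the table and column names are our own and are not used later in this article.

# a sketch only: feature values stored alongside simple metadata
client.command(
    '''
    create table if not exists feature_store_example (
        entity_id String,                  -- e.g. a credit card number
        feature_name String,               -- e.g. 'average_transaction_amount_1_day'
        feature_value Float64,
        feature_version UInt32,            -- supports reproducibility across model versions
        created_at DateTime default now(), -- when the feature value was computed
        created_by String                  -- the author or pipeline that produced it
    )
    engine = MergeTree
    order by (entity_id, feature_name, feature_version)
    '''
)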

What Is Feature Engineering?

Feature Engineering is the process by which we take raw data and derive features from it. This could involve selecting a subset of data, filling in gaps, normalising data or encoding values to make it ready for use.
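
As a small illustration (a sketch only, using a hypothetical transactions table with amount and transaction_time columns), we could fill gaps with ifNull and z-score normalise an amount column directly in SQL:

# a sketch of simple feature engineering steps on a hypothetical table
client.query_df(
    '''
    select
        ifNull(amount, 0) as amount_filled,
        (amount - avg(amount) over ()) / stddevPop(amount) over () as amount_normalised,
        toHour(transaction_time) as transaction_hour
    from
        transactions
    '''
)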

Features are built or engineered and the results are then placed into the feature store, where they can be used for both training and inference.

During model development this could be a one-off activity. However, as machine learning models move to production, this process needs to be automated and executed continually so that the feature store stays up to date as new source data is created.

A Real World Example

We are now going to demonstrate feature stores in ClickHouse using a fraud detection scenario.

We will be working with the Credit Card Transactions Fraud Detection Dataset, which has been loaded into a ClickHouse table called fraud_data.

SELECT
    split,
    trans_date_trans_time,
    category,
    cc_num,
    merchant,
    city,
    is_fraud
FROM fraud_data
ORDER BY trans_date_trans_time ASC
LIMIT 10

Query id: 827aa119-d9cc-4b6d-b998-72c734c800f3

┌─split─┬─trans_date_trans_time─┬─category──────┬───────────cc_num─┬─merchant────────────────────────────┬─city─────────┬─is_fraud─┐
│ train │   2019-01-01 00:52:57 │ grocery_pos   │ 3501509250702469 │ fraud_Hackett-Lueilwitz             │ Naples       │        0 │
│ train │   2019-01-01 00:57:17 │ shopping_net  │  378006354724784 │ fraud_Ankunding LLC                 │ Etlan        │        0 │
│ train │   2019-01-01 00:58:02 │ gas_transport │    4481131401752 │ fraud_Greenholt, Jacobi and Gleason │ Shrewsbury   │        0 │
│ train │   2019-01-01 01:02:53 │ grocery_net   │     571365235126 │ fraud_VonRueden Group               │ Elberta      │        0 │
│ train │   2019-01-01 01:09:41 │ gas_transport │ 4683520018489354 │ fraud_Ledner-Pfannerstill           │ Hawthorne    │        0 │
│ train │   2019-01-01 01:15:01 │ grocery_net   │ 6011492816282597 │ fraud_Kutch-Hegmann                 │ Smackover    │        0 │
│ train │   2019-01-01 01:17:44 │ grocery_pos   │ 4149238353975790 │ fraud_Mosciski, Gislason and Mertz  │ Cokeburg     │        0 │
│ train │   2019-01-01 01:32:28 │ misc_net      │   30270432095985 │ fraud_Herman, Treutel and Dickens   │ Lomax        │        0 │
│ train │   2019-01-01 01:49:37 │ grocery_net   │ 3583635130604947 │ fraud_Cassin-Harvey                 │ Philadelphia │        0 │
│ train │   2019-01-01 02:12:48 │ grocery_pos   │ 6011552879910390 │ fraud_Lockman Ltd                   │ Lonetree     │        0 │
└───────┴───────────────────────┴───────────────┴──────────────────┴─────────────────────────────────────┴──────────────┴──────────┘

10 rows in set. Elapsed: 0.004 sec. Processed 8.19 thousand rows, 788.98 KB (1.94 million rows/s., 187.21 MB/s.)
Peak memory usage: 14.72 MiB.

The dataset contains 120,000 credit card transactions between the 1st of January 2019 and the 31st of December 2020. Approximately 4% of the transactions are fraudulent.

SELECT
    is_fraud,
    count(*)
FROM fraud_data
GROUP BY is_fraud

Query id: 433037b4-36dd-403c-b977-7db4c90c4bb5

┌─is_fraud─┬─count()─┐
│        0 │  114960 │
│        1 │    5040 │
└──────────┴─────────┘

2 rows in set. Elapsed: 0.005 sec. Processed 120.00 thousand rows, 80.67 KB (24.46 million rows/s., 16.44 MB/s.)
Peak memory usage: 49.06 KiB.

Create A Feature Store In ClickHouse

Our feature engineering step will be based on a ClickHouse view. The view, named fraud_feature_store, is built against the fraud_data table and performs a series of transformations. This view will be our feature store and will be used for both training and inference.

We will use the following features:

  • the merchant and category of the transaction;
  • the customer's age and city;
  • the transaction amount;
  • the hour and day of the week of the transaction;
  • the distance between the customer's address and the merchant's address;
  • the average amount of the customer's transactions over the last 24 hours;
  • the number of transactions made by the customer over the last 24 hours.

We use various ClickHouse functions for this feature engineering. The toHour and toDayOfWeek functions extract the hour and weekday from a timestamp, and the geoDistance function calculates the distance in metres between two pairs of longitude-latitude coordinates (in this case the customer's address and the merchant's address). ClickHouse window functions allow us to calculate the number of transactions and the average transaction amount for each customer over rolling windows of 24 hours.

# drop the feature store view if it already exists
client.command(
    '''
    drop view if exists fraud_feature_store sync
    '''
)

# create the feature store as a view over the raw fraud_data table
client.command(
    '''
    create view
        fraud_feature_store
    as select
        trans_date_trans_time,
        cc_num,
        merchant,
        category,
        toInt32((now() - dob) / (60 * 60 * 24 * 365)) as customer_age,
        city as customer_city,
        amt as transaction_amount,
        toHour(trans_date_trans_time) as transaction_hour,
        toDayOfWeek(trans_date_trans_time) as transaction_weekday,
        geoDistance(long, lat, merch_long, merch_lat) / 1000 as merchant_distance,
        avg(amt) over (partition by cc_num order by toHour(trans_date_trans_time) Range between 24 preceding and current row) as average_transaction_amount_1_day,
        count(*) over (partition by cc_num order by toHour(trans_date_trans_time) Range between 24 preceding and current row) as transaction_count_1_day
    from
        fraud_data
    '''
)

Once built, the view has the following contents:

SELECT *
FROM fraud_feature_store
LIMIT 10

Query id: 40531386-12b4-4838-8180-641149cf490a

┌─trans_date_trans_time─┬──────cc_num─┬─merchant─────────────────────────┬─category──────┬─customer_age─┬─customer_city─┬─transaction_amount─┬─transaction_hour─┬─transaction_weekday─┬─merchant_distance─┬─average_transaction_amount_1_day─┬─transaction_count_1_day─┐
│   2019-09-15 00:07:00 │ 60416207185 │ fraud_Bogisich Inc               │ grocery_pos   │           37 │ Fort Washakie │              72.01 │                0 │                   7 │    117.1016484375 │                66.28666666666666 │                       3 │
│   2020-06-11 00:05:38 │ 60416207185 │ fraud_Schumm, Bauch and Ondricka │ grocery_pos   │           37 │ Fort Washakie │              89.63 │                0 │                   4 │       98.90740625 │                66.28666666666666 │                       3 │
│   2020-12-30 00:16:46 │ 60416207185 │ fraud_Skiles-Ankunding           │ grocery_net   │           37 │ Fort Washakie │              37.22 │                0 │                   3 │      86.341390625 │                66.28666666666666 │                       3 │
│   2019-04-08 01:14:17 │ 60416207185 │ fraud_Rodriguez Group            │ gas_transport │           37 │ Fort Washakie │               62.2 │                1 │                   1 │     41.5816171875 │                            65.69 │                      12 │
│   2019-03-01 01:32:53 │ 60416207185 │ fraud_Marks Inc                  │ gas_transport │           37 │ Fort Washakie │              13.17 │                1 │                   5 │     48.0112578125 │                            65.69 │                      12 │
│   2019-03-17 01:08:52 │ 60416207185 │ fraud_Harber Inc                 │ gas_transport │           37 │ Fort Washakie │              103.5 │                1 │                   7 │      110.28409375 │                            65.69 │                      12 │
│   2019-06-02 01:55:54 │ 60416207185 │ fraud_Kuhic Inc                  │ grocery_pos   │           37 │ Fort Washakie │             102.49 │                1 │                   7 │     97.0711328125 │                            65.69 │                      12 │
│   2019-07-11 01:46:43 │ 60416207185 │ fraud_Jaskolski-Dibbert          │ grocery_net   │           37 │ Fort Washakie │              37.04 │                1 │                   4 │     66.2563203125 │                            65.69 │                      12 │
│   2019-09-18 01:45:12 │ 60416207185 │ fraud_McDermott-Weimann          │ grocery_pos   │           37 │ Fort Washakie │             100.19 │                1 │                   3 │       68.41340625 │                            65.69 │                      12 │
│   2019-11-25 01:33:06 │ 60416207185 │ fraud_Lind, Huel and McClure     │ gas_transport │           37 │ Fort Washakie │              38.13 │                1 │                   1 │    45.04716796875 │                            65.69 │                      12 │
└───────────────────────┴─────────────┴──────────────────────────────────┴───────────────┴──────────────┴───────────────┴────────────────────┴──────────────────┴─────────────────────┴───────────────────┴──────────────────────────────────┴─────────────────────────┘

10 rows in set. Elapsed: 0.122 sec. Processed 120.00 thousand rows, 15.04 MB (981.78 thousand rows/s., 123.07 MB/s.)
Peak memory usage: 40.34 MiB.

Creating this view or feature store means that any data scientist could use this cleaned data as inputs into their machine learning model, without needing to reinvent the wheel.

Though we have used a simple view here, we could have taken a few different approaches:

  • We could have used a materialised view to materialise the fraud_feature_store table to disk. This would improve performance at inference time and make it easier to keep the feature store up to date as the base table is updated (a sketch of this approach follows this list);
  • We could have made use of an external tool such as dbt to externalise the view definition into a source controlled file. Features could then be rebuilt and tested periodically as a batch;
  • We could have made use of a virtual feature store product such as Featureform as outlined in the ClickHouse blog post to manage this process.
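
As a rough sketch of the first option (names assumed; this is not the approach used in the rest of this article), the per-row features could be maintained with a materialised view. The rolling-window features are omitted here because a materialised view only sees each newly inserted block of rows, so window-based features would need a different approach, for example pre-aggregation into an AggregatingMergeTree target table.

# a sketch only: materialise the per-row features to disk;
# new inserts into fraud_data are reflected automatically
client.command(
    '''
    create materialized view if not exists fraud_feature_store_mv
    engine = MergeTree
    order by (cc_num, trans_date_trans_time)
    populate
    as select
        trans_date_trans_time,
        cc_num,
        merchant,
        category,
        city as customer_city,
        amt as transaction_amount,
        toHour(trans_date_trans_time) as transaction_hour,
        toDayOfWeek(trans_date_trans_time) as transaction_weekday,
        geoDistance(long, lat, merch_long, merch_lat) / 1000 as merchant_distance
    from
        fraud_data
    '''
)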

We believe that using a simple view demonstrates again how ClickHouse can simplify your technology stack, relying on its horsepower to avoid external dependencies.

Preparing Training Data

Because we are using SageMaker built-in algorithms, we will unfortunately need to extract our training data from ClickHouse into static files stored on S3 for training. If we were to develop a custom algorithm then we could of course connect directly to ClickHouse.

When we build the CSV, we will make sure to include the target variable in the first column, as required by the built-in algorithm.

train_data = client.query_df(
    '''
    select
        a.is_fraud as is_fraud,
        b.merchant as merchant,
        b.category as category,
        b.customer_age as customer_age,
        b.customer_city as customer_city,
        b.transaction_amount as transaction_amount,
        b.transaction_hour as transaction_hour,
        b.transaction_weekday as transaction_weekday,
        b.merchant_distance as merchant_distance,
        b.average_transaction_amount_1_day as average_transaction_amount_1_day,
        b.transaction_count_1_day as transaction_count_1_day
    from
        fraud_data as a
    left join
        fraud_feature_store as b
    on
        a.trans_date_trans_time == b.trans_date_trans_time
    and
        a.cc_num == b.cc_num
    where
        a.split == 'train'
    order by
        a.trans_date_trans_time asc
    '''
)

train_data_file = session.upload_string_as_file_body(
    body=train_data.to_csv(index=False, header=False),
    bucket=bucket,
    key='train/data.csv'
)   

We also create a JSON file with the column indices of the categorical features and save it in S3 together with the training data.

cat_index_file = session.upload_string_as_file_body(
    body=json.dumps(dict(cat_index_list=[1, 2, 4, 6, 7])),
    bucket=bucket,
    key='train/categorical_index.json'
)

Training The Model

Now that the training data is available in S3, we can proceed to training the model.

We will use the SageMaker built-in CatBoost algorithm to classify transactions as fraudulent (positive class) or not fraudulent (negative class).

Given that the dataset is imbalanced (as the proportion of fraudulent transactions is very small), we set the scale_pos_weight hyperparameter (which by default is set equal to one) equal to the ratio between the number of training samples in the negative class and the number of training samples in the positive class.

# get the default hyperparameters
hyperparameters = retrieve_default(
    model_id=model_id,
    model_version=model_version
)

# update the 'scale_pos_weight' hyperparameter
hyperparameters['scale_pos_weight'] = (train_data['is_fraud'] == 0).sum() / (train_data['is_fraud'] == 1).sum()

After that we proceed with training the model:

# retrieve the URI of the training image
training_image_uri = sagemaker.image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope='training',
    instance_type=instance_type
)

# retrieve the URI of the training script
training_source_uri = sagemaker.script_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    script_scope='training',
)

# retrieve the URI of the pre-trained model; this is not used
# in this case, but the estimator requires it as an input
training_model_uri = sagemaker.model_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    model_scope='training',
)

# create the estimator
estimator = sagemaker.estimator.Estimator(
    role=role,
    image_uri=training_image_uri,
    source_dir=training_source_uri,
    model_uri=training_model_uri,
    instance_count=1,
    instance_type=instance_type,
    max_run=3600,
    output_path=f's3://{bucket}/output',
    entry_point='transfer_learning.py',
    hyperparameters=hyperparameters,
)

# run the training job
estimator.fit({'training': f's3://{bucket}/train'})

Deploy The Model

After the model has been trained, we deploy it to a real-time inference endpoint.

# retrieve the URI of the inference image
inference_image_uri = sagemaker.image_uris.retrieve(
    region=None,
    framework=None,
    image_scope='inference',
    model_id=model_id,
    model_version=model_version,
    instance_type=instance_type,
)

# retrieve the URI of the inference script
inference_source_uri = sagemaker.script_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    script_scope='inference',
)

# deploy the model
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    entry_point='inference.py',
    image_uri=inference_image_uri,
    source_dir=inference_source_uri,
    explainer_config=ExplainerConfig(
        clarify_explainer_config=ClarifyExplainerConfig(
            inference_config=ClarifyInferenceConfig(
                feature_headers=feature_headers,
                probability_attribute='probabilities'
            ),
            shap_config=ClarifyShapConfig(
                shap_baseline_config=ClarifyShapBaselineConfig(
                    mime_type='text/csv',
                    shap_baseline=serializer.serialize(baseline)
                ),
                seed=42,
            )
        )
    ),
)

The model has now been deployed into a fully managed endpoint which supports real-time, interactive, low latency and autoscaling inference. Remember that it was trained on data that was engineered and maintained in our ClickHouse feature store.

Testing The Model

Our next step is to test and evaluate the model's performance, which we will do by invoking the endpoint.

We stream the test data (the rows where split = 'test') from our ClickHouse feature store as data frames, invoke the endpoint with each block of test data, and save the results back to ClickHouse in a new table called fraud_model_evaluation.

# create a Clickhouse table for storing the results
client.command(
    '''
    drop table if exists fraud_model_evaluation sync
    '''
)

client.command(
    '''
    create or replace table fraud_model_evaluation (
        trans_date_trans_time DateTime,
        cc_num Int64,
        actual_label Float64,
        predicted_label Float64,
        predicted_probability Float64,
    )
    engine
        MergeTree
    order by
        trans_date_trans_time
    '''
)

# stream blocks of test data from Clickhouse
with client.query_df_stream(
    '''
    select
        a.trans_date_trans_time as trans_date_trans_time,
        a.cc_num as cc_num,
        a.is_fraud as is_fraud,
        b.merchant as merchant,
        b.category as category,
        b.customer_age as customer_age,
        b.customer_city as customer_city,
        b.transaction_amount as transaction_amount,
        b.transaction_hour as transaction_hour,
        b.transaction_weekday as transaction_weekday,
        b.merchant_distance as merchant_distance,
        b.average_transaction_amount_1_day as average_transaction_amount_1_day,
        b.transaction_count_1_day as transaction_count_1_day
    from
        fraud_data as a
    left join
        fraud_feature_store as b
    on
        a.trans_date_trans_time == b.trans_date_trans_time
    and
        a.cc_num == b.cc_num
    where
        a.split == 'test'
    order by
        a.trans_date_trans_time asc
    '''
    ) as stream:

    for block in stream:

        # invoke the endpoint with the test data
        response = session.sagemaker_runtime_client.invoke_endpoint(
            EndpointName=predictor.endpoint_name,
            ContentType='text/csv',
            Body=serializer.serialize(block.iloc[:, 3:]),
            EnableExplanations="`false`"
        )

        # deserialize the endpoint's response
        response = deserializer.deserialize(response['Body'], 'text/csv')

        # extract the predicted probabilities
        probabilities = np.array(json.loads(response['predictions']['data'])['probabilities'])[:, 1]

        # save the results in Clickhouse
        client.insert_df(
            'fraud_model_evaluation',
            pd.DataFrame({
                'trans_date_trans_time': block['trans_date_trans_time'],
                'cc_num': block['cc_num'],
                'actual_label': block['is_fraud'],
                'predicted_label': np.where(probabilities > 0.5, 1., 0.),
                'predicted_probability': probabilities,
            })
        )

We now have a table which contains the predicted label, the predicted probability and the actual label, which will allow us to evaluate the overall performance of our model using classification metrics such as accuracy, precision and recall.
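
Because the predicted and actual labels sit side by side in ClickHouse, these metrics can be computed with a single query. The sketch below uses metric expressions of our own, with the 0.5 threshold already applied to predicted_label:

# a sketch of evaluating the classifier directly in ClickHouse
metrics = client.query_df(
    '''
    select
        countIf(predicted_label = actual_label) / count(*) as accuracy,
        countIf(predicted_label = 1 and actual_label = 1) / countIf(predicted_label = 1) as fraud_precision,
        countIf(predicted_label = 1 and actual_label = 1) / countIf(actual_label = 1) as fraud_recall
    from
        fraud_model_evaluation
    '''
)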

SELECT *
FROM fraud_model_evaluation
LIMIT 10

Query id: 943b716e-111d-4764-8b22-33dddec888d7

┌─trans_date_trans_time─┬──────────────cc_num─┬─actual_label─┬─predicted_label─┬──predicted_probability─┐
│   2020-06-21 12:29:36 │ 4110266553600176127 │            0 │               0 │    0.04302730470872631 │
│   2020-06-21 12:30:28 │        571465035400 │            0 │               0 │  0.0022659228438543038 │
│   2020-06-21 12:38:04 │      30234966027947 │            0 │               0 │   0.020925610768929633 │
│   2020-06-21 12:54:28 │      38947654498698 │            0 │               0 │   0.001706752155331339 │
│   2020-06-21 12:55:21 │ 4503101193493052864 │            0 │               0 │ 0.00027441968721855255 │
│   2020-06-21 13:10:12 │    2296006538441789 │            0 │               0 │ 0.00008039290651046504 │
│   2020-06-21 13:16:07 │      30118423745458 │            0 │               0 │   0.016814877898505554 │
│   2020-06-21 13:17:31 │    3540210836308425 │            0 │               0 │  0.0019521478695133723 │
│   2020-06-21 13:48:14 │    3568255211412877 │            0 │               0 │  0.0004861581170288029 │
│   2020-06-21 13:51:45 │    6011367958204270 │            0 │               0 │  0.0015719526206322725 │
└───────────────────────┴─────────────────────┴──────────────┴─────────────────┴────────────────────────┘

10 rows in set. Elapsed: 0.048 sec. Processed 8.19 thousand rows, 237.86 KB (171.28 thousand rows/s., 4.97 MB/s.)
Peak memory usage: 2.62 MiB.

Explainability With SageMaker Clarify

Explainability involves being able to understand and reason about how our machine learning models are behaving. This is important to businesses that want to avoid bias and to manage security, regulatory and reputational risk.

With regard to feature stores, our task here is to understand which features in our feature store are contributing the most to our predictions. Again, we will tie this process back into ClickHouse to explain how it could work.

As mentioned above, we will make use of SageMaker Clarify in order to examine the impact of the different features on the model output.

We set the baseline with respect to which the Shapley values are calculated equal to the mode for categorical features and to the mean for numerical features. We calculate the mean and the mode directly in the ClickHouse feature store using the avg and topK functions.

baseline = client.query_df(
    '''
    select
        topK(1)(merchant) as merchant,
        topK(1)(category) as category,
        avg(customer_age) as customer_age,
        topK(1)(customer_city) as customer_city,
        avg(transaction_amount) as transaction_amount,
        topK(1)(transaction_hour) as transaction_hour,
        topK(1)(transaction_weekday) as transaction_weekday,
        avg(merchant_distance) as merchant_distance,
        avg(average_transaction_amount_1_day) as average_transaction_amount_1_day,
        avg(transaction_count_1_day) as transaction_count_1_day
    from
        fraud_feature_store
    '''
)

baseline.to_dict(orient='records')

We now invoke the endpoint with online explainability enabled such that we can examine the feature importances. In the interest of time, we perform the analysis only for a few rows of the test dataset.

# create a Clickhouse table for storing the results
client.command(
    '''
    drop table if exists fraud_model_explanation sync
    '''
)

client.command(
    '''
    create or replace table fraud_model_explanation (
        trans_date_trans_time DateTime,
        cc_num Int64,
        actual_label Float64,
        predicted_label Float64,
        predicted_probability Float64,
        merchant Float64,
        category Float64,
        customer_age Float64,
        customer_city Float64,
        transaction_amount Float64,
        transaction_hour Float64,
        transaction_weekday Float64,
        merchant_distance Float64,
        average_transaction_amount_1_day Float64,
        transaction_count_1_day Float64,
    )
    engine
        MergeTree
    order by
        trans_date_trans_time
    '''
)

# extract a few rows of test data from Clickhouse
test_data = client.query_df(
    '''
    select
        a.trans_date_trans_time as trans_date_trans_time,
        a.cc_num as cc_num,
        a.is_fraud as is_fraud,
        b.merchant as merchant,
        b.category as category,
        b.customer_age as customer_age,
        b.customer_city as customer_city,
        b.transaction_amount as transaction_amount,
        b.transaction_hour as transaction_hour,
        b.transaction_weekday as transaction_weekday,
        b.merchant_distance as merchant_distance,
        b.average_transaction_amount_1_day as average_transaction_amount_1_day,
        b.transaction_count_1_day as transaction_count_1_day
    from
        fraud_data as a
    left join
        fraud_feature_store as b
    on
        a.trans_date_trans_time == b.trans_date_trans_time
    and
        a.cc_num == b.cc_num
    where
        a.split == 'test'
    order by
        a.trans_date_trans_time asc
    limit 100
    '''
)

# invoke the endpoint with the test data
response = session.sagemaker_runtime_client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType='text/csv',
    Body=serializer.serialize(test_data.iloc[:, 3:]),
    EnableExplanations="`true`"
)

# deserialize the endpoint's response
response = deserializer.deserialize(response['Body'], 'text/csv')

# extract the predicted probabilities
probabilities = np.array(json.loads(response['predictions']['data'])['probabilities'])[:, 1]

# extract the Shapley values
explanations = pd.concat([
    pd.DataFrame(
        data={k['feature_header']: [k['attributions'][0]['attribution'][1]] for k in x}, 
    ) for x in response['explanations']['kernel_shap']
])

# save the results in the Clickhouse table
client.insert_df(
    'fraud_model_explanation',
    pd.DataFrame({
        'trans_date_trans_time': test_data['trans_date_trans_time'].values,
        'cc_num': test_data['cc_num'].values,
        'actual_label': test_data['is_fraud'].values,
        'predicted_label': np.where(probabilities > 0.5, 1., 0.),
        'predicted_probability': probabilities,
        'merchant': explanations['merchant'].values,
        'category': explanations['category'].values,
        'customer_age': explanations['customer_age'].values,
        'customer_city': explanations['customer_city'].values,
        'transaction_amount': explanations['transaction_amount'].values,
        'transaction_hour': explanations['transaction_hour'].values,
        'transaction_weekday': explanations['transaction_weekday'].values,
        'merchant_distance': explanations['merchant_distance'].values,
        'average_transaction_amount_1_day': explanations['average_transaction_amount_1_day'].values,
        'transaction_count_1_day': explanations['transaction_count_1_day'].values,
    })
)

We can now analyse the results in ClickHouse, where we saved both the model outputs and the feature attributions.

SELECT *
FROM fraud_model_explanation
LIMIT 10

Query id: d7097693-c5ab-4714-9996-399eda0210e0

┌─trans_date_trans_time─┬──────────────cc_num─┬─actual_label─┬─predicted_label─┬──predicted_probability─┬──────────────merchant─┬───────────────category─┬────────────customer_age─┬─────────customer_city─┬────transaction_amount─┬──────transaction_hour─┬───transaction_weekday─┬───────merchant_distance─┬─average_transaction_amount_1_day─┬─transaction_count_1_day─┐
│   2020-06-21 12:29:36 │ 4110266553600176127 │            0 │               0 │    0.04302730470872631 │   0.00982178999650104 │   -0.02460694543601373 │ -0.00011661386529183623 │   0.05159162364605566 │ 0.0015407232864325503 │  -0.14202288009813765 │   0.07553506940002079 │  -0.0019023776701484509 │             0.003278128173949688 │     0.02295955269815257 │
│   2020-06-21 12:30:28 │        571465035400 │            0 │               0 │  0.0022659228438543038 │  0.003184952630352725 │   0.008119352910513111 │    0.002832550569565564 │ -0.008840078188409957 │ -0.012703952625243148 │  -0.04370520848579032 │  0.003778340435853218 │                       0 │            0.0007926177380136704 │   0.0018581132817937634 │
│   2020-06-21 12:38:04 │      30234966027947 │            0 │               0 │   0.020925610768929633 │  0.008284493892270605 │    -0.0609892980659499 │    0.002558141459956878 │  -0.07282305715766804 │   0.27464089451608065 │  -0.18374202760324096 │  0.001148789573975257 │                       0 │           -0.0035848349633956153 │      0.0084832745396951 │
│   2020-06-21 12:54:28 │      38947654498698 │            0 │               0 │   0.001706752155331339 │  0.011751691911841748 │   -0.09519081291035747 │   0.0011007135730096915 │ -0.042992826362180825 │   0.16565244817650526 │  -0.10712887452801056 │ 0.0010426037430713714 │                       0 │             -0.00180003111903327 │    0.022322605093279685 │
│   2020-06-21 12:55:21 │ 4503101193493052864 │            0 │               0 │ 0.00027441968721855255 │  0.006262467326881518 │   -0.03183625466381353 │   0.0008419549616608991 │ -0.018299092204194654 │  0.025497548374576864 │ -0.040728485535872716 │  0.005063120250484903 │ -0.00011839562838687409 │            0.0006559201225166296 │     0.00598640210615984 │
│   2020-06-21 13:10:12 │    2296006538441789 │            0 │               0 │ 0.00008039290651046504 │  0.006483196455949189 │   -0.03859595909562995 │  -0.0001613836303338105 │ -0.016637690078431154 │   0.02342166738625069 │  -0.02743513574182637 │ 0.0065411796046134715 │                       0 │            -0.005850509147153514 │    0.005365792575866238 │
│   2020-06-21 13:16:07 │      30118423745458 │            0 │               0 │   0.016814877898505554 │  0.008196753312638062 │  -0.018647233538194394 │ -0.00010717481952803552 │  0.011632334603547052 │   0.02219848401182192 │    -0.091553360247703 │  0.037341596666860766 │  -0.0009280051684007699 │           -0.0022167911261086037 │   0.0039490396263668695 │
│   2020-06-21 13:17:31 │    3540210836308425 │            0 │               0 │  0.0019521478695133723 │  0.001562864785141437 │   0.005745233479306727 │  -0.0001717654058456322 │ -0.009340324910739994 │  -0.01184749578764574 │  -0.03564268879027224 │  0.003411740652857297 │  -0.0006815050956945207 │            0.0002180557371159659 │   0.0017487986280843928 │
│   2020-06-21 13:48:14 │    3568255211412877 │            0 │               0 │  0.0004861581170288029 │  0.006894657739249881 │ -0.0019682269507874488 │ -0.00018531706600835324 │  -0.01756068404320486 │  0.013687753056055706 │  -0.04958858288243883 │  0.004308834618308815 │  -0.0014173203861177824 │           -0.0035733376830726087 │   0.0029391471378386116 │
│   2020-06-21 13:51:45 │    6011367958204270 │            0 │               0 │  0.0015719526206322725 │ 0.0016091960382235806 │  -0.014918809297852779 │  -0.0001817600187756993 │ -0.016747371300801783 │  0.024536097282455992 │  -0.04508522765515939 │  0.006576748914075408 │                       0 │            0.0009308074346148863 │  -0.0020969633533536175 │
└───────────────────────┴─────────────────────┴──────────────┴─────────────────┴────────────────────────┴───────────────────────┴────────────────────────┴─────────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴─────────────────────────┴──────────────────────────────────┴─────────────────────────┘

10 rows in set. Elapsed: 0.046 sec.

The average absolute Shapley values indicate that the most important features are the transaction amount, the transaction category and the time and location of the transaction.

fraud_model_explanation = client.query_df('select * from fraud_model_explanation')
fraud_model_explanation.iloc[:, 5:].abs().mean().sort_values().plot(kind='barh')
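
Since the attributions are stored in ClickHouse, the same summary could equally be computed in the database itself, for example (a sketch covering a subset of the attribution columns):

client.query_df(
    '''
    select
        avg(abs(transaction_amount)) as transaction_amount,
        avg(abs(category)) as category,
        avg(abs(customer_city)) as customer_city,
        avg(abs(transaction_hour)) as transaction_hour,
        avg(abs(merchant_distance)) as merchant_distance
    from
        fraud_model_explanation
    '''
)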

This is a simple example, but shows how we have closed the loop from our feature store to being able to explain which features are most relevant during inference.

Perform Real-Time Inference With An Online Feature Store

In the code below, we demonstrate a more realistic scenario where transaction events are processed in real time.

We stream the credit card transactions from ClickHouse as pairs of timestamp and account number, extract each transaction's features from the ClickHouse online feature store, invoke the model endpoint with these features, and save the results back to ClickHouse. In the interest of time, we run the code below for only a few transaction events.

# create a Clickhouse table for storing the results
client.command(
    '''
    drop table if exists fraud_real_time_predictions sync
    '''
)

client.command(
    '''
    create or replace table fraud_real_time_predictions (
        trans_date_trans_time DateTime,
        cc_num Int64,
        predicted_label Float64,
        predicted_probability Float64,
        merchant Float64,
        category Float64,
        customer_age Float64,
        customer_city Float64,
        transaction_amount Float64,
        transaction_hour Float64,
        transaction_weekday Float64,
        merchant_distance Float64,
        average_transaction_amount_1_day Float64,
        transaction_count_1_day Float64,
    )
    engine
        MergeTree
    order by
        trans_date_trans_time
    '''
)

# stream the transactions from Clickhouse
with client.query_rows_stream(
    '''
    select
        trans_date_trans_time,
        cc_num
    from
        fraud_data
    where
        split == 'test'
    order by
        trans_date_trans_time asc
    limit 10
    '''
    ) as stream:

    for row in stream:

        # simulating an online feature store here
        # lookup the transaction's features in the Clickhouse feature store
        features = client.query_df(
            f'''
            select
                merchant,
                category,
                customer_age,
                customer_city,
                transaction_amount,
                transaction_hour,
                transaction_weekday,
                merchant_distance,
                average_transaction_amount_1_day,
                transaction_count_1_day
            from
                fraud_feature_store
            where
                trans_date_trans_time == toDateTime('{str(row[0])}')
            and
                cc_num == {row[1]}
            '''
        )

        # invoke the endpoint with the features
        response = session.sagemaker_runtime_client.invoke_endpoint(
            EndpointName=predictor.endpoint_name,
            ContentType='text/csv',
            Body=serializer.serialize(features),
            EnableExplanations="`true`"
        )

        # deserialize the endpoint's response
        response = deserializer.deserialize(response['Body'], 'text/csv')

        # extract the predicted probabilities
        probabilities = json.loads(response['predictions']['data'])['probabilities'][0][1]

        # extract the Shap values
        explanations = {k['feature_header']: k['attributions'][0]['attribution'][1] for k in response['explanations']['kernel_shap'][0]}

        # save the results in the Clickhouse table
        client.insert_df(
            'fraud_real_time_predictions',
            pd.DataFrame({
                'trans_date_trans_time': row[0],
                'cc_num': row[1],
                'predicted_label': np.where(probabilities > 0.5, 1., 0.),
                'predicted_probability': probabilities,
                'merchant': explanations['merchant'],
                'category': explanations['category'],
                'customer_age': explanations['customer_age'],
                'customer_city': explanations['customer_city'],
                'transaction_amount': explanations['transaction_amount'],
                'transaction_hour': explanations['transaction_hour'],
                'transaction_weekday': explanations['transaction_weekday'],
                'merchant_distance': explanations['merchant_distance'],
                'average_transaction_amount_1_day': explanations['average_transaction_amount_1_day'],
                'transaction_count_1_day': explanations['transaction_count_1_day'],
            }, index=[0])
        )        

We can then query ClickHouse to examine the results of the real-time inference:

SELECT *
FROM fraud_real_time_predictions
LIMIT 10

Query id: 3e983ba3-affa-46c2-bf31-0ad7e1458e91

┌─trans_date_trans_time─┬──────────────cc_num─┬─predicted_label─┬──predicted_probability─┬─────────────merchant─┬──────────────category─┬────────────customer_age─┬─────────customer_city─┬────transaction_amount─┬──────transaction_hour─┬───transaction_weekday─┬───────merchant_distance─┬─average_transaction_amount_1_day─┬─transaction_count_1_day─┐
│   2020-06-21 12:29:36 │ 4110266553600176127 │               0 │    0.04302730470872631 │  0.00982178999650104 │  -0.02460694543601373 │ -0.00011661386529183623 │   0.05159162364605566 │ 0.0015407232864325503 │  -0.14202288009813765 │   0.07553506940002079 │  -0.0019023776701484509 │             0.003278128173949688 │     0.02295955269815257 │
│   2020-06-21 12:30:28 │        571465035400 │               0 │  0.0022659228438543038 │ 0.003184952630352725 │  0.008119352910513111 │    0.002832550569565564 │ -0.008840078188409957 │ -0.012703952625243148 │  -0.04370520848579032 │  0.003778340435853218 │                       0 │            0.0007926177380136704 │   0.0018581132817937634 │
│   2020-06-21 12:38:04 │      30234966027947 │               0 │   0.020925610768929633 │ 0.008284493892270605 │   -0.0609892980659499 │    0.002558141459956878 │  -0.07282305715766804 │   0.27464089451608065 │  -0.18374202760324096 │  0.001148789573975257 │                       0 │           -0.0035848349633956153 │      0.0084832745396951 │
│   2020-06-21 12:54:28 │      38947654498698 │               0 │   0.001706752155331339 │ 0.011751691911841748 │  -0.09519081291035747 │   0.0011007135730096915 │ -0.042992826362180825 │   0.16565244817650526 │  -0.10712887452801056 │ 0.0010426037430713714 │                       0 │             -0.00180003111903327 │    0.022322605093279685 │
│   2020-06-21 12:55:21 │ 4503101193493052864 │               0 │ 0.00027441968721855255 │ 0.006262467326881518 │  -0.03183625466381353 │   0.0008419549616608991 │ -0.018299092204194654 │  0.025497548374576864 │ -0.040728485535872716 │  0.005063120250484903 │ -0.00011839562838687409 │            0.0006559201225166296 │     0.00598640210615984 │
│   2020-06-21 13:10:12 │    2296006538441789 │               0 │ 0.00008039290651046504 │ 0.006483196455949189 │  -0.03859595909562995 │  -0.0001613836303338105 │ -0.016637690078431154 │   0.02342166738625069 │  -0.02743513574182637 │ 0.0065411796046134715 │                       0 │            -0.005850509147153514 │    0.005365792575866238 │
│   2020-06-21 13:16:07 │      30118423745458 │               0 │   0.016814877898505554 │ 0.008196753312638062 │ -0.018647233538194394 │ -0.00010717481952803552 │  0.011632334603547052 │   0.02219848401182192 │    -0.091553360247703 │  0.037341596666860766 │  -0.0009280051684007699 │           -0.0022167911261086037 │   0.0039490396263668695 │
│   2020-06-21 13:17:31 │    3540210836308425 │               0 │  0.0019521478695133723 │ 0.001562864785141437 │  0.005745233479306727 │  -0.0001717654058456322 │ -0.009340324910739994 │  -0.01184749578764574 │  -0.03564268879027224 │  0.003411740652857297 │  -0.0006815050956945207 │            0.0002180557371159659 │   0.0017487986280843928 │
└───────────────────────┴─────────────────────┴─────────────────┴────────────────────────┴──────────────────────┴───────────────────────┴─────────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴─────────────────────────┴──────────────────────────────────┴─────────────────────────┘
┌─trans_date_trans_time─┬───────────cc_num─┬─predicted_label─┬─predicted_probability─┬──────────────merchant─┬───────────────category─┬────────────customer_age─┬─────────customer_city─┬───transaction_amount─┬─────transaction_hour─┬──transaction_weekday─┬──────merchant_distance─┬─average_transaction_amount_1_day─┬─transaction_count_1_day─┐
│   2020-06-21 13:48:14 │ 3568255211412877 │               0 │ 0.0004861581170288029 │  0.006894657739249881 │ -0.0019682269507874488 │ -0.00018531706600835324 │  -0.01756068404320486 │ 0.013687753056055706 │ -0.04958858288243883 │ 0.004308834618308815 │ -0.0014173203861177824 │           -0.0035733376830726087 │   0.0029391471378386116 │
│   2020-06-21 13:51:45 │ 6011367958204270 │               0 │ 0.0015719526206322725 │ 0.0016091960382235806 │  -0.014918809297852779 │  -0.0001817600187756993 │ -0.016747371300801783 │ 0.024536097282455992 │ -0.04508522765515939 │ 0.006576748914075408 │                      0 │            0.0009308074346148863 │  -0.0020969633533536175 │
└───────────────────────┴──────────────────┴─────────────────┴───────────────────────┴───────────────────────┴────────────────────────┴─────────────────────────┴───────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴────────────────────────┴──────────────────────────────────┴─────────────────────────┘

10 rows in set. Elapsed: 0.070 sec.

Conclusion

In this article we demonstrated how ClickHouse can be used to power online feature stores without any dependence on third-party software.

We carried out feature engineering directly within ClickHouse and demonstrated how our features could be ingested into a feature store, in this instance using a simple ClickHouse view.

We trained a model within SageMaker and deployed it to an AWS real-time endpoint. We simulated an online feature store, looking up features from the feature store in order to make an online inference.

We interacted with this endpoint to test model performance and to demonstrate explainability which is likely to be critically important as businesses move more machine learning systems to production.

All in all, this is a simple but powerful approach, and the performance of ClickHouse is a real differentiator for this use case.

Together with our first blog post, we hope we are making the point that ClickHouse is a great fit for supporting the MLOps workflow!
