In this lesson we will:
- Consider a real time analytics scenario;
- Explain how this could be implemented in a traditional data warehouse;
- Explain the trade-offs in when to introduce stream processing.
If we have a requirement for processing data in real time, we are faced with an architectural choice.
On the one hand, we could use a stream processing approach perhaps based on a framework such as Apache Flink to derive the analytics and respond to situations as data is continuously generated.
On the other hand, we could put our data directly into some high performance database or data warehouse and then query.
Imagine that an eCommerce company would like to see a real-time summary of orders that have been placed in the last hour in order to continually optimise their online advertising.
Without the real time requirement, there are many ways to solve this problem. With it however, we are slightly more constrained. Our options include:
- Hook into the source applications and access the required data directly: In this example, this might involve querying an eCommerce or ERP platform either through the API or directly into the database. Generally, querying transactional systems and databases in this way is not advisable. It could compromise stability of the application, it could be fragile to schema changes and upgrades, and it could offer security challenges. We would also need to frequently 'poll' the transactional system to ask which new orders have been placed. All said, this is not a good direction;
- Integrate the necessary data into a centralised data warehouse, OLAP store, or data lake as quickly as possible: This is the go-to approach today. It involves frequently extracting the data from the source application, and moving it into a centralised data warehouse or data lake for querying and analysis. Historically, this process has been slow and batch based with delays of hours or days before data is available. The modern data stack does improve this by implementing streaming technologies and continuous ingestion, but even much of that technology only aspires to be between 1 and 30 minutes in getting to a fully populated and materialised data available to the users;
- Implement a 'stream processor': This involves somehow ingesting the order events immediately as they happen, and calculating and maintaining the analytics as new data streams in. In this instance, it could be that our stream processor simply maintains a table of orders which have been placed in the last hour, together with some summary analytics such as the total order value. These are all updated live as data is pushed into the stream processor.
Many business with real time requirements are grappling between option 2 and option 3. They see the simplicity and the potential for doing faster data processing and analytics within the warehouse, so focus on faster ingestion and faster transformations in an aim to meet their business requirements. At the same time, data warehousing technology is evolving, giving us features such as Kafka integration, streams, scheduled tasks, external functions etc which make them more programmable and powerful. Platforms such as Snowflake and ClickHouse are at the cutting edge of this capability.
Though many business requirements do not have particularly onerous requirements for speed and data volume, there are an increasing number of businesses aspiring to be more real time and driven driven. Though Data Warehousing and OLAP stores are increasingly powerful and viable, we believe that at a certain requirement for data volumes and latency, businesses will continue to find scenarios where the data warehousing solution falls short. Two examples might be:
- Big Data Scenario - Imagine that instead of orders, our business are interested in tracking millions of website clicks, engagement metrics or data from sensors. In that instance, transferring millions of raw events into a centralised location begins to be slow, expensive and of questionable value. Instead, it might make make sense to filter, aggregate and summarise this data in flight before processing and storing it;
- Fast Data Scenario - Imagine that we wish to identify some pattern in event streams to immediately change the customer or employee experience, in a situation where seconds or minutes matter. Common examples include fraud detection, personalisation and operational analytics. The sooner we identify these situations the better, down to sub-second latency.
It is when both big and fast data combine that stream processing really moves into it's own - when we have huge volumes of data and benefit from very rapid insights.
As well as speed and volume, there is another axis here which is complexity of the real time processing and materialisations that we need to do. If we just just wish to produce simple aggregations such as orders aggregated by hour, this is naturally a use case for staying within the data warehouse, and making use of features such as Materialized views to do these rollups efficiently.
However, if wish to get into into even rudimentary statistical analysis on the stream, need procedural logic, or to interact with external systems and APIs, doing this directly within the database using tools such as stored procedures starts to smell even where it is possible.
Finally, we need to think about what we would like to do with in response to the data that we ingest, transform and analyse. If we simply want to put records and analytics into a data store ready to be served by a dashboard, it's another argument to keep everything within the data warehouse, derive the data you need, and avoid the complexity of the stream processing world. However, if we wish to get into the realms of alerting or calling external APIs in response to data, we quickly want to break out of the database and into procedural code. In this circumstance, real time stream processing is the more natural home.
All together, these are the metrics which I look for when choosing whether to go down the data warehouse or data lake route for "real time" analytics, or when to introduce stream processing. The data volumes, the required time to insight, the complexity of the analytics, and what we actually want to do having identified some situation of interest. Though the Keep It Simple approach would naturally make me the default solution data warehousing, in the right situation, I would be looking at the stream processing world.