Relational Data Warehouse technology has been the beating heart of business intelligence for many decades.
Typically, Data Warehouses act as the centralised repository of structured data for an entire business or department, and are populated with data from various line-of-business applications.
The aim is to give the business a holistic, combined view of the entirety of its data. The Data Warehouse stores the data in a format and structure suited to analytics, then serves it up to analysts and business users through reports and dashboards.
As businesses' requirements for their data and analytics become more demanding, the Data Warehousing model is coming under renewed scrutiny. Many businesses are looking towards Data Lake architectures, and technologies such as streaming analytics are gaining traction.
Choosing a future-state data architecture is a bigger question than we can answer in this one blog post, but I wanted to look at it purely from the perspective of real-time analytics. Specifically, is a data warehouse fit for purpose, and is it still the best choice, when we have requirements for very low latency or real-time access?
Query Performance
There is no denying that Data Warehouses can perform extremely well in terms of querying data. Modern Cloud Warehouses such as Snowflake, Redshift and BigQuery can scale to enormous volumes of data and serve complex queries very quickly. It's still remarkable to see a database returning an aggregation over millions or billions of rows in the blink of an eye.
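To make that concrete, here is a minimal sketch using Snowflake's Python connector; the connection details and the orders table are hypothetical placeholders. The point is that a single analytical aggregation over a huge table is exactly what these engines are optimised for.

```python
import snowflake.connector

# Hypothetical connection details: substitute your own account.
conn = snowflake.connector.connect(
    user="REPORTING_USER",
    password="********",
    account="my_account",
    warehouse="ANALYTICS_WH",
    database="SALES",
)

# A typical analytical aggregation over millions or billions of rows.
# Column-oriented storage means only the referenced columns are scanned.
cur = conn.cursor()
cur.execute("""
    SELECT region, DATE_TRUNC('month', order_ts) AS month, SUM(amount) AS total
    FROM orders
    GROUP BY region, month
    ORDER BY month, region
""")
for region, month, total in cur:
    print(region, month, total)

cur.close()
conn.close()
```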
Fast query performance over large data sets does not, however, equal real-time analytics. To achieve that, we need high and predictable performance across the entire data lifecycle: extracting data from source systems, ingesting it into the warehouse, applying transformations such as cleansing and pre-aggregation, and potentially triggering downstream alerts.
In these areas, Data Warehouse technology starts to look less real-time, primarily because the whole approach is built around batch processing.
Batch Approach
Batch processing involves ingesting and processing data in batches. For instance, data files are delivered as batches ready for ingestion, and then ETL processes run on batches of records periodically to transform and load the data into the warehouse.
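As a caricature of that pattern, the sketch below (file paths, schedule, and loader are all hypothetical) runs once an hour, transforms whatever files have landed, and bulk-loads them. A record arriving just after a run waits almost the full hour before it is visible in the warehouse.

```python
import csv
import time
from pathlib import Path

LANDING_DIR = Path("/data/landing")      # hypothetical drop zone for source extracts
PROCESSED_DIR = Path("/data/processed")
BATCH_INTERVAL_SECONDS = 3600            # hourly schedule

def transform(row: dict) -> dict:
    # Cleansing step: normalise fields before loading.
    row["region"] = row["region"].strip().upper()
    row["amount"] = round(float(row["amount"]), 2)
    return row

def load_into_warehouse(rows: list) -> None:
    # Stand-in for a bulk COPY/INSERT into the warehouse.
    print(f"loaded {len(rows)} rows")

def run_batch() -> None:
    for path in sorted(LANDING_DIR.glob("orders_*.csv")):
        with path.open() as f:
            rows = [transform(row) for row in csv.DictReader(f)]
        load_into_warehouse(rows)
        path.rename(PROCESSED_DIR / path.name)

while True:
    run_batch()
    time.sleep(BATCH_INTERVAL_SECONDS)   # data arriving now waits for the next run
```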
As soon as we introduce a batch step, we have effectively failed to deliver on the real-time requirement. Indeed, the end-to-end delay across these stages can reach minutes or even hours.
To then trigger alerts or push data downstream when a business-critical situation occurs, yet another batch job might be implemented to run periodically, compounding the delay.
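A sketch of that alerting pattern, with a hypothetical run_query helper standing in for a warehouse connector: whatever polling interval we choose becomes the worst-case lag between the event and the alert.

```python
import time

POLL_INTERVAL_SECONDS = 900  # 15-minute schedule: worst-case alert lag is 15 minutes

def run_query(sql: str) -> list:
    # Stand-in for a call through a warehouse connector.
    return []

def send_alert(message: str) -> None:
    print("ALERT:", message)

def check_for_breaches() -> None:
    rows = run_query(
        "SELECT customer_id, SUM(amount) AS total FROM orders "
        "WHERE order_ts > DATEADD('minute', -15, CURRENT_TIMESTAMP) "
        "GROUP BY customer_id HAVING SUM(amount) > 10000"
    )
    for customer_id, total in rows:
        send_alert(f"customer {customer_id} exceeded limit: {total}")

while True:
    check_for_breaches()
    time.sleep(POLL_INTERVAL_SECONDS)
```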
Small Batch Processing
Column-oriented data warehouses are designed for performant queries over large datasets. In real-time scenarios, however, we are typically interested in individual records and small batches of data as they stream in. For instance, when a new order is placed, we may wish to check the customer's previous orders to see if a limit has been reached.
Accessing individual records like this is a use case more appropriate for an OLTP/transactional relational database management system than for a data warehouse. OLAP Data Warehouses will often perform relatively poorly on these types of queries, so they are not the right platform for implementing highly granular stream processing.
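By way of contrast, here is what that per-order check looks like as a point query against an OLTP store, sketched with psycopg2 against a hypothetical PostgreSQL orders table: an indexed single-customer lookup per event, which is the access pattern transactional engines are built for.

```python
import psycopg2

conn = psycopg2.connect("dbname=shop user=app")  # hypothetical connection string

ORDER_LIMIT = 10_000  # hypothetical per-customer spending limit

def on_new_order(customer_id: int, amount: float) -> bool:
    """Called for each incoming order event; returns True if the order may proceed."""
    with conn.cursor() as cur:
        # An indexed point lookup over one customer's recent orders:
        # an OLTP-shaped query, not an analytical full-table scan.
        cur.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders "
            "WHERE customer_id = %s AND order_ts > now() - interval '30 days'",
            (customer_id,),
        )
        (running_total,) = cur.fetchone()
    return running_total + amount <= ORDER_LIMIT
```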
Predictability
Real time isn't just about low latency; it's also about predictable performance. Building around the Data Warehouse and ETL model, there are simply too many moving parts and too much unpredictability in how the end-to-end process will perform. This means we cannot build business-critical services such as fraud checks, safety systems, or user-experience personalisation on this technology stack.
Innovation In This Area
Data warehouses are attempting to raise their game in multiple areas. Snowflake, for instance, offers continuous ingestion that scales down to small batches, and streams that expose changes as data is created in source tables. ClickHouse has a very powerful materialised view concept, transforming data incrementally as it is inserted. Generally though, Data Warehouses remain batch-based, and all of the points above still hold.
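To illustrate the materialised-view point, here is a minimal sketch using the clickhouse-driver package (table and view names are illustrative): the view is maintained incrementally as each insert arrives, rather than by a scheduled batch job.

```python
from clickhouse_driver import Client

client = Client("localhost")  # assumes a reachable ClickHouse server

# Raw order events land in this table.
client.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        customer_id UInt64,
        amount      Float64,
        order_ts    DateTime
    ) ENGINE = MergeTree ORDER BY order_ts
""")

# The materialised view maintains running per-customer totals, updated
# at insert time rather than by a periodic ETL job.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS customer_totals
    ENGINE = SummingMergeTree ORDER BY customer_id
    AS SELECT customer_id, sum(amount) AS total
    FROM orders GROUP BY customer_id
""")
```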