In this lesson we will:
- Explain the concept of a Data Lake;
- Discuss how lakes and warehouses are being combined into the Data Lakehouse.
Data Warehouses, discussed in the previous lesson, have historically been the engine of enterprise Data and Analytics platforms.
As discussed, the technology and practices around relational Data Warehousing are mature and battle-tested, and still meet most of the analytics challenges faced by most businesses today.
There are, however, situations and use cases where traditional Data Warehouses have downsides and limitations:
- Big Data - Data Warehouses can scale very well, but are not optimised for extremely high volumes of data such as log, machine or clickstream data;
- Unstructured Data - Data Warehouses are very good with structured, relational, table-style data, but have not traditionally been as fully featured when working with semi-structured data such as XML and JSON, or unstructured data such as free text, video and audio;
- Ad-Hoc Analysis - The Data Warehouse requires a schema to be designed up front, which is then populated from your ingested data. This means that data teams have to consider in advance how people will want to use and consume the data. This can impose limitations on consumers who want to perform arbitrary ad-hoc analysis;
- Ownership Bottlenecks - The Data Warehouse tends to be owned by a centralised team, which can become a dependency or a bottleneck for Data Analysts and Data Scientists who want to get access to that data.
Modern Data Warehouse products are tackling all of the above downsides and capability gaps in different ways, so these criticisms are becoming less relevant in today's world. They are still relevant considerations, however.
Data Lakes grew as a concept from 2010 onwards, partly in response to some of the above criticisms, partly due to emerging technology, and partly due to increasing business demands to extract more value from increasingly large and complex datasets.
A Data Lake can be thought of as a place to store all of your raw data, whether structured, semi-structured or unstructured, where it is made available to your business users for their analytics use cases. As with the Data Warehouse, this data can be sourced from across your business applications and data sources, and brought into the organised Data Lake for the purpose of analytics.
In practice, you can think of the Data Lake like a file system, where we have a tree of folders which contain different data files in different formats, potentially including CSVs, JSON, text files, data extracts and audio and video files. The key is that all of this data is relatively raw and unprocessed, usually extracted directly from the source system.
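To make the file system picture concrete, here is a minimal sketch using a temporary local folder as a stand-in for the lake. The folder names (`raw/orders`, `raw/clickstream`), file names and sample records are all hypothetical, chosen only for illustration. The last function shows the other half of the idea: because the files are stored raw, any structure (here, treating `amount` as a number) is applied by the reader at query time.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_sample_lake(root: Path) -> None:
    """Write a few raw extracts into a folder tree, one folder per source."""
    orders = root / "raw" / "orders" / "2024-01-15"
    clicks = root / "raw" / "clickstream" / "2024-01-15"
    orders.mkdir(parents=True)
    clicks.mkdir(parents=True)

    # CSV extract, stored as the (hypothetical) source system produced it.
    with open(orders / "orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerow(["1001", "19.99"])
        writer.writerow(["1002", "5.00"])

    # JSON-lines events; records do not need to share the same fields.
    events = [
        {"user": "a", "page": "/home"},
        {"user": "b", "page": "/cart", "referrer": "email"},
    ]
    with open(clicks / "events.jsonl", "w") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def list_files(root: Path) -> list[str]:
    """Return every file in the lake as a sorted, root-relative path."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


def total_order_amount(root: Path) -> float:
    """Apply structure at read time: the reader decides that amount is a number."""
    total = 0.0
    for path in (root / "raw" / "orders").rglob("*.csv"):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row["amount"])
    return total


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    build_sample_lake(lake)
    lake_files = list_files(lake)
    order_total = total_order_amount(lake)

print(lake_files)
print(round(order_total, 2))
```

Nothing in this sketch validates the files as they land, which is exactly the trade-off of a raw lake: writing is easy and flexible, while the burden of interpreting the data falls on whoever reads it.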
More often than not, this file system representing your Data Lake is hosted in the cloud, using an object storage service such as Amazon S3 or Azure Blob Storage. These object stores are fast, reliable, cheap, globally distributed and easy to secure, so they make a great foundation on which to build your Data Lake.
Where the Data Warehouse is very strong from a structure and governance point of view, the Data Lake has not traditionally offered features such as schemas, constraints, access controls and the ability to roll back changes. The underlying object store provides some of these to a degree, and automation is often built around the Data Lake to cover the rest, but this has traditionally been much easier to accomplish in a Data Warehouse.
Recognising this, vendors and the cloud providers continue to release new tools and capabilities to evolve the data lake. These include:
- Table formats such as Delta Lake, Apache Iceberg and Apache Hudi which can add features such as schemas, transactions and rollback to data stored within the Data Lake;
- SQL engines such as Trino which allow us to query files in the Data Lake, directly using SQL or indirectly through business intelligence tools;
- Tools to perform transformations and analytics on data stored in the data lake.
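As a rough intuition for how a table format can layer a schema, commits and rollback on top of plain files, the toy sketch below keeps immutable data files alongside a log of numbered snapshots. It is not how Delta Lake, Apache Iceberg or Apache Hudi are actually implemented; the `ToyTable` class, the `_log` folder and the file naming are all invented for illustration, and the sketch ignores concurrency entirely.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """A toy table: immutable data files plus a log of numbered snapshots.

    Illustrative only; real table formats use their own, far richer metadata.
    """

    def __init__(self, root: Path, columns: list[str]) -> None:
        self.root = root
        self.columns = columns      # the declared schema
        self.log = root / "_log"    # hypothetical metadata folder
        self.log.mkdir(parents=True, exist_ok=True)

    def _snapshots(self) -> list[Path]:
        return sorted(self.log.glob("*.json"))

    def current_files(self) -> list[str]:
        """Data files referenced by the latest snapshot."""
        snapshots = self._snapshots()
        if not snapshots:
            return []
        return json.loads(snapshots[-1].read_text())["files"]

    def _write_snapshot(self, files: list[str]) -> int:
        version = len(self._snapshots())
        # Zero-padded names keep lexical order equal to version order.
        path = self.log / f"{version:08d}.json"
        path.write_text(json.dumps({"files": files}))
        return version

    def append(self, rows: list[dict]) -> int:
        """Check rows against the schema, then commit them as a new snapshot."""
        for row in rows:
            if sorted(row) != sorted(self.columns):
                raise ValueError(f"row does not match schema: {row}")
        name = f"part-{len(self._snapshots()):08d}.json"
        (self.root / name).write_text(json.dumps(rows))
        return self._write_snapshot(self.current_files() + [name])

    def rollback(self, version: int) -> int:
        """Commit a new snapshot that points at an older snapshot's files."""
        old = json.loads(self._snapshots()[version].read_text())["files"]
        return self._write_snapshot(old)

    def read(self) -> list[dict]:
        rows = []
        for name in self.current_files():
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(Path(tmp) / "orders", ["order_id", "amount"])
    v0 = table.append([{"order_id": 1, "amount": 19.99}])
    v1 = table.append([{"order_id": 2, "amount": 5.0}])
    rows_before = table.read()
    table.rollback(v0)              # undo the second append
    rows_after = table.read()
    try:
        table.append([{"order_id": 3}])   # missing "amount"
        rejected = False
    except ValueError:
        rejected = True
```

The design point to notice is that data files are never modified or deleted: a rollback simply records a new snapshot that references an older set of files, so readers always see a consistent version of the table.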
Just as we are seeing Data Warehouses evolve to combat some of their downsides, we are also seeing Data Lakes evolve to combat theirs. The two sides are overlapping and becoming more unified as the Modern Data Stack evolves. Some commentators have even suggested that given time, the distinction between warehouse and lake will fade away.