In this lesson we will:
- Introduce the concept of the Data Lakehouse;
- Consider the advantages and disadvantages of this approach.
The Data Lakehouse is an architectural pattern for storing, managing and accessing data.
Traditionally, businesses have stored their analytical data in centralised Data Warehouses. From around 2010, the idea of the Data Lake emerged as a means of storing high volumes of structured and unstructured data in its raw formats.
The idea of the Data Lakehouse involves taking the Data Lake as a starting point, and adding features to it which are more commonly associated with the traditional Data Warehouse. For instance:
- The ability to query the data lake via standard SQL;
- The ability to update or delete data from the lake at row level;
- The ability to have transactional controls around the data lake to support multi-user access;
- The ability to add access controls to data.
When we do this, we begin to get the best features of the Data Lake and the best features of the Data Warehouse combined into one offering.
Though this unification is a relatively recent development, it potentially has big implications for enterprise data strategy.
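To make two of the features listed above more concrete (row-level deletes and transactional control), here is a minimal toy sketch in Python. It assumes a simplified design in which data files are immutable and an ordered commit log records which files belong to each version of the table. The class `ToyLakehouseTable` and all of its methods are hypothetical names invented for illustration; this is not the API of any real Lakehouse product, and real table formats are considerably more sophisticated.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLakehouseTable:
    """Toy model (hypothetical): immutable data files plus an ordered commit log."""

    files: dict = field(default_factory=dict)  # file name -> tuple of row dicts
    log: list = field(default_factory=list)    # one {"add": [...], "remove": [...]} per commit

    def version(self):
        return len(self.log)

    def _commit(self, add, remove, read_version):
        # Optimistic concurrency: reject the write if another writer has
        # committed since this writer read the table.
        if read_version != len(self.log):
            raise RuntimeError("conflict: table changed since it was read")
        self.log.append({"add": add, "remove": remove})

    def live_files(self, version=None):
        # Replay the log up to `version` to find the files in that snapshot.
        version = len(self.log) if version is None else version
        live = []
        for commit in self.log[:version]:
            live = [f for f in live if f not in commit["remove"]] + commit["add"]
        return live

    def scan(self, version=None):
        return [row for name in self.live_files(version) for row in self.files[name]]

    def append(self, rows, read_version):
        name = f"part-{len(self.files):05d}"
        self.files[name] = tuple(rows)
        self._commit([name], [], read_version)

    def delete_where(self, predicate, read_version):
        # Row-level delete via copy-on-write: only files containing matching
        # rows are rewritten; the old files are never modified in place.
        add, remove = [], []
        for name in self.live_files(read_version):
            rows = self.files[name]
            kept = tuple(r for r in rows if not predicate(r))
            if len(kept) != len(rows):
                remove.append(name)
                if kept:
                    new_name = f"part-{len(self.files):05d}"
                    self.files[new_name] = kept
                    add.append(new_name)
        self._commit(add, remove, read_version)


table = ToyLakehouseTable()
table.append([{"id": 1, "country": "UK"}, {"id": 2, "country": "FR"}], read_version=0)
table.append([{"id": 3, "country": "UK"}], read_version=1)
table.delete_where(lambda r: r["id"] == 2, read_version=2)

current = table.scan()          # rows as of the latest commit
before = table.scan(version=2)  # rows as they were before the delete
```

In this sketch, a writer that passes a stale `read_version` gets a `RuntimeError` rather than silently overwriting another user's change, which is the kind of guarantee that makes multi-user access safe. Because old files are kept, earlier versions of the table also remain readable.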
The primary advantage of the Data Lakehouse is that we only need to maintain one system. Before the Lakehouse, many organisations found themselves maintaining a Data Lake for some workloads and a relational Data Warehouse for others. Often, they would copy data between the two and then need to keep the copies in sync. With a Data Lakehouse, there is only one repository of data to manage, which is a huge win.
Lakehouse architecture also gives us the best of the Data Lake and the best of the Data Warehouse. Data Lakes are simple, cost-effective, scalable and reliable because they are based on commodity object storage such as AWS S3. Relational Data Warehouses, on the other hand, are simple and very familiar to data professionals, whilst bringing additional benefits such as the ability to use SQL and implement governance and controls. Combining both approaches in one location means that fewer compromises need to be made.
Though the Data Lakehouse is a powerful idea, there are a few disadvantages and limitations to consider at this early stage of its evolution.
Firstly, administering a Data Lakehouse is more complex than managing a relational Data Warehouse. Administrators of the Data Lakehouse will find themselves needing to carry out activities which the warehouse would have done for them, such as manually organising the data, administering compute clusters, working with cloud object stores, and implementing additional controls and governance. All in all, it is a less out-of-the-box experience. And though the aim is to treat the Lakehouse as we would a relational Data Warehouse, some of the underlying abstractions do leak through, which adds to the administrative burden.
Secondly, Data Lakehouse architecture is new. This means there can be some rough edges in the user and administration experience as we try to treat our Data Lake as a Data Warehouse.
Finally, Data Lakehouses do not always perform as well as Data Warehouses. A Data Lakehouse based on Databricks and Spark, for instance, can outperform a Data Warehouse on very large data sets. However, the overhead of co-ordinating the query can mean that, for interactive queries against smaller datasets, a standard relational query against a Data Warehouse will perform better.
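The shape of this trade-off can be sketched with a very simple cost model: a fixed co-ordination cost per query plus a scan time that grows with the number of rows. The function names and every number below are hypothetical and chosen only for illustration; they are not measurements of Databricks, Spark or any real warehouse.

```python
def query_seconds(rows, startup_seconds, rows_per_second):
    """Estimated latency: fixed co-ordination cost plus time to scan the rows."""
    return startup_seconds + rows / rows_per_second


# Made-up parameters: the lakehouse engine pays more to start a query but
# scans faster once running; the warehouse starts quickly but scans slower.
LAKEHOUSE = {"startup_seconds": 5.0, "rows_per_second": 100_000_000}
WAREHOUSE = {"startup_seconds": 0.1, "rows_per_second": 5_000_000}


def crossover_rows(a, b):
    """Row count at which systems `a` and `b` have equal estimated latency."""
    startup_gap = a["startup_seconds"] - b["startup_seconds"]
    per_row_gap = 1 / b["rows_per_second"] - 1 / a["rows_per_second"]
    return startup_gap / per_row_gap


for rows in (1_000_000, 1_000_000_000):
    lake = query_seconds(rows, **LAKEHOUSE)
    warehouse = query_seconds(rows, **WAREHOUSE)
    print(f"{rows:>13,} rows: lakehouse {lake:.2f}s, warehouse {warehouse:.2f}s")

print(f"break-even at roughly {crossover_rows(LAKEHOUSE, WAREHOUSE):,.0f} rows")
```

Under these assumed numbers, the warehouse is quicker for the small query and the lakehouse is quicker for the large one. The actual break-even point depends entirely on the systems and workloads involved.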
That said, all of these areas should continue to improve as the Data Lakehouse pattern and its supporting technology mature.