In this lesson we will:
- Explain the Data Lakehouse pattern;
- Explain how it unifies the best of Data Lakes and Data Warehouses.
It is clear that Data Warehouses and Data Lakes both have relative advantages. Data Warehouses are great for relational data and for making it accessible to Data Analysts and Business Users for business intelligence workloads, whilst Data Lakes are great for storing raw, unstructured data that enables ad-hoc and exploratory analysis.
With this in mind, many data teams choose to implement both technologies. Data initially lands in a Data Lake, from where it is ingested into the Data Warehouse to serve business intelligence users in a more familiar environment.
Whilst this does offer the best of both worlds, it does have some significant downsides:
- Businesses need to implement and manage both types of technology, which has significant cost implications;
- Two copies of the data need to be stored and kept in sync, and the copies can drift apart and give conflicting answers.
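The second downside can be made concrete with a toy sketch. Everything here is a hypothetical stand-in: a Python list plays the Data Lake, a second list plays the Data Warehouse, and `load_warehouse` is an illustrative copy step rather than any real tool. It only shows how a copy that is loaded once can disagree with its source after a late correction.

```python
# Hypothetical stand-in: a list of raw records plays the Data Lake.
lake = [
    {"order_id": 1, "amount": "19.99", "status": "paid"},
    {"order_id": 2, "amount": "5.00", "status": "pending"},
]


def load_warehouse(lake_records):
    """Copy lake records into a typed 'warehouse' table (a toy load step)."""
    return [
        {
            "order_id": r["order_id"],
            "amount": float(r["amount"]),
            "status": r["status"],
        }
        for r in lake_records
    ]


def paid_total(records):
    """Sum the amounts of all records marked as paid."""
    return sum(float(r["amount"]) for r in records if r["status"] == "paid")


# The warehouse is loaded once, as a separate copy of the data.
warehouse = load_warehouse(lake)

# A late correction lands in the lake after the warehouse load has run.
lake[1]["status"] = "paid"

lake_total = paid_total(lake)            # includes the corrected order
warehouse_total = paid_total(warehouse)  # still reflects the stale copy
```

Until the load step runs again, the two totals disagree, which is the kind of conflict the two-platform approach has to manage.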
Noticing this pattern, many vendors are attempting to unify the two technologies, such that Data Lakes gain more of the benefits and features of the Data Warehouse and vice versa. The point where the two converge is often referred to as a Data Lakehouse architecture. This is one of the key themes playing out in the industry today and an area of huge innovation.
The advantages of combining the two include:
- Only one technology platform needs to be deployed and managed, reducing cost and overhead and improving time to value;
- This shared data source can be used by Data Analysts and Data Scientists who can adopt common tooling and patterns;
- All data can be stored in the more modern lake structure, avoiding much of the painful ETL development, whilst also giving data analysts the organised and structured interface they need;
- There is only one copy of the data, giving us a single source of truth.
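The idea of one copy of the data serving both audiences can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any vendor implements a lakehouse: a temporary directory of JSON Lines files stands in for the lake, and `ORDERS_SCHEMA`, `write_raw`, `read_raw` and `read_table` are hypothetical names invented for the example. Real platforms add far more on top, such as transactions, metadata management and query optimisation.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema for an "orders" table: column name -> type to cast to.
ORDERS_SCHEMA = {"order_id": int, "amount": float, "status": str}


def write_raw(lake_dir, name, records):
    """Write raw records to the lake as a JSON Lines file."""
    path = Path(lake_dir) / name
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_raw(lake_dir):
    """Exploratory access: yield every record exactly as stored."""
    for path in sorted(Path(lake_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


def read_table(lake_dir, schema):
    """Analyst access: the same files, projected and cast to a fixed schema."""
    for record in read_raw(lake_dir):
        yield {col: cast(record[col]) for col, cast in schema.items()}


with tempfile.TemporaryDirectory() as lake_dir:
    write_raw(lake_dir, "orders-001.jsonl", [
        {"order_id": "1", "amount": "19.99", "status": "paid", "utm": "email"},
        {"order_id": "2", "amount": "5.00", "status": "pending", "utm": "ads"},
    ])
    raw_rows = list(read_raw(lake_dir))                     # raw, untyped view
    table_rows = list(read_table(lake_dir, ORDERS_SCHEMA))  # structured view

# A typical analyst-style aggregate over the structured view.
paid_revenue = sum(r["amount"] for r in table_rows if r["status"] == "paid")
```

Both views are derived from the same files at read time, so there is no second copy to load or reconcile. The raw view keeps extra fields such as `utm` for exploration, while the structured view exposes only the typed columns an analyst expects.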
Databricks have led the way in integrating the Data Warehouse and Data Lake, but they are by no means the only vendor on this journey. Snowflake, for instance, is moving away from marketing itself as a Data Warehouse towards being more of a Data Platform that incorporates many lake-type features, whilst Azure Synapse Analytics is bringing this to life in an Azure-native solution.
Data Lakehouse can sound like a marketing buzzword, but there is real substance behind it, and it is likely to be one of the key transformation stories for enterprise data strategy going forward.