In this lesson we will:
- Learn about open data formats such as Parquet;
- Learn about open table formats such as Apache Hudi, Apache Iceberg and Delta Lake;
- Learn the value that these formats bring to our Data Lakehouse.
With a relational Data Warehouse, an administrator will not necessarily have visibility into how the system stores files behind the scenes.
In a Data Lakehouse architecture, we have to give this more consideration and potentially make decisions about how data is stored. This includes the format in which data is physically stored, and how we represent abstractions such as tables.
As administrators of the Lakehouse, we can choose which formats to use, and the right choice often depends on our particular usage patterns. For instance, some formats favour compression, whilst others favour query performance.
Any format of data can be stored in a Data Lake or Data Lakehouse. For instance, in an average enterprise Data Lake it is common to find CSV and JSON files for structured and semi-structured data, documents such as Word and PDF files, and even binary files such as audio.
That said, there are advantages to standardising on specialised file formats, and many organisations have gone down this path.
One of the most common formats is Apache Parquet. This is an open source format which stores tabular data in a column-oriented layout, optimised for analytical queries that read a few columns across many rows. Because the values within a column share a type and often repeat, Parquet files also compress well.
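To make the column-oriented idea concrete, here is a minimal sketch using only the Python standard library. It is not how Parquet is actually implemented; the sample data, the JSON-based "layouts" and the helper names (`to_row_layout`, `to_column_layout`, `dictionary_encode`) are all assumptions made for illustration. It shows two things: an analytical query only needs to scan the column it uses, and a column of repeated values can be stored compactly with a dictionary encoding, a technique Parquet also uses.

```python
import json

# Hypothetical sample data: 1,000 orders with three columns.
rows = [
    {"order_id": i, "country": ["UK", "FR", "DE"][i % 3], "amount": 10.0 + (i % 5)}
    for i in range(1000)
]


def to_row_layout(records):
    """Row-oriented: each record is stored whole, one after another."""
    return "\n".join(json.dumps(r) for r in records).encode()


def to_column_layout(records):
    """Column-oriented: all values of one column are stored together."""
    names = records[0].keys()
    return {n: json.dumps([r[n] for r in records]).encode() for n in names}


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup list."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


row_bytes = to_row_layout(rows)
col_chunks = to_column_layout(rows)

# An analytical query such as SUM(amount) needs only one column chunk,
# whereas the row layout forces a scan of every record in full.
total = sum(json.loads(col_chunks["amount"]))
bytes_scanned_columnar = len(col_chunks["amount"])
bytes_scanned_row = len(row_bytes)

# Values within a column share a type and often repeat, so they encode compactly.
dictionary, codes = dictionary_encode([r["country"] for r in rows])

print(f"SUM(amount) = {total}")
print(f"bytes scanned: columnar={bytes_scanned_columnar}, row={bytes_scanned_row}")
print(f"country dictionary={dictionary}, first codes={codes[:6]}")
```

In practice we would not hand-roll any of this: a library that reads Parquet can select individual columns and applies its encodings and compression for us.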
Another popular choice is the Apache ORC format. Compared with Parquet, ORC is.....
When we are implementing a Data Lakehouse, we need to represent the concept of a table, and be able to do things like insert, update, delete, roll back or time travel.
Fortunately, this layer has also been opened up with the advent of open table formats, of which there are a number of competing options.
Delta Lake - This is an open source table format originally developed by Databricks and now hosted by the Linux Foundation.
Apache Iceberg - This is arguably the leading table format outside of the Databricks ecosystem. Many third-party engines and platforms support it.
Apache Hudi - This is another open source table format, originally developed at Uber.
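The core idea these table formats share can be sketched in a few lines: data files are never modified in place, and a log of snapshots records which files make up the table at each version. The toy class below is a hypothetical illustration of that idea only. `ToyTable` and its methods are invented names, not the API of Delta Lake, Iceberg or Hudi, and real formats keep this metadata in files alongside the data rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """A toy table: immutable data files plus a log of snapshots."""
    files: dict = field(default_factory=dict)  # file name -> tuple of rows
    snapshots: list = field(default_factory=lambda: [()])  # version -> file names

    def _write_file(self, rows):
        name = f"data-{len(self.files):04d}"
        self.files[name] = tuple(rows)
        return name

    def _commit(self, file_names):
        self.snapshots.append(tuple(file_names))
        return len(self.snapshots) - 1  # the new version number

    def insert(self, rows):
        """Add a new data file on top of the current snapshot."""
        return self._commit(self.snapshots[-1] + (self._write_file(rows),))

    def update(self, predicate, change):
        """Copy-on-write update: rewrite only files containing matching rows."""
        new_files = []
        for name in self.snapshots[-1]:
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                rewritten = [change(r) if predicate(r) else r for r in rows]
                new_files.append(self._write_file(rewritten))
            else:
                new_files.append(name)
        return self._commit(new_files)

    def delete(self, predicate):
        """Copy-on-write delete: rewrite only files containing matching rows."""
        new_files = []
        for name in self.snapshots[-1]:
            rows = self.files[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                new_files.append(name)  # file untouched, reuse it
            elif kept:
                new_files.append(self._write_file(kept))
        return self._commit(new_files)

    def read(self, version=None):
        """Read the latest snapshot, or time travel to an earlier version."""
        v = len(self.snapshots) - 1 if version is None else version
        return [row for name in self.snapshots[v] for row in self.files[name]]

    def rollback(self, version):
        """Roll back by committing a new snapshot that points at old files."""
        return self._commit(self.snapshots[version])


table = ToyTable()
v1 = table.insert([{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}])
v2 = table.insert([{"id": 3, "city": "Bath"}])
v3 = table.update(lambda r: r["id"] == 2, lambda r: {**r, "city": "Hull"})
v4 = table.delete(lambda r: r["id"] == 1)

print(table.read())            # latest state
print(table.read(version=v1))  # time travel to the first insert
v5 = table.rollback(v2)
print(table.read())            # state as of v2 again
```

Because old files and snapshots are kept, time travel and rollback come almost for free. The trade-off in this sketch is that every update or delete rewrites whole files, which is one reason real table formats offer additional strategies and housekeeping operations.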
The fact that these file formats and table formats are open source is a very appealing feature of the Data Lakehouse. They substantially reduce lock-in if we wish to change components of the stack, and make our data interoperable with a wide range of tools.