In this lesson we will:
- Introduce the Data Warehouse;
- Explain recent evolutions in Data Warehousing compared with more traditional approaches;
- Explain the distinction between OLTP and OLAP databases.
The most common repository that Data Engineers target with their transformed data is the Data Warehouse.
Data Warehouses are large, centralised databases which are typically used to combine data from multiple line of business applications and data sources into a single organised location for the purpose of analytics.
For instance, a business might opt to combine all of the data from sales, marketing, finance and HR functions into a centralised Data Warehouse for a joined-up view of what is happening across the entire organisation.
Once the data has been ingested and organised within the Data Warehouse, it is then served to business stakeholders for use in reports, dashboards, and interactive analytics. Most often, this is delivered through third-party business intelligence tools for building reports and dashboards.
Data Warehouses are designed and optimised to ingest and store large volumes of data, and to be able to serve the resulting business intelligence workloads with high performance. They are very much optimised for these analytical workloads in contrast to databases such as MySQL and Oracle which are designed for transactional workloads.
Data Warehouses have been in use for decades, and this is a very mature field with established tools and practices. Almost every large organisation will have at least one Data Warehouse, and vendors such as Oracle, Microsoft and IBM have historically owned most of this market with on-premises solutions.
In recent years however, Data Warehousing has experienced a rapid evolution and uptick in innovation as part of the Modern Data Stack.
This process started with Data Warehousing solutions provided by the hyperscale cloud vendors, with services including Amazon Redshift and Google BigQuery. These products brought the benefits of cloud such as scale, elasticity and consumption-based pricing into the Data Warehousing realm, making Data Warehousing suitable for more modern requirements such as managing machine generated data, machine learning use cases or real time requirements.
Outside of the major cloud providers, we also saw Snowflake emerge as a cloud agnostic Data Warehousing technology built specifically to take advantage of the Cloud. Snowflake is the most rapidly growing cloud native Data Warehouse and has seen rapid adoption in industry due to its power and its ease of use and deployment.
Modern Data Warehouses have made Data Warehousing much more economically viable for businesses and led to their adoption by smaller businesses and even startups.
Historically, a Data Warehouse would have required a significant up-front investment in terms of hardware and software licenses. This would be a capital purchase and would need to be sized based on predicted future workloads. More expensive still would be the teams of engineers with the skills to operate the Data Warehouse and to implement all of the supporting ingestion and transformation work required.
With Modern Data Warehousing, there is typically no up-front cost. Data Warehouses can be created in minutes through a Web GUI and are often procured with a credit card. The cost will then only rise based on consumption, for instance the amount of data stored, the number of queries served, or the amount of compute power required.
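The consumption-based model can be sketched as a simple calculation. The rates below are invented purely for illustration and do not reflect any real vendor's pricing:

```python
# Hypothetical consumption-based pricing: cost scales with what is actually
# used, not with up-front capacity. Both rates below are assumed values.
STORAGE_PER_TB_MONTH = 23.0   # $ per terabyte stored per month (assumed)
COMPUTE_PER_HOUR = 2.0        # $ per hour of warehouse compute (assumed)

def monthly_cost(tb_stored: float, compute_hours: float) -> float:
    """Monthly bill as a function of consumption alone."""
    return tb_stored * STORAGE_PER_TB_MONTH + compute_hours * COMPUTE_PER_HOUR

# 2 TB stored and 100 compute hours in a month:
print(monthly_cost(2.0, 100.0))  # 246.0
```

The key contrast with the traditional capital purchase is that a quiet month simply costs less; there is no fixed capacity to amortise.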
The operational overhead of managing the Data Warehouse is also much reduced, with the cloud provider taking care of initial deployment, upgrades, backups, optimisations and other such requirements as part of the monthly fee.
Though there are of course still costs associated with the platform, they are significantly lower from a total cost of ownership perspective than traditional approaches.
One of the distinctions that is often used when discussing database technology and data warehouses is OLTP vs OLAP. It is therefore worth a quick diversion into what these terms mean.
As we know, there are hundreds if not thousands of database products in the market, all of which are suited for different use cases and types of data. The market can however almost be split down the middle between OLTP and OLAP databases.
OLTP stands for Online Transaction Processing. Broadly, databases in this category are designed for transactional systems where we have a large number of concurrent users inserting, updating and deleting data in small batches.
Some of the implications of designing for OLTP include:
- They are optimised for fast, low-latency operations on individual records;
- They are optimised for working with single row inserts, updates and deletes as opposed to large batch updates;
- They are designed for multiple users using the database in parallel, incorporating transactions.
Common OLTP databases include Oracle, SQL Server, MySQL and Postgres.
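The OLTP pattern above can be sketched with SQLite from the Python standard library. The `orders` table and its contents are invented for the example; the point is the shape of the workload: many small, single-row writes, each wrapped in its own transaction:

```python
import sqlite3

# An in-memory database standing in for a transactional system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Each write is a single row in its own transaction, as an application
# serving one user request at a time would issue it. The connection's
# context manager commits on success and rolls back on error.
with conn:
    conn.execute(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 19.99)
    )

with conn:
    conn.execute(
        "UPDATE orders SET amount = ? WHERE customer = ?", (24.99, "alice")
    )

row = conn.execute(
    "SELECT amount FROM orders WHERE customer = ?", ("alice",)
).fetchone()
print(row[0])  # 24.99
```

A production OLTP database would be handling thousands of such small transactions concurrently, which is exactly what its locking and concurrency machinery is built for.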
OLAP stands for Online Analytical Processing. These databases will be designed for a situation where we have a smaller number of people executing queries over very large datasets.
Some of the implications of designing for OLAP include:
- They are optimised for working with large batches of data rather than small - both inserts and queries;
- They do not necessarily have support for transactions and multi-user concurrency control on inserts;
- The data is stored in a way that makes analytics faster and easier;
- They are not optimised for writes. The vast majority of interactions with an OLAP database are reads and queries, so read performance receives the most emphasis.
Common OLAP databases include Snowflake, Apache Druid and Apache Pinot.
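The point about data being "stored in a way that makes analytics faster" can be illustrated with a toy comparison of row-oriented and column-oriented layouts. The table and column names here are invented for the example:

```python
# Row-oriented layout: one record per row, as an OLTP database stores data.
# To sum revenue we must visit every field of every record.
rows = [
    {"region": "EU", "product": "widget", "revenue": 120.0},
    {"region": "US", "product": "widget", "revenue": 200.0},
    {"region": "EU", "product": "gadget", "revenue": 80.0},
]

# Column-oriented layout: one list per column, as OLAP engines store data.
# An aggregate over one column scans a single contiguous sequence and never
# touches the other columns at all.
columns = {
    "region":  ["EU", "US", "EU"],
    "product": ["widget", "widget", "gadget"],
    "revenue": [120.0, 200.0, 80.0],
}

# The analytical query SELECT SUM(revenue), expressed against each layout:
total_row_oriented = sum(r["revenue"] for r in rows)
total_columnar = sum(columns["revenue"])
print(total_row_oriented, total_columnar)  # 400.0 400.0
```

Both layouts give the same answer, but on billions of rows the columnar scan reads a fraction of the data and compresses far better, which is why OLAP engines favour it.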
Software Engineers will usually find themselves working with OLTP databases, but in the Data Engineering world, most of our interactions will inevitably be with OLAP databases, which we will use to host data for analytical use cases.