In this lesson we will:
- Provide a more in-depth overview of Snowflake's "Cloud Native" architecture;
- Introduce key concepts such as Snowflake's separation of compute and storage and the virtual warehouse.
As discussed in the previous lesson, the Snowflake architecture is an evolution of the traditional data warehouse, modernised to take advantage of the properties of the cloud.
Some of the notable features of this architecture are described below.
As discussed, Snowflake is a fully Software-as-a-Service (SaaS) platform. This means there is nothing to deploy and manage on your own servers or in your own cloud accounts.
Though Snowflake is a fully SaaS solution, behind the scenes it runs on the infrastructure of one of the major cloud providers: AWS, Azure or GCP.
Though this hosting is somewhat abstracted away from the Snowflake user or administrator, it means that Snowflake can take advantage of the inherent properties of the cloud, such as its massive scalability, elasticity and consumption-based pricing model.
Separation of Storage and Compute is one of the most commonly cited benefits of Snowflake and one of the key things to understand about its architecture.
In a traditional database such as Oracle or MySQL, storage and compute were historically tied together. If we wanted more storage, we would add another server process which both stored data locally and could be used to serve queries. In other words, storage and compute were deployed together in a tightly integrated way. This was due to the low latency requirements of transactional systems and slow or unreliable network connectivity with remote storage.
In recent years, compute and network performance have improved to the extent that we can now realistically store data across a network from the compute processes, decoupling the two. With data warehousing workloads such as Snowflake's, we can also tolerate some latency when transferring results. This makes separating compute from storage viable.
What this separation means in practice is that storage and compute can scale independently. For instance, we could have an extremely large dataset and a very small processing tier, or a very small dataset and a very large processing tier, depending on the needs of the business. This processing tier can then grow and shrink multiple times throughout the day, independently of the storage. This allows businesses to right-size their usage in real time, and save significant costs in doing so.
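As a minimal sketch of what this looks like in practice, the following Snowflake DDL resizes and suspends a virtual warehouse (the warehouse name is a hypothetical example); the underlying storage is untouched throughout:

```sql
-- Warehouse name is a hypothetical example.
-- Scale the processing tier up for a heavy workload...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ...then scale it back down, or suspend it entirely,
-- once the workload completes. Storage is unaffected.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';
ALTER WAREHOUSE analytics_wh SUSPEND;
```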
Traditionally, there would be one centralised Data Warehouse process or cluster which all users connected to in order to issue their queries and analytics. If more capacity was needed, this single cluster would need to have additional capacity added.
With Snowflake, we can instead create multiple virtual warehouses which all point at the single shared dataset. An organisation could have tens, hundreds or even thousands of virtual warehouses which can all work independently to meet the needs of users.
A common deployment model is to give different business units such as marketing, finance or sales their own virtual warehouse. These can then be sized to suit the needs of that particular business unit, and scaled up and down in line with its own usage patterns. For example, finance may need more horsepower during month-end processing, whilst operations need additional capacity to deal with end-of-day shipping processes. All of this can be scaled down or turned off when not in use.
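This deployment model can be sketched with standard Snowflake DDL (the warehouse names and sizes here are hypothetical examples):

```sql
-- One warehouse per business unit, each sized and suspended
-- independently. Names and sizes are hypothetical examples.
CREATE WAREHOUSE finance_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60       -- suspend after 60 seconds idle
  AUTO_RESUME    = TRUE;    -- wake automatically on the next query

CREATE WAREHOUSE operations_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Month end: give finance more horsepower,
-- without affecting operations or the shared storage.
ALTER WAREHOUSE finance_wh SET WAREHOUSE_SIZE = 'LARGE';
```

Both warehouses query the same shared dataset, but their compute resources, sizing and billing are entirely independent.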
The ability to start and stop multiple virtual warehouses, right-size them for particular user communities, and have them billed on a consumption basis is a very powerful and flexible model which wasn't viable with earlier generations of Data Warehouse technology.
This arrangement, in which data is separated into its own tier that is then shared by multiple virtual warehouses, is referred to as a "Shared Disk" architecture.
At the same time, the practice of running multiple independent virtual warehouses is sometimes referred to as a "Shared Nothing" architecture.
This hybrid of shared disk and shared nothing is a key part of what makes Snowflake's architecture especially powerful and cost effective to operate.
The Snowflake Architecture is best thought of in tiers like so:
At the lowest tier, we have the storage tier. This is provided by the cloud provider, through a service such as AWS S3 or Azure Blob Storage. As a Snowflake user or administrator, you won't interact with these services directly, though it's worth having an appreciation of how they work and the reliability guarantees they provide. As a Snowflake administrator, you may also load data from or extract data to these cloud services.
At the second tier, we have a cluster of compute nodes which are responsible for executing the queries, ingesting data, carrying out data manipulations and other data processing. As discussed, this compute is structured as one or more virtual warehouses which are configured in line with real world usage patterns.
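As a sketch of how this compute tier is used in practice, each session chooses which virtual warehouse will execute its queries (the warehouse and table names below are hypothetical examples):

```sql
-- Warehouse and table names are hypothetical examples.
USE WAREHOUSE analytics_wh;   -- route this session's queries to one warehouse

SELECT COUNT(*)
FROM sales.public.orders;     -- executed on analytics_wh, reading the
                              -- shared storage tier beneath it
```

Another session could run the same query through a different warehouse against the same data, which is what allows each user community to have compute sized for its own workload.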
At the top tier, we have a set of cloud services which manage concerns such as authentication, security and access control, and metadata. These services are internal to Snowflake, but we will touch upon them in our day-to-day usage and administration of Snowflake.