In this lesson we will:
- Learn about batch loading data into snowflake;
- Learn about staging areas.
In order to make use of Snowflake, we of course need to populate it with data.
Though there are many ways to do this, the most common requirement in the data warehousing world is to regularly perform batch uploads of data files which have been exported from other systems and applications.
For instance, a typicaly requirement might be that every hour or every day, an extract of orders could be taken from some eCommerce system and uploaded into the data warehouse.
Though businesses are looking to populate their warehouse faster, in smaller batches or using real-time streaming, batch data integration is still the most common approach and something a Data Engineer needs to learn about, understand and likely implement.
When we batch load data into Snowflake, we typically follow a two step process:
Firstly, we need to upload the data from your source to a "staging area". This involves taking your source data, which would typically be stored in a format such as CSV, TSV, Parquet or JSON, and uploading it into the staging area ready for processing;
Secondly, we perform a COPY INTO operation to take data from that staging area and copy it into the Snowflake table in a clean relational format.
At the COPY INTO phase, the data is then managed by Snowflake, so the source files can safely be deleted from the staging area.
Note that this needs to be a continuous process. Ordinarily, we will load all of the historical data during the first migration, then need to setup a process to continually ingest new data as it is created and pushed into the staging areas.
With regards to the staging area, there are a few options open to us.
Snowflake has two types of staging areas - External or Internal.
External staging areas are storage areas outside of control of Snowflake. Typically they will exist on cloud storage such as AWS S3 or Azure Blob Store, but will be created directly within the cloud infrastructure. This could for instance be a data lake which you have already created and use for other purposes.
Internal staging areas are managed directly by Snowflake. They will likewise be stored on the cloud providers object storage, but will be created and maintained by Snowflake.
In many cases, your data will already be stored on cloud object storage. At which point, we may as well just do our load directly from those external stage rather than go via an internal stage.
If this isn't the case, the preference is usually to uploiad via an Internal staging area. This way, we will benefit from compression, the ability to query the data in a slightly more performant manner, and benefit from improved audit features. It also means that we could potentially avoid interacting with the cloud provider and leave this to be brokered through Snowflake, keeping everything simple.
If you go down the route of using internal stages, there are then 3 different subtypes:
- User Stage is an area available only to the currently logged in user. Data stored in this stage is private to the logged in user and can not be seen by any other users;
- Table Stages are associated with a specific table. Every table has 1 and only 1 table stage which is created at table creation time;
- Named Stages are created separately and are not tied to any particular table or view.
The decision of where to stage from is sometimes quite subtle.
- User Stage: It makes sense to use this where an individual data engineer or analyst is loading their data, and has no need to share with anybody else within the organisation. This could be the case during development, whereby an individual data engineer is working on the ETL process before moving to a shared development, test, or production environment;
- Table Stage: When data has to be shared between multiple Snowflake users, and has a one to one mapping with the table, this is the natural option to reach for when staging data. Some teams will also have a workflow whereby they develop within their user stage, then promote the change to the table stage as the code matures;
- Named Stage: Unlike the user and table stages, we can also specify additional details on the named stage which describe, for instance, the format of the data at load time instead of COPY INTO stage. Named stages are also specific database objects, so can be created, dropped, and have role based permissions applied. These stages tend to be more complex to work with, but more explicit for production pipelines.
Choosing the best staging option is not always an easy decision:
- Use The Correct Staging Area Type: Snowflake has different types of staging areas, including internal and external stages, and user, table, and named stages. It's important to choose the best option for your requirements. Usually, this would be an access-controlled named stage for production code.
- Organise Staging Into Folders: Many people use staging areas a buckets of files, without realising that you can PUT into a directory in order to better to organise your data.
- Purging: When we stage data, there is a PURGE=TRUE option which removes the data after it has been successfully staged.
- Optimise Size Of Staged Files: There is a sweet spot for the size of the file for maximum throughput and performance. Consensus is that your data should be split into 50-200mb files, though the optimal point for you will vary on things like the compute available and structure of the data. Avoiding breaking your data into thousands of very small files, or uploading single very large files is likely to be sub-optimal however.
- Use A Separate Warehouse For Data Staging: It is worthwhile creating a separate data warehouse for your loading process in order to avoid hurting performance for people querying your database. This additional warehouse can be suspended immediately after the data load in order to reduce costs.
Following these best practices, your Snowflake imports will be better organised and more maintainable.