In this lesson we will:
- Explain how and why Cloud platforms such as AWS, Azure and Google Cloud Platform enable and fit into the Modern Data Stack;
- Describe the key characteristics of cloud and how it supports data and analytics initiatives;
- Consider whether a Modern Data Platform could be built on-premises.
Historically, businesses owned and ran their own compute infrastructure, servers and data centres in order to power their applications and manage their data.
Over the last 10-15 years, however, companies have increasingly outsourced their infrastructure to external suppliers such as Amazon Web Services and Microsoft Azure, renting compute capacity on a pay-as-you-go basis.
As evidenced by the extraordinary growth of the cloud computing providers, there are significant benefits to this for businesses. Using the cloud means there is less non-differentiating technology to manage, less capital expenditure, and the ability to bring new software and customer experiences to market much faster than before.
Whilst cloud started with compute capacity in the form of virtual servers, over time there has been a natural evolution into the data realm, with cloud providers offering capabilities such as transactional databases, data warehouses, NoSQL solutions, object stores and other services which a business can use to build its data and analytics infrastructure.
Though this has been a big step for businesses, who are sometimes nervous about putting their data into the hands of a third party, technically it is a very good fit, as modern data and analytics requirements can be difficult to meet in a private data centre.
In this section we will introduce the key features and characteristics of cloud, with a particular lens on what they mean for Data and Analytics and the Modern Data Platform.
The key feature of the cloud is that it is managed by a third party such as Amazon, Google or Microsoft. This means that businesses need not worry about data centres, servers and physical networking, and can instead concentrate on higher level activities that are differentiating for their business.
As well as servers, companies that use the cloud can procure higher level "Managed Services" such as databases, web servers, message queues and of course Data and Analytics tooling. This is all available almost instantly and with significantly less planning and operational overhead.
From a data and analytics perspective, this means that businesses need to spend less time on activities such as installations, upgrades, backups and performance optimisation, as much of this work is handled for them by the cloud provider.
Where before there might have been a months-long lead time for new infrastructure or capacity, businesses can instead request more compute capacity from their cloud provider at the click of a button and have it ready to use in minutes.
For data and analytics, this means that we can scale up computing power and storage capacity at the click of a button, compared with the old world of waiting for months for new infrastructure.
Nowadays, businesses need to store and analyse more data than ever before. This includes more line of business application data and interactions with customers through digital channels, but also "big data" sources such as datasets captured from logs, clickstreams, IoT devices etc.
Cloud computing gives us an underlying platform which can scale almost infinitely to support these workloads. This includes both the ability to physically store these large volumes of data for subsequent analysis, and access to on-demand compute power for activities such as data transformations or analytical processing.
Working in the cloud allows businesses to scale up and down capacity as required. For instance, a retailer could add capacity over Black Friday or the main Holiday period to power their systems.
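As a rough illustration of scaling capacity with demand, the sketch below sizes a fleet of servers from the expected load. The capacity per server, the minimum fleet size and the load figures are all assumptions made up for this example, not numbers from any real retailer or cloud provider.

```python
import math

# Assumed figures, for illustration only.
REQUESTS_PER_NODE = 1_000  # requests per minute one server can handle
MIN_NODES = 2              # small floor kept running at all times


def nodes_needed(requests_per_minute: int) -> int:
    """Return how many servers are needed to serve the given load."""
    return max(MIN_NODES, math.ceil(requests_per_minute / REQUESTS_PER_NODE))


# Hypothetical load profile for a retailer (requests per minute).
load = {"normal week": 3_500, "Black Friday": 42_000, "January": 1_200}

plan = {period: nodes_needed(rpm) for period, rpm in load.items()}
for period, nodes in plan.items():
    print(f"{period}: {nodes} servers")
```

In a private data centre the retailer would have to own enough servers for the Black Friday peak all year round; in the cloud the extra servers are only rented while they are needed.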
A common approach in modern Data and Analytics platforms is to periodically and temporarily create a cluster of compute to run batch jobs overnight or before the business day starts, then turn this cluster off. This temporary access to large amounts of compute power, paid for by the minute, is a very cost effective model.
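To see why a temporary cluster can be cost effective, here is a back-of-the-envelope comparison. The price, cluster size and batch window are assumed values chosen for round numbers; real cloud pricing varies by provider, region and machine type.

```python
# Hypothetical figures for illustration only.
PRICE_PER_NODE_HOUR = 0.50  # assumed price in dollars
NODES = 20                  # assumed cluster size
BATCH_HOURS_PER_DAY = 3     # assumed length of the nightly batch window
DAYS_PER_MONTH = 30


def monthly_cost(nodes: int, hours_per_day: float,
                 price_per_node_hour: float,
                 days: int = DAYS_PER_MONTH) -> float:
    """Cost of running `nodes` machines for `hours_per_day` every day."""
    return nodes * hours_per_day * price_per_node_hour * days


always_on = monthly_cost(NODES, 24, PRICE_PER_NODE_HOUR)
ephemeral = monthly_cost(NODES, BATCH_HOURS_PER_DAY, PRICE_PER_NODE_HOUR)

print(f"Always-on cluster: ${always_on:,.2f} per month")
print(f"Ephemeral cluster: ${ephemeral:,.2f} per month")
print(f"Saving:            {1 - ephemeral / always_on:.1%}")
```

Under these assumptions the cluster that only exists for the nightly batch window costs one eighth as much as the same cluster left running around the clock.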
Historically, businesses needed to make large capital investments in infrastructure to support their applications. In the cloud, there are effectively no up-front capital investments; these are instead replaced by ongoing usage-based fees.
In the cloud, all of the infrastructure described above becomes software defined and programmable.
As well as requesting infrastructure "at the click of a button", it can also be requested through an API by passing some configuration parameters.
By defining this infrastructure in code, new environments can be created which are perfect replicas of each other.
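The idea of defining infrastructure in code can be sketched as follows. The `EnvironmentSpec` class, its fields and the JSON request format are hypothetical and invented for this example; they do not correspond to any particular cloud provider's API. In practice this role is usually played by the provider's own SDK or an infrastructure-as-code tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical description of a data platform environment."""
    warehouse_size: str
    storage_gb: int
    cluster_nodes: int


def render_request(name: str, spec: EnvironmentSpec) -> str:
    """Build the JSON body we might send to a provisioning API (made-up format)."""
    return json.dumps({"environment": name, **asdict(spec)}, sort_keys=True)


# One definition, kept in version control, describes every environment.
spec = EnvironmentSpec(warehouse_size="medium", storage_gb=500, cluster_nodes=4)

dev = json.loads(render_request("dev", spec))
prod = json.loads(render_request("prod", spec))

# Apart from the name, the two environments are configured identically.
dev_config = {k: v for k, v in dev.items() if k != "environment"}
prod_config = {k: v for k, v in prod.items() if k != "environment"}
print(dev_config == prod_config)
```

Because both requests are generated from the same definition, the environments cannot drift apart through manual configuration.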
The Modern Data Stack is very much enabled by the properties of Cloud computing.
All of a sudden, we do not need to worry about procuring servers; we can spin them up in minutes, with effectively unlimited scale and elasticity.
A lot of these properties then flow upstream to the data tools that we interact with.
Though it is of course possible to create on-premises Data Lakes and Data Warehouses, everything becomes that bit harder.
Much of the technology advocated as part of the Modern Data Stack is cloud based. Tools such as Fivetran, a popular ETL tool, and Snowflake, a Cloud Data Platform, are provided only through a Software as a Service model. This means that if you are limited purely to your own on-premises data centre, these platforms are not available to you.
We would then be operating in an environment without the scale and elasticity that the cloud provides.
In short, if you wish to build a Modern Data Platform and gain all of the benefits of that approach, you likely do need to embrace cloud. In a large enterprise, this might have some pre-requisites for you such as getting comfortable with storing data in the Cloud,