In this lesson we will:
- Introduce the field of Data Engineering and the role of Data Engineers;
- Explain how Data Engineering adds value for businesses;
- Describe the end-to-end process of Data Engineering;
- Explain the relationship between Data Engineers, Data Analysts and Data Scientists;
- Discuss the skills gap that exists within the Data Engineering field.
Today's businesses collect incredible amounts of data from sources including websites and mobile applications, internal line-of-business applications, connected devices, and external suppliers and partners. This data is being created faster and in greater volumes than ever before.
At the same time, businesses have increasingly ambitious demands to make use of their data. Their requirements include intelligent analytics, sophisticated reports, real-time dashboards, machine learning models and other operational and product initiatives powered by data.
The combination of increasingly complex source data and increasingly demanding business requirements creates many challenges for the data teams tasked with turning that data into business value.
Data Engineers are the people within an engineering function with the task of bringing these two sides together - putting the processes and automation in place to continually turn business data into actionable analytics and insights.
Data teams have worked on this challenge for many years, using ETL tooling to move data between locations and transform it to meet their reporting requirements. The problem is that the traditional approaches and the tools they use are starting to reach their limits in the face of the challenges described above.
In response, Data Engineering is emerging as a practice that applies techniques resembling Software Engineering to the problem, whilst making use of modern tools to create robust, tested and scalable data pipelines capable of the reliability, performance and security required today.
The key challenge of Data Engineering is to transform source data into data that is ready to be analysed or used by the business. This has three key components:
- Extracting data from the myriad of sources around the business, either pushing or pulling data on a continuous basis as it is generated at the source;
- Transforming data into joined up, cleaned, usable formats and structures, perhaps augmenting it with analytics during this process;
- Loading data into locations where it can subsequently be used. Often this will be a data warehouse or data lake which will service reports, dashboards or ad-hoc analysis for users.
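The three components above can be sketched as a minimal pipeline. This is an illustration only, not a recommended design: the sample data, the function names (`extract`, `transform`, `load`, `run_pipeline`) and the `orders` table are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (illustrative data only).
RAW_ORDERS_CSV = """order_id,customer,amount
1, Alice ,10.50
2,bob,
3,Carol,7.25
"""


def extract(raw_text):
    """Extract: read rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: tidy customer names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a table that reports can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load repeatable if the pipeline is re-run.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(raw_text, conn):
    """Run the three steps in order."""
    load(transform(extract(raw_text)), conn)


# SQLite in memory is an assumption made to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
run_pipeline(RAW_ORDERS_CSV, conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Keeping each step as a separate function makes it possible to test and replace the steps independently, for instance by swapping the CSV source for an API call without touching the transformation logic.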
In some businesses, the responsibilities of the Data Engineer will also extend into areas such as designing the correct reports and dashboards. However, often Data Analysts and Data Scientists will work on these “last mile” activities.
In addition to the ETL work, Data Engineers usually need to design, build and run the platform that they are working in. This requires the ability to configure their tools of choice, and knowledge of the data warehouses or data lakes that they are targeting. The Data Engineering role is therefore inherently more technical than that of the previous generation of ETL engineers.
Previously, data teams had to make use of heavyweight, proprietary, expensive, data-centre-centric tools that increasingly fail to meet the requirements of today's businesses and data teams.
Today's Data Engineer, on the other hand, makes use of the Modern Data Stack, including cloud-hosted infrastructure, SaaS tooling, and modern databases and data lakes. These modern tools deliver a step change in scalability, reliability, performance, security and cost effectiveness.
Though the Modern Data Stack is an order of magnitude more powerful than the traditional approach, it does make things challenging for Data Engineers: not only are new practices emerging, but Data Engineers are also working in a totally new technology environment.
Data Engineering is not a one-time activity. Data Engineers need to put in place pipelines that continually process data as it is created, and keep these pipelines running reliably and accurately so that they deliver up-to-date data.
With data pipelines running in production, Data Engineers face the challenge of deploying changes and enhancements without introducing extensive downtime, and need to implement practices such as pre-release testing to ensure that bugs and regressions are not introduced.
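As a rough illustration of pre-release testing, a transformation can be covered by unit tests that run before each deployment. The `normalise_customer` function and its test cases are hypothetical, and Python's standard `unittest` module is used here only as one possible choice of test framework.

```python
import unittest


def normalise_customer(name):
    """Hypothetical transformation: collapse whitespace and title-case a name."""
    return " ".join(name.split()).title()


class NormaliseCustomerTest(unittest.TestCase):
    def test_trims_and_title_cases(self):
        self.assertEqual(normalise_customer("  alice   smith "), "Alice Smith")

    def test_blank_input_becomes_empty_string(self):
        self.assertEqual(normalise_customer("   "), "")


def run_tests():
    """Run the test case and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseCustomerTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A release process could call `run_tests()` and refuse to deploy when it returns `False`, so that a change which breaks the transformation is caught before it reaches production data.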
To aid with this, Data Engineering teams will also introduce monitoring and observability to ensure that errors are identified and high-quality data is continuously delivered.
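One simple form of monitoring is a check that runs after each batch and reports missing values or stale data. The sketch below is only an assumption about what such a check might look like: the `check_batch` function, the `loaded_at` field and the thresholds are hypothetical, and a production setup would typically send the results to an alerting tool instead of a log.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("pipeline.monitoring")


def check_batch(rows, required_fields, max_age, now=None):
    """Return a list of human-readable problems found in a batch of rows.

    Each row is a dict and is assumed to carry a timezone-aware
    'loaded_at' timestamp (a hypothetical field for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    problems = []

    if not rows:
        problems.append("batch is empty")
    else:
        # Completeness: count rows where a required field is missing.
        for field in required_fields:
            missing = sum(1 for row in rows if row.get(field) is None)
            if missing:
                problems.append(f"{missing} row(s) missing '{field}'")

        # Freshness: the newest row must be more recent than max_age.
        newest = max(row["loaded_at"] for row in rows)
        if now - newest > max_age:
            problems.append("data is stale")

    for problem in problems:
        logger.warning("data quality problem: %s", problem)
    return problems
```

For example, calling `check_batch(rows, ["order_id", "amount"], timedelta(hours=1))` on a batch whose newest row was loaded two hours ago would include "data is stale" in the returned list.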
Many of these practices are only recent arrivals in the world of Data Engineering and demonstrate how it is increasingly adopting the best practices of software engineers.
Data Analysts and Data Scientists will often explain that at least 50% of their time is spent on data cleansing and preparation. They will often find, for instance, that they need to manually request raw data extracts from source systems, that the data they get is messy, out of date or has gaps in it, and that it may be delivered in sub-optimal formats such as Excel spreadsheets. Dealing with this is not a good use of their time.
Introducing a Data Engineering capability means that the data plumbing activities are handled and productionised, freeing up the other data professionals to apply their niche skills to actual analysis and modelling, using up-to-date data that is continually delivered through robust data pipelines.
With this in mind, businesses should look at Data Engineering as a foundational capability which has to be put into place before Data Analysts and Data Scientists can be effective.
Though Data Engineering is increasingly a critical function for businesses, it is relatively new and poorly understood. Many organisations do not even understand that they need this capability, let alone what skills to hire for or how to attract the talent.
At the same time, it has not existed as a job role in the wider market for very long, so there is not yet a large community of identifiable engineers to hire into this position.
Fortunately, this is now changing and Data Engineering is growing in prominence as a job role. Software Developers have begun to move in this direction, and people from data roles with skills in SQL and Python are also cross-training to become Data Engineers. As businesses realise the importance of the role, there are more well-paying employment opportunities for them. This is a nascent but growing field.
We also believe that there is a lack of resources to support someone motivated to cross-train into Data Engineering. Compared with Software Engineering, there is less content and there are fewer training courses for learning about Data Engineering. Ensemble is our attempt to improve this situation!