In this lesson we will:
- Introduce the concept of streaming data;
- Compare streaming data with the traditional batch data approach to data management.
Streaming data is data that is generated continuously and in high volumes. Common examples of this class of data include stock price updates, clickstream events, logs, and data from IoT devices.
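To make this concrete, here is a sketch of what a single clickstream event might look like. The field names and values are purely illustrative, not from any particular system:

```python
import json
from datetime import datetime, timezone

# A hypothetical clickstream event: small, self-describing, and produced
# the moment the user acts. Field names here are invented for illustration.
event = {
    "event_id": "c7f1a2",
    "event_type": "page_view",
    "user_id": 42,
    "url": "https://example.com/products/123",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Events are typically serialised (e.g. as JSON) before being published
# to a messaging system such as a log or queue.
payload = json.dumps(event)
print(payload)
```

Each event is tiny on its own; the engineering challenge comes from the rate at which millions of them arrive.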
Streaming data is often, but not always, machine generated and usually consists of a high volume of relatively small events that are published immediately after they have been generated at the source. This can be contrasted with more traditional and transactional business data, which is generated slowly as humans interact with applications or websites.
Businesses increasingly have the requirement to process and respond to their streaming data in real time, as there is often some commercial or operational benefit to doing so. For instance, fraud detection, algorithmic trading and preventive maintenance are all examples where processing and responding to streaming data in real time and in an intelligent way has obvious business benefit.
Most data platforms deployed within businesses today are batch-based. This means that data is collected, ingested and processed in batches of multiple records, typically on some schedule such as hourly or daily cycles.
Though batch-based data processing is simple, its main downside is the delay it introduces before data is processed and gets into the hands of business users. Any real-time, intelligent response is simply not viable when batches of data incur multiple delays as they move through systems.
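The contrast between the two styles can be sketched in a few lines. This toy example computes a running total of order values both ways; the function names and data are invented for illustration:

```python
# Hypothetical order records standing in for data arriving at a platform.
orders = [{"order_id": i, "value": 10 * i} for i in range(1, 6)]

def batch_total(records):
    """Batch style: wait until the whole batch has accumulated
    (e.g. on an hourly schedule), then process it in one pass."""
    return sum(r["value"] for r in records)

def stream_totals(records):
    """Streaming style: process each event as it arrives, emitting an
    updated result immediately instead of waiting for a full batch."""
    total = 0
    for r in records:  # in a real system this would be an unbounded source
        total += r["value"]
        yield total    # a fresh result is available after every event

print(batch_total(orders))          # 150 - one result, after the whole batch
print(list(stream_totals(orders)))  # [10, 30, 60, 100, 150] - a result per event
```

Both approaches arrive at the same final answer; the difference is that the streaming version makes an up-to-date result available after every event, rather than only at the end of the batch cycle.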
Though the delays and limitations of batch-based processing are acceptable in many situations, businesses are becoming more ambitious and aspiring to real-time solutions. They are therefore looking to evolve from batch-based data processing towards streaming data solutions.
This can however be a complex journey, with many systems needing to be updated and a lot of code needing to be modernised from front to back in order to support the volumes of data and the latency requirements of streaming.
This journey from batch data processing to streaming architectures is likely to be a key theme for data teams in the coming years, and Data Engineers with experience in this field will likely be in high demand.
Though streaming data and stream processing can be very valuable to businesses, processing and analysing it is challenging.
Firstly, streaming data implies very high volumes of events, which can be difficult to store and process. Usually, these events will have to be processed with low latency, even when there is a sudden spike in data volumes. And of course, when we are triggering actions in the real world, accuracy and reliability are essential.
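One common technique for taming high event volumes is windowed aggregation: rather than storing and forwarding every raw event, the system keeps a small aggregate per fixed time window. A minimal sketch of a tumbling-window count, with invented names and data, might look like this:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size

def window_counts(event_times):
    """Group events into fixed 60-second windows by event timestamp and
    count the events in each, so downstream consumers see a compact
    summary rather than every individual event."""
    counts = defaultdict(int)
    for ts in event_times:                       # ts: event time in epoch seconds
        window_start = ts - (ts % WINDOW_SECONDS)
        counts[window_start] += 1
    return dict(counts)

# Ten events spread across three one-minute windows.
timestamps = [0, 5, 30, 59, 60, 61, 90, 110, 119, 121]
print(window_counts(timestamps))  # {0: 4, 60: 5, 120: 1}
```

Production stream processors implement far more sophisticated versions of this idea (handling out-of-order events, state recovery, and so on), which is part of why the new skills mentioned below are needed.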
New techniques and development patterns are required in order to meet these requirements. Putting them into practice demands new skills which Data Engineers who are accustomed to working in the batch data world do not necessarily have today.
More details on the challenges associated with streaming data are covered in a later lesson.