In this lesson we will:
- Discuss techniques for sourcing events data from pre-existing or legacy systems;
- Introduce specific tools and frameworks for doing this.
In order to move towards Event Streaming and Event Based Architecture, the first thing we need to do is extract data from a source application or database and publish it onto an event stream.
Even this first step can be difficult, because many applications do not “publish” events as data is entered by users. Instead, these applications interact directly with a proprietary relational database specific to their application where the data then remains trapped.
This is equally true for line of business applications, SaaS platforms such as Salesforce, and even web analytics platforms such as Google Analytics. The data is simply hard to extract as real time streams.
For this reason, we often need to implement the technology to source real time event streams ourselves. Sometimes we can do this by making changes to our source application, and sometimes we have to interrogate the source application through its APIs, or by interacting with its database and database extracts.
This sounds like a lot of work and even a barrier to moving towards streaming architectures. However, in our experience it is no more onerous than extracting data using traditional ETL tools and approaches, whilst it moves organisations to a much more modern data architecture.
So our task as data engineers is to take these unfriendly sources, and turn them into streams.
Fortunately, there are a few technologies and platforms which will help us to achieve this so we aren't starting from scratch each time:
Kafka Connect is a framework for connecting Kafka to external systems such as databases and other message queues, for both reading and writing.
Kafka Connect could for instance connect to a database via JDBC, send all of the historical records, then poll the database every minute for new updates. This is ultimately database polling and is not real time, but is likely to be much faster than periodic batch exchanges and at least starts us moving towards events and continuous data integration.
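A polling source of this kind might be configured as follows, using Confluent's JDBC source connector. The connection details, table name, and column names here are illustrative; the `timestamp+incrementing` mode tells the connector to detect new and updated rows using a timestamp column plus an auto-incrementing id:

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db-host:5432/sales",
    "connection.user": "connect_user",
    "connection.password": "secret",
    "table.whitelist": "orders",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "poll.interval.ms": "60000",
    "topic.prefix": "sales-"
  }
}
```

Posting this JSON to the Kafka Connect REST API would create a connector that snapshots the `orders` table and then polls it every 60 seconds, publishing changes to the `sales-orders` topic.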
As well as databases, Kafka Connect supports things like traditional message queues and connecting to applications such as Salesforce or ServiceNow, giving you a consistent mechanism for integrating data.
Debezium allows us to extract events from popular databases such as MySQL. Rather than connecting at a JDBC level as with Kafka Connect, Debezium works at a lower level and listens to the database's transaction log in order to turn changes into a real time stream of events. This is more performant, lighter on the database, and more real time. This technique is often referred to as Change Data Capture.
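Debezium connectors are themselves deployed through Kafka Connect, so the configuration looks similar. A sketch of a MySQL connector might look like this (hostnames, credentials, and database names are illustrative, and exact property names vary between Debezium versions):

```json
{
  "name": "inventory-mysql-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.include.list": "inventory",
    "topic.prefix": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

Once running, every insert, update, and delete in the `inventory` database is read from the binlog and published as an event, with no polling and no changes to the source application.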
Singer describes itself as the open-source standard for writing scripts that move data. It allows you to write simple scripts which listen to data sources and then turn the updates into JSON-based events which can be published as they are found.
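The Singer specification is simply newline-delimited JSON messages on stdout, which a downstream "target" consumes. A minimal "tap" in Python might look like this; the `users` stream and its rows are invented for illustration:

```python
import json
import sys


def emit(message):
    # Singer messages are newline-delimited JSON written to stdout,
    # where a downstream target (e.g. target-csv) consumes them.
    sys.stdout.write(json.dumps(message) + "\n")
    sys.stdout.flush()


def run_tap(rows):
    # Describe the shape of the stream once, up front.
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": "string"},
            },
        },
    })
    # Emit one RECORD message per row found in the source.
    for row in rows:
        emit({"type": "RECORD", "stream": "users", "record": row})
    # A STATE message lets the tap resume incrementally on its next run.
    emit({"type": "STATE", "value": {"users": {"last_id": rows[-1]["id"]}}})


if __name__ == "__main__":
    run_tap([{"id": 1, "email": "a@example.com"},
             {"id": 2, "email": "b@example.com"}])
```

Running the tap and piping its output into a target gives you a simple, composable way to move updates as they are discovered.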
Most data engineering has historically involved batch extract, transform and load. Tools such as StreamSets and Striim are an evolution of this idea, but are more modern, cloud native / SaaS tools which can be used to deliver both batch and continuous data from sources to endpoints reliably. These tools are based on visual workflows and use approaches which will be more familiar to the traditional ETL engineer.
Many people think that the readiness of their applications is a barrier to moving towards streaming architectures, but as this article shows, there are many routes to extracting data from data sources which are not initially geared up for it.
By deploying these technologies, you'll be able to integrate your data faster, build real time analytics and implement event processing scenarios where systems automatically respond to captured events. Though streaming data integration is a low level concern, it can have enormous impact on your customer experience and business efficiency.
Fortunately, there are two key open source technologies we have found to be highly successful in this regard, which any team looking to evolve towards event based architecture should investigate.
The first is Debezium. Debezium allows you to stream changes from databases such as MySQL or Postgres using a technique called "Change Data Capture" or CDC. As records are inserted, updated and deleted, events can be created and pushed to a destination such as Kafka where they can be processed in a decoupled manner. This completely avoids changes to the legacy application and should not affect the performance or stability of the application. It's therefore a very quick win in moving towards event orientation.
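To make this concrete, a simplified Debezium change event for an updated row carries the state before and after the change, the source table, and an operation code (`c` for create, `u` for update, `d` for delete). The row and table below are invented for illustration:

```json
{
  "before": {"id": 42, "status": "PENDING"},
  "after": {"id": 42, "status": "SHIPPED"},
  "source": {"connector": "mysql", "db": "sales", "table": "orders"},
  "op": "u",
  "ts_ms": 1700000000000
}
```

Downstream consumers can react to these events, for example triggering a notification whenever an order's status changes, without the legacy application knowing they exist.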
The second is Fluentd. Fluentd enables us to listen to log sources such as log files, Syslog or application runtimes, and push events to a destination such as Kafka as they are created. Again, we can hopefully evolve towards events with minimal if any application changes in this manner.
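A Fluentd pipeline of this kind is declared in its configuration file. The sketch below tails an application log file and forwards each JSON line to Kafka using the `kafka2` output from the fluent-plugin-kafka plugin; the file paths, tag, and topic name are illustrative:

```
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.events
  <parse>
    @type json
  </parse>
</source>

<match app.events>
  @type kafka2
  brokers kafka:9092
  default_topic app-events
  <format>
    @type json
  </format>
</match>
```

The `pos_file` records how far through the log Fluentd has read, so events are not re-sent when the agent restarts.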
Though this problem can be solved in various ways, these open source technologies have proven to be performant and extremely stable under high transaction volumes. We think these technologies will be a key part of the event driven journey within the enterprise.