In this lesson we will:
- Introduce Apache Kafka;
- Cover some of its common use cases;
- Highlight its key features and properties;
- Discuss the open source and commercial distributions and managed service options;
- Refer to some of the competitors to Kafka in the messaging and event streaming space.
Apache Kafka is a data streaming platform which allows you to publish, distribute and consume streams of data with high performance, scalability and reliability.
An example use case for Kafka might be distributing the latest prices of stocks on a stock exchange to thousands of mobile application clients in real time. Kafka provides the data exchange and messaging capability for use cases like this where speed, scalability and reliability are essential.
For those familiar, Kafka can be thought of as an evolution of traditional message brokers such as Tibco Rendezvous, IBM MQ or RabbitMQ. However, Kafka is much more scalable and performant than previous generations of messaging technology, and has some important architectural evolutions which we will cover in this course.
Kafka is very widely deployed and is by far the leading platform for streaming data integration in use in industry today. It is therefore an important area of knowledge for aspiring Data Engineers looking to build modern real-time data platforms.
Though Kafka can be used for many diverse data requirements, some of the most common use cases include:
- Real Time Streaming - e.g. Streaming real time data from server processes to web or mobile client applications;
- SOA or Microservice Integration - e.g. Integrating services which need to exchange data or actions to complete a business process;
- Data Exchange - e.g. Communicating data between systems or organisations;
- ETL - e.g. Taking data from a source to a destination data repository such as from your application into your Data Lake or Data Warehouse;
- Real Time BI & Analytics - e.g. Calculating metrics and analytics that allow you to monitor the state of your business in real time.
Data integration scenarios like these occur across all industries. For instance, ecommerce, financial services, IoT and online advertising are all likely to have business requirements in this sphere. And where they already have a solution in place, it is highly likely that it has been built on Kafka.
Though there are many ways to exchange data between systems, Kafka offers a number of features and characteristics which together make it very compelling as a platform:
- Real Time - Kafka can distribute messages from producer to consumers in real time, immediately and continuously as data is created at the source;
- Performance - Kafka can move messages from source to destination with very low latency, typically in the order of milliseconds;
- Scalability - Kafka can accept thousands of connections from publishers and consumers and handle all of the connections in a very resource-efficient manner. It can also be scaled by adding more servers into a cluster where extra scale is required;
- Reliability - Kafka can be configured so that acknowledged messages are not lost, and it supports exactly-once processing semantics through idempotent producers and transactions. It will also handle a range of failure scenarios, such as consumers and producers that temporarily crash and need to recover from where they left off;
- Ordering - Kafka introduces semantics whereby we can ensure that events are received and processed in order. Specifically, events written to the same partition of a topic are read back in the order they were written. This can be an important property to rely upon, and it can make the applications that you develop simpler;
- Audit - Kafka can provide a central point for auditing and recording data events. It can be configured to retain data for a set period, giving it similar properties to a database and a useful log of what actually happened;
- Security - Kafka introduces additional security controls such as the ability to encrypt data in transit.
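The ordering and recovery properties in the list above can be illustrated with a toy, in-memory model. This is only a sketch and not Kafka's client API: `ToyTopic`, `ToyConsumer` and the ticker names are hypothetical, and `zlib.crc32` is an arbitrary stable hash chosen for the example rather than Kafka's own partitioner. It assumes the general idea that a topic is split into partitions, that messages with the same key land in the same partition, and that a consumer tracks its position as an offset so it can resume after a restart.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic (not a Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # The same key always maps to the same partition, which preserves per-key order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

    def read(self, partition, offset, max_records):
        # Reading never removes records; each consumer tracks its own position.
        return self.partitions[partition][offset:offset + max_records]


class ToyConsumer:
    """Reads from a ToyTopic and remembers the next offset per partition."""

    def __init__(self, topic, committed=None):
        self.topic = topic
        self.committed = dict(committed or {})  # partition -> next offset to read

    def poll(self, partition, max_records=10):
        offset = self.committed.get(partition, 0)
        records = self.topic.read(partition, offset, max_records)
        self.committed[partition] = offset + len(records)
        return records


topic = ToyTopic(num_partitions=3)
for price in (101, 102, 103, 104):
    acme_partition = topic.publish("ACME", price)
    topic.publish("GLOBEX", price + 500)

# A first consumer reads two records, then "crashes".
first = ToyConsumer(topic)
batch_one = first.poll(acme_partition, max_records=2)
saved_offsets = first.committed  # pretend this was stored durably before the crash

# A restarted consumer resumes from the saved offset rather than from the start.
restarted = ToyConsumer(topic, committed=saved_offsets)
batch_two = restarted.poll(acme_partition)

acme_prices = [value for key, value in batch_one + batch_two if key == "ACME"]
print(acme_prices)
```

In this model the `ACME` prices come back in the order they were published, with no gaps or repeats across the restart, because the second consumer starts from the offset the first one had reached. A real deployment would use a Kafka client library and store offsets in the cluster rather than in a Python dictionary.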
Though it is possible to implement data exchange and distribution without a platform like Kafka, it would be very complex and require significant engineering to achieve a similar level of capability using a more bespoke approach.
Apache Kafka is an open source platform and free to download, modify and deploy.
There are however commercially supported and managed distributions and services, such as those from Confluent, who are the main commercial and technical supporters behind the development of Kafka.
Of particular note is Confluent Cloud, which is a fully managed Software-as-a-Service platform for Kafka. Using Confluent Cloud avoids the need to configure and run your own Kafka clusters; instead, you simply use Kafka as a service with a consumption-based billing model.
Though Kafka has by far the most market share, it is not the only tool in this space.
As mentioned, there was an entire generation of messaging technology, including products such as Tibco Rendezvous, IBM MQ and RabbitMQ, which were widely deployed in this space. Though these are all still under development, they are based on older architectures and came from an era of on-premise enterprise deployment. They are also not as scalable and performant as Kafka.
After Kafka, we have seen new products emerge with more modern architectures, which build on the lessons from Kafka and compete with it in various ways. Redpanda is a Kafka-compatible streaming engine which claims to outperform Kafka. Apache Pulsar is also a modern streaming engine which has a similar value proposition but adds some innovations such as serverless functions, load balancing and geo-replication.