‣ Apache Kafka® is a distributed streaming platform. A streaming platform has three key capabilities
1. Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
2. Store streams of records in a fault-tolerant durable way.
3. Process streams of records
‣ Kafka is generally used for two broad classes of applications
1. Building real-time streaming data pipelines that reliably get data
between systems or applications
2. Building real-time streaming applications that transform or react to the streams of data
‣ A few key concepts of Kafka
– Kafka is run as a cluster on one or more servers that can span multiple datacenters.
– The Kafka cluster stores streams of records in categories called topics.
– Each record consists of a key, a value, and a timestamp.
‣ Kafka has four core APIs
– The Producer API allows an application to publish a stream of records to one or more Kafka topics.
– The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
– The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
– The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
‣ In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol.
INTRODUCTION: TOPICS & LOGS
‣ A topic is a category or feed name to which records are published.
‣ Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
‣ For each topic, the Kafka cluster maintains a partitioned log.
‣ Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log.
‣ The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition
‣ The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period.
‣ Kafka’s performance is effectively constant with respect to data size so storing data for a long time is not a problem.
‣ Each consumer’s position in the log, its offset, is controlled by that consumer. Normally a consumer will advance its offset linearly as it reads records, but since the position is controlled by the consumer, it can consume records in any order it likes.
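‣ A minimal sketch of the offset model described above, in Python. The `Record` and `PartitionLog` names are hypothetical; this is a toy in-memory model, not Kafka's storage format or client API, and it ignores retention:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(frozen=True)
class Record:
    """A record as described above: a key, a value, and a timestamp."""
    key: Optional[str]
    value: Any
    timestamp: float


@dataclass
class PartitionLog:
    """Toy append-only log for a single partition (illustrative only)."""
    records: List[Record] = field(default_factory=list)

    def append(self, record: Record) -> int:
        """Append a record and return the offset it was assigned."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 1) -> List[Record]:
        """Return up to max_records starting at offset; the log is unchanged."""
        return self.records[offset:offset + max_records]


log = PartitionLog()
offsets = [log.append(Record("user-1", v, float(i)))
           for i, v in enumerate(["a", "b", "c"])]

# The consumer owns its position: it can read forward, skip, or rewind.
position = 0
first = log.read(position)[0].value   # read at offset 0
position = 2
third = log.read(position)[0].value   # skip ahead to offset 2
position = 0
replayed = log.read(position, 3)      # rewind and re-read everything
```

Reading never removes anything from the log, which is why rewinding and replaying are possible in this model.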
‣ The partitions in the log serve several purposes
– First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data.
– Second, they act as the unit of parallelism—more on that in a bit.
‣ The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.
‣ Each partition is replicated across a configurable number of servers for fault tolerance.
‣ Each partition has one server which acts as the “leader” and zero or more
servers which act as “followers”.
‣ The leader handles all read and write requests for the partition while the
followers passively replicate the leader.
‣ If the leader fails, one of the followers will automatically become the new leader.
‣ Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
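‣ A rough sketch of leader failover, assuming a hypothetical three-broker cluster with two replicas per partition. The `elect_leaders` function and the "first live replica wins" rule are simplifications chosen for illustration, not Kafka's actual election procedure:

```python
from typing import Dict, List, Set


def elect_leaders(replicas: Dict[int, List[str]], alive: Set[str]) -> Dict[int, str]:
    """For each partition, pick the first live broker in its replica list."""
    leaders: Dict[int, str] = {}
    for partition, brokers in replicas.items():
        live = [b for b in brokers if b in alive]
        if live:  # a partition with no live replica has no leader
            leaders[partition] = live[0]
    return leaders


# Hypothetical layout: the first broker in each list is the preferred
# leader, so leadership is spread evenly over the three brokers.
replicas = {0: ["b1", "b2"], 1: ["b2", "b3"], 2: ["b3", "b1"]}

healthy = elect_leaders(replicas, alive={"b1", "b2", "b3"})
after_b1_fails = elect_leaders(replicas, alive={"b2", "b3"})
```

With all brokers up, each broker leads one partition and follows another. When `b1` fails, its follower `b2` takes over partition 0 while the other partitions are unaffected.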
‣ Kafka MirrorMaker provides geo-replication support for your clusters.
‣ With MirrorMaker, messages are replicated across multiple datacenters or cloud regions.
‣ You can use this in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to your users or to support data locality requirements.
‣ Producers publish data to the topics
‣ The producer is responsible for choosing which record to assign to which partition within the topic.
‣ This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record).
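‣ The two partitioning strategies above can be sketched as follows. `make_partitioner` is a hypothetical helper, and `zlib.crc32` is only a stand-in for a stable hash, not the hash function Kafka's clients use:

```python
import itertools
import zlib
from typing import Callable, Iterator, Optional


def make_partitioner(num_partitions: int) -> Callable[[Optional[bytes]], int]:
    """Return a function that maps a record key to a partition index."""
    round_robin: Iterator[int] = itertools.cycle(range(num_partitions))

    def choose_partition(key: Optional[bytes]) -> int:
        if key is None:
            # No key: rotate through partitions simply to balance load.
            return next(round_robin)
        # Keyed: a stable hash sends equal keys to the same partition.
        return zlib.crc32(key) % num_partitions

    return choose_partition


choose = make_partitioner(num_partitions=3)
unkeyed = [choose(None) for _ in range(6)]
keyed = [choose(b"user-42") for _ in range(3)]
```

Records without a key cycle through partitions 0, 1, 2, while every record keyed `user-42` lands on one and the same partition.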
‣ Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
‣ Consumer instances can be in separate processes or on separate machines.
‣ If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
‣ If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
‣ If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
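‣ A simplified sketch of how partitions might be redistributed when group membership changes. `assign_partitions` is a hypothetical function that recomputes a round-robin assignment from scratch; Kafka's own assignment strategies are configurable and differ in detail:

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the group's members, round-robin."""
    if not members:
        return {}
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
before = assign_partitions(partitions, ["c1", "c2"])
after_join = assign_partitions(partitions, ["c1", "c2", "c3"])   # c3 joins
after_death = assign_partitions(partitions, ["c2", "c3"])        # c1 dies
```

In every assignment each partition belongs to exactly one member, so the group as a whole still covers the full topic after each change.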
‣ Kafka only provides a total order over records within a partition, not between different partitions in a topic.
‣ Per-partition ordering combined with the ability to partition data by key is sufficient for most applications.
‣ If you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
‣ Kafka gives the following guarantees
– Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
– A consumer instance sees records in the order they are stored in the log.
– For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
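‣ The N-1 failure tolerance above reduces to a one-line check, under the simplifying assumption that every replica holds all committed records. `records_survive` and the broker names are hypothetical:

```python
from typing import Iterable, Set


def records_survive(replica_brokers: Iterable[str], failed: Set[str]) -> bool:
    """Committed records survive while at least one replica's broker is up."""
    return any(broker not in failed for broker in replica_brokers)


replicas = ["b1", "b2", "b3"]                              # replication factor N = 3
two_down = records_survive(replicas, {"b1", "b2"})         # N-1 failures
all_down = records_survive(replicas, {"b1", "b2", "b3"})   # N failures
```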
INTRODUCTION: KAFKA AS A MESSAGING SYSTEM
‣ Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers.
‣ Each of these two models has a strength and a weakness.
‣ The strength of queuing is that it allows you to divide up the processing of
data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber—once one process reads the data it’s gone.
‣ Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
‣ The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
‣ The advantage of Kafka’s model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
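‣ A toy model of the delivery rule, showing how one mechanism covers both queuing and publish-subscribe. The `deliver` function and the group names are hypothetical, and picking an instance by record index stands in for Kafka's real partition-based assignment:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def deliver(records: List[str],
            groups: Dict[str, List[str]]) -> Dict[Tuple[str, str], List[str]]:
    """Deliver each record to exactly one instance in every group."""
    received: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for i, record in enumerate(records):
        for group, instances in groups.items():
            instance = instances[i % len(instances)]
            received[(group, instance)].append(record)
    return dict(received)


records = ["r0", "r1", "r2", "r3"]

# Queue-like: one group whose two instances share the work.
queue_like = deliver(records, {"billing": ["a", "b"]})

# Publish-subscribe-like: two single-instance groups each see every record.
pubsub_like = deliver(records, {"billing": ["a"], "audit": ["x"]})
```

The same topic can serve both patterns at once: adding instances to a group divides the work, and adding a group adds another full subscriber.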
‣ Kafka has stronger ordering guarantees than a traditional messaging system, too.
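‣ A traditional queue retains records in order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, the records are delivered to consumers asynchronously, so they may arrive out of order on different consumers, and ordering is effectively lost under parallel consumption.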
‣ Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group
INTRODUCTION: KAFKA AS A STORAGE SYSTEM
‣ Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages.
‣ Data written to Kafka is written to disk and replicated for fault-tolerance.
‣ Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
INTRODUCTION: KAFKA FOR STREAM PROCESSING
‣ It isn’t enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
‣ In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
‣ It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API.
‣ This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
‣ The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
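‣ To make the idea of a stateful stream processor concrete, here is a toy word count that reads from an input topic and writes running counts to an output topic. It uses plain Python lists as stand-ins for topics; the topic names and `run_word_count` are hypothetical, and nothing here is the Streams API itself:

```python
from collections import Counter, defaultdict
from typing import Dict, List

topics: Dict[str, List[str]] = defaultdict(list)   # topic name -> records


def run_word_count(input_topic: str, output_topic: str,
                   state: Counter, position: int) -> int:
    """Consume records from position onward, update state, emit new counts.

    Returns the next position to read from.
    """
    new_records = topics[input_topic][position:]
    for line in new_records:
        for word in line.lower().split():
            state[word] += 1
            topics[output_topic].append(f"{word}:{state[word]}")
    return position + len(new_records)


topics["lines"] += ["Kafka streams", "kafka topics"]
counts: Counter = Counter()
position = run_word_count("lines", "word-counts", counts, position=0)

topics["lines"].append("streams")                   # more input arrives later
position = run_word_count("lines", "word-counts", counts, position)
```

The processor keeps two pieces of state, its input position and the running counts, which is the kind of state a real stream processor must be able to recover after a failure.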
USE CASES: MESSAGING
‣ Kafka works well as a replacement for a more traditional message broker.
‣ Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc).
‣ In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
‣ In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
USE CASES: WEBSITE ACTIVITY TRACKING
‣ The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
‣ This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type.
‣ These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
‣ Activity tracking is often very high volume as many activity messages are generated for each user page view.
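‣ The "one topic per activity type" layout can be sketched as a simple routing step. The `activity.<type>` naming scheme, the event fields, and `publish_activity` are all assumptions made for illustration:

```python
from collections import defaultdict
from typing import Dict, List

topics: Dict[str, List[dict]] = defaultdict(list)   # topic name -> events


def publish_activity(event: dict) -> str:
    """Append the event to the topic for its activity type; return the topic."""
    topic = f"activity.{event['type']}"   # hypothetical naming scheme
    topics[topic].append(event)
    return topic


for event in [
    {"type": "page_view", "user": "u1", "url": "/home"},
    {"type": "search", "user": "u1", "query": "kafka"},
    {"type": "page_view", "user": "u2", "url": "/docs"},
]:
    publish_activity(event)

page_views = len(topics["activity.page_view"])
```

A subscriber interested only in searches would then read `activity.search` without having to filter out page views.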
USE CASES: METRICS
‣ Kafka is often used for operational monitoring data.
‣ This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
USE CASES: LOG AGGREGATION
‣ Many people use Kafka as a replacement for a log aggregation solution.
‣ Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing.
‣ Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages.
‣ This allows for lower-latency processing and easier support for multiple
data sources and distributed data consumption.
‣ In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
USE CASES: STREAM PROCESSING
‣ Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and
then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing.
‣ For example, a processing pipeline for recommending news articles might
crawl article content from RSS feeds and publish it to an “articles” topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics.
‣ Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
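‣ The middle stage of the news-article pipeline described above might look roughly like this. Lists stand in for topics; the "articles" topic comes from the example, while the "articles-clean" name, the normalization rule, and `clean_stage` are assumptions:

```python
from typing import Dict, List, Set

topics: Dict[str, List[str]] = {"articles": [], "articles-clean": []}


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical copies compare equal."""
    return " ".join(text.split()).lower()


def clean_stage(input_topic: str, output_topic: str) -> None:
    """Read raw articles, drop duplicates, publish the cleansed content."""
    seen: Set[str] = set()
    for article in topics[input_topic]:
        normalized = normalize(article)
        if normalized not in seen:
            seen.add(normalized)
            topics[output_topic].append(normalized)


topics["articles"] += ["Kafka  2.0 Released", "kafka 2.0 released", "Streams in Practice"]
clean_stage("articles", "articles-clean")
```

A recommendation stage would then subscribe to the cleansed topic rather than to the raw feed, giving the graph of data flows the example describes.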
USE CASES: EVENT SOURCING
‣ Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
‣ Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.
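‣ A minimal illustration of event sourcing: current state is never stored directly but is derived by replaying the ordered event log. The account example, the event shape, and the function names are hypothetical:

```python
from functools import reduce
from typing import List, Tuple

Event = Tuple[str, int]   # (event type, amount); a hypothetical event shape


def apply_event(balance: int, event: Event) -> int:
    """Apply one state change to the current balance."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events: List[Event]) -> int:
    """Rebuild state from scratch by folding over the event log."""
    return reduce(apply_event, events, 0)


event_log: List[Event] = [("deposited", 100), ("withdrew", 30), ("deposited", 5)]
current_balance = replay(event_log)          # state after all events
balance_after_two = replay(event_log[:2])    # state as of an earlier point
```

Because the full log is retained, state at any earlier point can be recovered by replaying a prefix of it.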
USE CASES: COMMIT LOG
‣ Kafka can serve as a kind of external commit-log for a distributed system.
‣ The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
‣ The log compaction feature in Kafka helps support this usage.
‣ In this usage Kafka is similar to the Apache BookKeeper project.
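‣ The core idea of log compaction, keeping only the most recent record for each key, can be sketched as below. This is a simplified model with a hypothetical `compact` function; it leaves out details of Kafka's real compaction such as delete markers and offset preservation:

```python
from typing import Any, Dict, List, Tuple

LogEntry = Tuple[str, Any]   # (key, value)


def compact(log: List[LogEntry]) -> List[LogEntry]:
    """Keep only the latest record for each key, preserving log order."""
    latest_offset: Dict[str, int] = {key: offset for offset, (key, _) in enumerate(log)}
    return [(key, value) for offset, (key, value) in enumerate(log)
            if latest_offset[key] == offset]


log: List[LogEntry] = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)]
compacted = compact(log)
```

A node that replays the compacted log ends up with the same final value per key as one that replays the full log, which is what makes compaction useful for re-syncing failed nodes.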