Learn Apache Kafka: Basic Concepts
This week I've been working with Kafka, and since it's new to me, I want to deep study the most important concepts and share these fundamental ideas with you. It's important to note that although I’m sharing key ideas here, to learn how to use Kafka, I recommend two things: first, create a test environment to understand and reinforce your knowledge, and second, deepen some concepts by reading books.
There are several aspects about Apache Kafka that I want to learn:
- Basic Concepts. This is about learning the essentials: what Kafka is, the idea behind it, the principles it follows, and what you need to know. This is the point I will address in this post.
- Architecture. As a platform engineer / SRE / DevOps / sysadmin, you need to learn the different systems to operate a production cluster: Kafka, Kafka Connect, Zookeeper. There are managed services like AWS MSK or Confluent if you want to consider them.
- Development. You need to know the APIs that allow publishing and consuming messages and understand how they are structured.
Remember that all companies run on data and continuously generate information. Therefore, knowing how to obtain, store, and process the necessary information is important if you want to make data-driven decisions.
Introduction to Kafka
Kafka is a publish/subscribe messaging system for distributing events (messages in Kafka) in real-time, distributed in what are known as topics. It was developed by LinkedIn as they lacked a component to manage the continuous flow of data (events), independently of classic entity storage. Later, Kafka was donated to the Apache Foundation.
Kafka is designed to scale horizontally and be reliable and fault-tolerant. Some popular use cases include communication between microservices, applications based on streaming data processing, and data pipelines.
A message is the data unit in Kafka. It resembles a row or record in a traditional database. A message is represented as a byte array, so it has no specific format or meaning for Kafka. Messages can have a key used to partition these data. Messages are written in batches to optimize cost, with the tradeoff between latency and performance defined by the batch size.
Although messages do not have a specific format for Kafka, it is recommended to define a schema for easy recognition. This can be done using JSON or XML. Apache AVRO is also recommended as it offers several advantages for these data.
A topic persistently stores data on disk and in a replicated manner. A topic can contain a lot or a little information and can retain it indefinitely or temporarily (e.g., with a deletion policy by time or size). Each topic contains a series of events that model what happens in our system in real-time: for example, a user adds a product to the cart. Topics serve to allow different systems to communicate with each other by consuming or publishing events.
Topics are divided into partitions, where messages are written. Partitions are used to improve system scalability, distributing them across multiple servers so messages can be written to several partitions simultaneously. Kafka guarantees message order within a partition but not in a topic with multiple partitions. Partitions can be replicated for redundancy, so if a broker fails, another can become the leader and receive writes.
Producers write to topics, and messages are distributed among partitions using the message's partition key. Consumers read messages and keep track of the last message they have read (offset). The offset is an integer that Kafka adds to each message in a partition.
A consumer group consists of one or more consumers that, at a minimum, consume from one partition. A consumer can consume one or more partitions, but a partition will never have multiple consumers and will store the last offset it processed.
Architecture
A Kafka server is known as a broker, and a set of brokers form a Kafka cluster, so if one server fails, our information is not lost, and our systems remain operational.
In many traditional systems, calculations are done in batch processes overnight, but sometimes this becomes insufficient, and you need real-time information. For this, you can have dashboards that consume information from Kafka topics.
In summary, Kafka is a distributed system that allows the management, storage, and processing of messages in real-time so that other systems can communicate through topics, either by consuming or publishing messages, allowing us to have real-time information about what is happening in our system.
Additionally, Kafka allows you to perform transformations, filtering, etc., and has several APIs like Kafka Connect and Kafka Streams.
Kafka Connect is a crucial component as it allows us to have pre-implemented connectors that integrate Kafka events with other systems through declarative configuration. They are usually divided into two categories:
- Sources: integrate with an existing tool and generate data events. For example, changes in entities that occur in a database.
- Sinks: integrate with third parties to emit information entering a topic. For example, writing information to persistent storage like Amazon S3.
Note that Kafka uses (and needs) Apache Zookeeper to store cluster metadata and coordinate the brokers.
Did you find this article useful? Subscribe to my newsletter and take the first step to launch IT products faster. You will receive exclusive tips that will bring you closer to your goals.