
Tim Berglund

VP Developer Relations

What Is Apache Kafka?

Apache Kafka is an event streaming platform used to collect, process, store, and integrate data at scale. It has numerous use cases including distributed logging, stream processing, data integration, and pub/sub messaging.

To make complete sense of what Kafka does, we'll unpack what an "event streaming platform" is and how it works. So before getting into Kafka's architecture or its core components, let's discuss what an event is. This will help explain how Kafka stores events, how to get events into and out of the system, and how to analyze event streams.

What Are Events?

An event is any type of action, incident, or change that's identified or recorded by software or applications. For example, a payment, a website click, or a temperature reading, along with a description of what happened.

In other words, an event is a combination of notification—the element of when-ness that can be used to trigger some other activity—and state. That state is usually fairly small, say less than a megabyte or so, and is normally represented in some structured format, say in JSON or an object serialized with Apache Avro™ or Protocol Buffers.
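To make that concrete, here is a minimal sketch of one event's state as a domain object. The class and its fields are hypothetical, chosen to match the smart-thermostat example in the video transcript below:

```java
// Hypothetical domain object for one event's state (a thermostat reading).
// Serialized as JSON, an instance might look like:
//   {"deviceId": "thermostat-42", "temperature": 21.5, "humidity": 40, "hvacOn": true}
public record ThermostatReading(String deviceId, double temperature, int humidity, boolean hvacOn) {}
```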

Kafka and Events – Key/Value Pairs

Kafka is based on the abstraction of a distributed commit log. By splitting a log into partitions, Kafka is able to scale systems out. Kafka models events as key/value pairs. Internally, keys and values are just sequences of bytes, but externally, in your programming language of choice, they are often structured objects represented in your language's type system. Kafka famously calls the translation between language types and internal bytes serialization and deserialization. The serialized format is usually JSON, JSON Schema, Avro, or Protobuf.
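As a sketch of what this looks like in practice, here is a minimal Java producer that serializes both key and value as strings. The broker address, topic name, and payload are illustrative, not part of the article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReadingsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Serializers translate language types into the byte sequences Kafka stores.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: the ID of an entity (a thermostat). Value: the event's state as JSON.
            producer.send(new ProducerRecord<>(
                "readings",        // hypothetical topic name
                "thermostat-42",   // key
                "{\"temperature\": 21.5, \"humidity\": 40, \"hvacOn\": true}"));
        }
    }
}
```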

Values are typically the serialized representation of an application domain object or some form of raw message input, like the output of a sensor.

Keys can also be complex domain objects but are often primitive types like strings or integers. The key part of a Kafka event is not necessarily a unique identifier for the event, like the primary key of a row in a relational database would be. It is more likely the identifier of some entity in the system, like a user, order, or a particular connected device.
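Continuing the hypothetical producer sketch above, many events can share a key, because the key names the entity the event is about rather than the event itself:

```java
// Continues the ReadingsProducer sketch above.
// Three readings, two devices: the key repeats because it identifies the
// thermostat (the entity), not the individual event.
static void sendReadings(KafkaProducer<String, String> producer) {
    producer.send(new ProducerRecord<>("readings", "thermostat-42", "{\"temperature\": 21.5}"));
    producer.send(new ProducerRecord<>("readings", "thermostat-42", "{\"temperature\": 21.7}"));
    producer.send(new ProducerRecord<>("readings", "thermostat-17", "{\"temperature\": 19.9}"));
}
```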

This may not sound so significant now, but we’ll see later on that keys are crucial for how Kafka deals with things like parallelization and data locality.
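As a preview, here is roughly what Kafka's default partitioner does with a non-null key (a simplified sketch, not the full implementation): it hashes the serialized key and takes the result modulo the partition count, so every event with the same key lands on the same partition.

```java
import org.apache.kafka.common.utils.Utils;

public class PartitionSketch {
    // Simplified version of the default placement logic for keyed records:
    // identical key bytes always map to the same partition, which is what
    // gives Kafka its per-key ordering and data locality.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
```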


Introduction

Hey, I'm Tim Berglund with Confluent. I want to tell you everything you need to know, and only what you need to know, about Apache Kafka. Apache Kafka is an event streaming platform used to collect, store, and process real-time data streams at scale. It has numerous use cases, including distributed logging, stream processing, and pub/sub messaging. Now, that sounds like something a committee of MBAs would write if they had their persona document in front of them and were just trying to nail the messaging, and that's not what we're going to do in this series. But all those words are true. That really is a nice one-sentence description of what Kafka is, but there's so much in there that we have to expand on for any of this to make sense. Even the phrase "event streaming platform," totally accurate as it is, requires a bit of a journey before the full significance of the words really lands on you. And these videos are that journey.

To begin with, I want to start with just the idea of an event. It's worth thinking about what an event is. Once we do that, we can talk about how Kafka stores events, how events get in and out, how to analyze them, all that stuff. But first, we have to agree on what an event is. Now, an event is just a thing that has happened. That's it. I know that sounds a little abstract, but it really is true, and it can be any kind of thing. My go-to example is a smart thermostat phoning home to report the current temperature, humidity, and status of the HVAC system in the house. That's an event. But an event can be other kinds of things. It can be a change in the status of some business process, say an invoice becoming past due. That's an event. It can be some kind of user interaction: somebody mousing over a certain link on a screen, or clicking on a thing. That's certainly an event. A microservice completes some unit of work and wants to put the record of that unit of work somewhere. That's an event. All these things are events. They're things that have happened, combined with a description of what happened.

So, an event is a combination of notification, the element of when-ness that can be used to trigger some other activity, and state. Now, the state of an event is usually fairly small, say less than a megabyte or so in concrete terms, and is normally represented in some structured format like JSON, JSON Schema, Avro, or Protocol Buffers. That is, the state is serialized in some, usually standard, format.

Now, Kafka has a little bit of a data model for an event. An event in Kafka is modeled as a key/value pair. Internally, when these things are actually stored, keys and values are just sequences of bytes; Kafka internally is loosely typed. But externally, whatever programming language you're using is probably not that loosely typed, and there's probably some kind of structure to your data. Going back and forth between the way that key/value pair, that event, is represented in your language's type system and the representation inside Kafka is what Kafka famously calls serialization and deserialization. (We came up with those words ourselves.) And again, that serialized format is usually JSON, or JSON Schema, Avro, Protobuf, something like that.
And the value, that serialized object, is usually the representation of an application domain object, or some form of raw message input, like the output of a sensor. That's why the structure of that thing is important: in your world, as you think about it, it probably has some structure.

Now, the key part. I said a message is a key/value pair, and keys in Kafka can be a fairly rich topic, so I'm going to summarize them very simply right now. They can be complex domain objects, serialized with all those same formats, but they are often just primitive types like strings or integers. So the key part of a Kafka event is probably not a unique identifier for the event. If you're thinking of a primary key in a database table, where the key uniquely identifies the row, the key in a Kafka message is not like that. It's more likely the identifier of some entity in the system, like a user, an order, or a particular connected device, say the ID of that smart thermostat. And this may not sound significant right now, but we will see later on that keys are crucial for how Kafka deals with things like parallelization and data locality.

So that's the very basics of Kafka: the one-sentence definition and the notion of events.
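To round out the serialization story the transcript describes, here is the other direction: a minimal consumer sketch that deserializes the stored bytes back into language types (strings here). Topic, group ID, and broker address are illustrative:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadingsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "readings-dashboard");      // hypothetical consumer group
        // Deserializers turn Kafka's stored bytes back into language types.
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("readings")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```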