Apache Kafka Fundamentals

Apache Kafka is a distributed streaming platform and message broker that enables building real-time event-driven applications at scale.

What is Apache Kafka?

At its core, Kafka is a distributed commit log: producers append records to it, and consumers read them back in order. In practice it behaves like a highly scalable, fault-tolerant messaging system, and a well-provisioned cluster can handle millions of messages per second.

Originally developed at LinkedIn, Kafka is now an open-source Apache project used by thousands of companies, including Netflix, Uber, and Airbnb, to build real-time data pipelines and streaming applications.

Core Concepts

Topics

Topics are named channels or categories where messages are published. Think of a topic as a message feed organized by category. Examples: user-registrations, purchase-orders, payment-events.

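Each topic is an append-only log (more precisely, one log per partition, as described under Partitions). Messages written to a topic are never modified; new messages are simply appended to the end, and each one is identified by its position (offset) in the log.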
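
To make the append-only idea concrete, here is a minimal sketch of a single-partition log in plain Python. The TopicLog class and its method names are hypothetical and exist only for illustration; this is a toy in-memory model, not a Kafka client API.

```python
class TopicLog:
    """Toy in-memory stand-in for a single-partition topic (not a Kafka API)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset, oldest first."""
        return list(self._records[offset:])


log = TopicLog()
first = log.append("order-1001 created")
second = log.append("order-1002 created")
```

There is deliberately no update or delete method: readers only move forward through the log, starting from whatever offset they choose.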

Producers

Producers are applications that publish (write) messages to Kafka topics. In our workshop, the frontend application is a producer that publishes purchase order messages.

Frontend Application
        ↓ (publishes)
    purchase-orders topic

Consumers

Consumers are applications that subscribe to (read) messages from Kafka topics. In our workshop, the backend application is a consumer that reads and processes purchase orders.

purchase-orders topic
        ↓ (consumed by)
    Backend Application

Consumer Groups

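Consumer groups enable load balancing and fault tolerance. Multiple consumers can join the same group to share the work of processing a topic. Kafka assigns each partition to exactly one consumer in the group, so under normal operation each message is processed by only one member. If a consumer leaves or fails, its partitions are reassigned to the remaining members.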
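
The sharing of work within a group can be sketched as a simple round-robin assignment of partitions to group members. The function name and the consumer names (backend-1, backend-2) are assumptions made for this example; Kafka's actual assignment strategies are more involved, so treat this only as an illustration of the principle.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across group members round-robin.

    Each partition goes to exactly one consumer, so a message is handled
    by only one member of the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
two_members = assign_partitions(partitions, ["backend-1", "backend-2"])
# If backend-2 leaves the group, its partitions move to the survivor.
one_member = assign_partitions(partitions, ["backend-1"])
```

With two members, each owns two of the four partitions; with one member left, it owns all four. No partition is ever owned by two members at once.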

Partitions

Topics are divided into partitions for parallelism and scalability. Each partition is an ordered, immutable sequence of messages. Partitions enable:

Parallelism: Different consumers can read from different partitions simultaneously
Ordering: Messages within a partition keep the order in which they were written; there is no ordering guarantee across partitions
Scalability: Add more partitions to increase throughput
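
One common way to combine parallelism with ordering is to route every message with the same key to the same partition. The sketch below assumes hypothetical customer keys and uses CRC32 as a stand-in hash. Kafka's Java client hashes keys with murmur2, so real partition numbers will differ; only the principle carries over.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition; the same key always gives the same partition."""
    # CRC32 is a stand-in hash for illustration, not what Kafka itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("customer-7", "order placed"),
    ("customer-3", "order placed"),
    ("customer-7", "order paid"),
]

# Three toy partitions, each an ordered list of (key, value) pairs.
partitions = {number: [] for number in range(3)}
for key, value in events:
    partitions[choose_partition(key, 3)].append((key, value))
```

Both customer-7 events end up in the same partition, in the order they were produced, while events for other customers may be processed in parallel elsewhere.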

Brokers

Kafka runs as a cluster of servers called brokers. Each broker stores data for some partitions and handles read/write requests. Our Docker Compose setup runs a single broker on localhost:9092.

In This Workshop

We use the purchase-orders topic to demonstrate event-driven order processing.

Topic Configuration

Topic Name: purchase-orders
Purpose: Asynchronous purchase order processing
Producers: Frontend application
Consumers: Backend application
Message Type: PurchaseOrderDTO

Message Flow

1. User submits order in frontend
2. Frontend produces message to purchase-orders topic
3. Kafka stores message durably
4. Backend consumer receives message
5. Backend processes order and saves to MySQL
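
The five steps above can be sketched end to end with in-memory stand-ins. Everything here is an assumption for illustration: a list plays the role of the purchase-orders topic, a dict plays the role of the MySQL table, and the order fields are made up.

```python
topic = []          # stand-in for the purchase-orders topic
orders_table = {}   # stand-in for the MySQL table


def submit_order(order):
    """Steps 1-3: the frontend produces the order and the topic stores it."""
    topic.append(order)


def run_backend(next_offset):
    """Steps 4-5: read new messages, save each, return the next offset to read."""
    for order in topic[next_offset:]:
        orders_table[order["order_id"]] = order
    return len(topic)


submit_order({"order_id": "po-1", "item": "keyboard"})
submit_order({"order_id": "po-2", "item": "monitor"})
next_offset = run_backend(0)
```

Note that submit_order returns without waiting for the backend. That is the asynchronous decoupling the topic provides: the backend can catch up later, starting from the offset it last reached.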

Serialization

Phase 1 (Current): JSON serialization using JsonSerializer / JsonDeserializer

Phase 2 (Future): Avro serialization with Confluent Schema Registry for better schema evolution
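
Kafka itself stores and transports bytes, so Phase 1 comes down to turning an order object into JSON bytes on the producer side and back into an object on the consumer side. The sketch below shows that round trip in plain Python. The PurchaseOrder fields are hypothetical, since this guide does not list the fields of PurchaseOrderDTO, and the workshop applications rely on JsonSerializer / JsonDeserializer rather than hand-written functions like these.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PurchaseOrder:
    # Hypothetical fields, chosen only for this example.
    order_id: str
    customer_id: str
    quantity: int


def serialize(order):
    """Producer side: object -> UTF-8 encoded JSON bytes."""
    return json.dumps(asdict(order)).encode("utf-8")


def deserialize(payload):
    """Consumer side: UTF-8 encoded JSON bytes -> object."""
    return PurchaseOrder(**json.loads(payload.decode("utf-8")))


original = PurchaseOrder(order_id="po-1", customer_id="customer-7", quantity=2)
payload = serialize(original)
restored = deserialize(payload)
```

Plain JSON carries no schema, so a renamed or missing field only surfaces when deserialization fails on the consumer. That is the gap a schema registry is meant to close in Phase 2.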

Why Use Kafka?

High Throughput: Handle millions of messages per second with low latency
Durability: Messages are persisted to disk and, in multi-broker clusters, replicated across brokers (the single-broker workshop setup keeps only one copy)
Scalability: Scale horizontally by adding more brokers and partitions
Decoupling: Producers and consumers are independent and don't know about each other
Reliability: Fault-tolerant with data replication and automatic failover
Real-Time: Process data as it arrives instead of waiting for scheduled batch jobs

Kafka Guarantees

Order Guarantee: Messages in the same partition maintain order
Durability: Once acknowledged, messages are not lost (with proper configuration)
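At-Least-Once Delivery: When consumers commit offsets only after processing, every message is delivered at least once; duplicates are possible after a failure or rebalance
Exactly-Once Semantics: Available with idempotent, transactional producers and consumers that read only committed data (advanced feature)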
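
Why at-least-once delivery can produce duplicates is easiest to see in a small simulation. The sketch below is a toy model, not Kafka client code: the log is a plain list, the consume function and its crash_before_commit_at parameter are hypothetical, and the "crash" is simulated by returning before the offset is committed.

```python
def consume(log, committed_offset, handle, crash_before_commit_at=None):
    """Process records from committed_offset; return the new committed offset.

    The offset is committed only after handle() succeeds, so a crash between
    processing and committing causes that record to be redelivered.
    """
    for offset in range(committed_offset, len(log)):
        handle(log[offset])
        if offset == crash_before_commit_at:
            return committed_offset  # simulated crash: commit never happened
        committed_offset = offset + 1
    return committed_offset


log = ["po-1", "po-2", "po-3"]
deliveries = []
processed_ids = set()


def handle(order_id):
    deliveries.append(order_id)   # every delivery, duplicates included
    processed_ids.add(order_id)   # idempotent effect: a set ignores repeats


committed = consume(log, 0, handle, crash_before_commit_at=1)
committed = consume(log, committed, handle)  # restart after the crash
```

The order po-2 is delivered twice because the consumer failed after processing it but before committing. Making the handler idempotent, here by keying on the order ID, is a common way to keep such redeliveries harmless.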

Key Takeaways

Kafka is a distributed commit log for building event-driven systems
Topics are append-only logs that organize messages by category
Producers and consumers are decoupled through Kafka topics
Partitions enable parallel processing while maintaining order within each partition