Apache Kafka Fundamentals
Apache Kafka is a distributed streaming platform and message broker that enables building real-time event-driven applications at scale.
What is Apache Kafka?
Apache Kafka is a distributed commit log and streaming platform. Think of it as a highly scalable, fault-tolerant messaging system that, on a suitably sized cluster, can handle millions of messages per second.
Originally developed at LinkedIn, where it is still used, Kafka is now an open-source Apache project used by thousands of companies, including Netflix, Uber, and Airbnb, for building real-time data pipelines and streaming applications.
Core Concepts
Topics
Topics are named channels or categories where messages are published. Think of a topic as a message feed organized by category. Examples: user-registrations, purchase-orders, payment-events.
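Topics are append-only logs. Messages written to a topic are never modified; new messages are simply appended to the end. Reading a message does not remove it: Kafka deletes old messages only according to the topic's retention settings.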
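The append-only behavior can be illustrated with a small in-memory model. This is only a sketch of the idea, not a Kafka client API: the TopicLog class and its methods are hypothetical names, and the "offset" here is simply a record's position in the log.

```python
class TopicLog:
    """In-memory stand-in for a single topic log (hypothetical, not a Kafka API)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record to the end and return its offset (position in the log)."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset):
        """Return a copy of all records from `offset` onward; nothing is removed."""
        return list(self._records[offset:])


log = TopicLog()
first = log.append("order-1")
second = log.append("order-2")

# Reading twice returns the same records, because reads never modify the log.
everything = log.read(0)
everything_again = log.read(0)
```

Because reads leave the log untouched, several independent readers can each keep their own position and replay the same records.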
Producers
Producers are applications that publish (write) messages to Kafka topics. In our workshop, the frontend application is a producer that publishes purchase order messages.
Frontend Application
↓ (publishes)
purchase-orders topic
Consumers
Consumers are applications that subscribe to (read) messages from Kafka topics. In our workshop, the backend application is a consumer that reads and processes purchase orders.
purchase-orders topic
↓ (consumed by)
Backend Application
Consumer Groups
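Consumer Groups enable load balancing and fault tolerance. Multiple consumers can join the same group to share the workload of processing messages from a topic. Kafka assigns each partition of the topic to exactly one consumer in the group, so each message is delivered to only one member of the group. If a consumer fails, its partitions are reassigned to the remaining members.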
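The way a group shares work can be sketched as a simple assignment of partitions (covered in the next section) to group members. This is a simplified round-robin illustration under assumed names (assign_partitions, backend-1, backend-2); Kafka's real assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two group members share four partitions.
before = assign_partitions([0, 1, 2, 3], ["backend-1", "backend-2"])

# If backend-2 leaves the group, the survivor takes over every partition.
after_failure = assign_partitions([0, 1, 2, 3], ["backend-1"])
```

In this model a partition never has two owners within the same group, which is why a message is handled by only one group member.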
Partitions
Topics are divided into partitions for parallelism and scalability. Each partition is an ordered, immutable sequence of messages. Partitions enable:
- Parallelism: Different consumers can read from different partitions simultaneously
- Ordering: Messages within a partition keep the order in which they were written; Kafka does not guarantee ordering across partitions
- Scalability: Add more partitions to increase throughput
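One common way to decide which partition a message goes to is to hash a message key, so that all messages with the same key land in the same partition and stay in order. The sketch below assumes keyed messages and uses zlib.crc32 purely for illustration; Kafka's own partitioner uses a different hash function, and choose_partition and the customer keys are hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("customer-a", "created"),
    ("customer-b", "created"),
    ("customer-a", "paid"),
]
for key, event in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, event))

# All events for customer-a sit in one partition, in the order they were sent.
target = choose_partition("customer-a", NUM_PARTITIONS)
customer_a_events = [event for key, event in partitions[target] if key == "customer-a"]
```

Note that in a scheme like this, changing the number of partitions changes which partition a given key maps to.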
Brokers
Kafka runs as a cluster of servers called brokers. Each broker stores data for some partitions and handles read/write requests. Our Docker Compose setup runs a single broker on localhost:9092.
In This Workshop
We use the purchase-orders topic to demonstrate event-driven order processing:
Topic Configuration
Topic Name: purchase-orders
Purpose: Asynchronous purchase order processing
Producers: Frontend application
Consumers: Backend application
Message Type: PurchaseOrderDTO
Message Flow
1. User submits order in frontend
2. Frontend produces message to purchase-orders topic
3. Kafka stores message durably
4. Backend consumer receives message
5. Backend processes order and saves to MySQL
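The five steps above can be modeled end to end in a few lines. This is an in-memory sketch only: a Python list stands in for the purchase-orders topic, a dict stands in for the MySQL table, and the order fields (orderId, item, quantity) are assumptions, not the workshop's actual PurchaseOrderDTO schema.

```python
import json

purchase_orders = []  # stands in for the purchase-orders topic (append-only)
orders_table = {}     # stands in for the MySQL table


def submit_order(order):
    """Steps 1-3: the frontend serializes the order and appends it to the topic."""
    purchase_orders.append(json.dumps(order))


class BackendConsumer:
    """Steps 4-5: reads new messages and saves each order."""

    def __init__(self):
        self.offset = 0  # position of the next message to read

    def poll_and_process(self):
        """Process every message not yet seen and return how many were handled."""
        new_messages = purchase_orders[self.offset:]
        for raw in new_messages:
            order = json.loads(raw)
            orders_table[order["orderId"]] = order
        self.offset += len(new_messages)
        return len(new_messages)


submit_order({"orderId": "po-1", "item": "keyboard", "quantity": 2})
submit_order({"orderId": "po-2", "item": "monitor", "quantity": 1})

backend = BackendConsumer()
processed = backend.poll_and_process()
```

The frontend returns as soon as the message is stored; the backend picks it up whenever it next polls, which is what makes the processing asynchronous.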
Serialization
Phase 1 (Current): JSON serialization using JsonSerializer / JsonDeserializer
Phase 2 (Future): Avro serialization with Confluent Schema Registry for better schema evolution
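As a rough analogue of what a JSON serializer and deserializer pair does, the sketch below converts an order object to bytes and back. The field names on PurchaseOrderDTO are hypothetical, since the real class definition is not shown here, and the serialize and deserialize helpers are illustrative rather than the workshop's JsonSerializer / JsonDeserializer.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PurchaseOrderDTO:
    # Hypothetical fields for illustration only.
    order_id: str
    product: str
    quantity: int


def serialize(order):
    """Turn the DTO into UTF-8 encoded JSON bytes, ready to send as a message value."""
    return json.dumps(asdict(order)).encode("utf-8")


def deserialize(payload):
    """Rebuild the DTO from JSON bytes received from the topic."""
    return PurchaseOrderDTO(**json.loads(payload.decode("utf-8")))


original = PurchaseOrderDTO(order_id="po-1", product="keyboard", quantity=2)
payload = serialize(original)
restored = deserialize(payload)
```

With plain JSON, producer and consumer must agree on field names by convention, and a renamed or removed field only fails at read time. That is the gap a schema registry is meant to close.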
Why Use Kafka?
- High Throughput: Handle millions of messages per second with low latency
- Durability: Messages are persisted to disk and, in a multi-broker cluster, replicated across brokers
- Scalability: Scale horizontally by adding more brokers and partitions
- Decoupling: Producers and consumers are independent and don't know about each other
- Reliability: Fault-tolerant with data replication and automatic failover
- Real-Time: Process data as it arrives, not in batches
Kafka Guarantees
- Order Guarantee: Messages in the same partition maintain order
- Durability: Once acknowledged, messages are not lost (with proper configuration)
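- At-Least-Once Delivery: When consumers commit offsets only after processing, every message is delivered at least once; after a failure a message may be redelivered, so consumers should tolerate duplicates
- Exactly-Once Semantics: Available with idempotent, transactional producers and consumers that read only committed messages (advanced feature)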
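Under at-least-once delivery the same message can arrive more than once, so a consumer is usually written to be idempotent. A minimal sketch, assuming each order carries a unique ID (the process_deliveries helper and the sample IDs are hypothetical):

```python
def process_deliveries(deliveries, handled_ids, applied):
    """Apply each order once, skipping IDs that were already handled."""
    for order_id in deliveries:
        if order_id in handled_ids:
            continue  # duplicate redelivery: already processed, so ignore it
        applied.append(order_id)
        handled_ids.add(order_id)


handled = set()
applied_orders = []

# "po-1" appears twice, as if it were redelivered after a consumer restart.
process_deliveries(["po-1", "po-2", "po-1"], handled, applied_orders)
```

In a real system the set of handled IDs would need to be stored durably, for example alongside the order data, so that it survives a restart.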
Key Takeaways
- Kafka is a distributed commit log for building event-driven systems
- Topics organize messages by category and store them as append-only logs
- Producers and consumers are decoupled through Kafka topics
- Partitions enable parallel processing while maintaining order within each partition