Confluent Kafka 2.x
Problem Statement
| Component | Configuration |
| --- | --- |
| Topic Partitions | 32 |
| Consumer Type | Kafka Streams (intermediate topic) |
| Deployment | StatefulSet with 8 replicas |
| Stream Threads | 2 per replica (16 total) |
| Expected Distribution | 2 partitions per thread |
Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.
Root Cause Analysis
Why This Happens
Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.
StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.
State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.
Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.
Solutions
1. Check for Static Group Membership
If you are using static group membership, the broker remembers the assignment even after pod restart.
# Check if this is set in your Kafka Streams config
group.instance.id=<some-static-id>
Fix: Remove it entirely or make it dynamic.
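If you want a concrete starting point, here is a minimal sketch of both options; the application.id, bootstrap servers, and the HOSTNAME-based naming are illustrative assumptions, not values from the reported setup.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");           // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // illustrative

// Option A: remove static membership entirely by never setting group.instance.id.

// Option B: keep the property but make it dynamic by appending a per-start suffix,
// so the broker no longer recognises a restarted pod as the "same" member.
String podName = System.getenv().getOrDefault("HOSTNAME", "local"); // set by Kubernetes
props.put(
    StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
    "my-app-" + podName + "-" + System.currentTimeMillis());

Note that Option B effectively disables the broker-side memory of the member, so in practice it behaves much like Option A; keep a truly static id only if you rely on static membership to avoid rebalances during short restarts.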
2. Proper Scale Down/Up with Timeout Wait
The key is waiting for session.timeout.ms to expire (default: 10 seconds in Kafka 2.x; raised to 45 seconds in Kafka 3.0).
kubectl scale statefulset <statefulset-name> --replicas=0
sleep 60
kubectl scale statefulset <statefulset-name> --replicas=8
3. Delete the Consumer Group
⚠️ Warning: Only do this when ALL consumers are stopped.
# Scale down to 0
kubectl scale statefulset <statefulset-name> --replicas=0
# Verify no active members
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members
# Delete the consumer group
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete
# Scale back up
kubectl scale statefulset <statefulset-name> --replicas=8
4. Reset Consumer Group Offsets
Resets assignments while preserving current offsets (the group must have no active members for this command to succeed):
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute
5. Force New Client IDs
Modify your StatefulSet to include a random/timestamp suffix in client ID.
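As a sketch of what this can look like from the Streams configuration side (rather than editing the StatefulSet manifest), assuming the Kubernetes-provided HOSTNAME variable and illustrative application/broker names:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // illustrative

// client.id is used as a prefix for the clients Kafka Streams creates internally,
// so a per-start timestamp makes every restart look like a brand-new client
// without renaming the consumer group.
String podName = System.getenv().getOrDefault("HOSTNAME", "local"); // set by Kubernetes
props.put(StreamsConfig.CLIENT_ID_CONFIG,
    "my-app-" + podName + "-" + System.currentTimeMillis());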
6. Change Application ID (Nuclear Option)
Creates a completely new consumer group:
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");
⚠️ Warning: This will create a new consumer group and reprocess from the beginning.
7. Enable Cooperative Rebalancing (Kafka 2.4+)
Kafka Streams 2.4 and later supports cooperative rebalancing, which moves partitions incrementally instead of revoking everything on each rebalance, and new deployments use it by default. If you are upgrading a running application from 2.3 or earlier, switch protocols with two rolling bounces: set upgrade.from to the old version for the first bounce, then remove it for the second (see the sketch below).
props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3");
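For completeness, a short sketch of what the second bounce changes, assuming the same props object used for the rest of your Streams configuration; leaving upgrade.from unset is what actually activates the cooperative protocol:

// Bounce 1 (the line above): new 2.4+ binaries, old eager protocol via upgrade.from="2.3".
// Bounce 2: after all instances run 2.4+, remove upgrade.from and roll the pods again;
// the group then switches to cooperative rebalancing.
props.remove(StreamsConfig.UPGRADE_FROM_CONFIG);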
8. Tune Partition Assignment
Adjust these configurations for better distribution:
ACCEPTABLE_RECOVERY_LAG_CONFIG = 10000L
NUM_STANDBY_REPLICAS_CONFIG = 1
PROBING_REBALANCE_INTERVAL_MS_CONFIG = 600000L
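These constants belong to StreamsConfig (the recovery-lag and probing settings arrived with the high-availability task assignor in Kafka Streams 2.6). A minimal sketch of applying them, with illustrative application and broker names:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // illustrative

// Maximum lag at which a standby is still considered caught up and eligible
// to take over an active task during a rebalance.
props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);
// Keep one warm copy of each state store on another instance.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
// How often the assignor triggers a probing rebalance to move tasks onto
// instances that have finished warming up (10 minutes here).
props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L);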
Diagnostic Commands
Check Current Consumer Group Status
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe
Check Member Assignments (Verbose)
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose
Monitor Lag
kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -k6 -n -r
Recommended Fix Sequence
1. Check current state with --describe --members --verbose
2. Scale down completely: kubectl scale statefulset <name> --replicas=0
3. Wait for session timeout (60+ seconds): sleep 90
4. Verify group is empty
5. Delete consumer group (if still exists)
6. Scale back up: kubectl scale statefulset <name> --replicas=8
7. Verify new distribution after 30 seconds
Prevention (Long-term Fixes)
- Do not use static group membership unless you have a specific need
- Use cooperative rebalancing if on Kafka 2.4+
- Monitor partition assignment regularly (a sketch for logging it from inside the application follows this list)
- Set appropriate max.poll.interval.ms to detect slow consumers
- Use standby replicas for stateful applications
- Ensure partition count is divisible by expected consumer count
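For the monitoring point above, one lightweight option is to log each stream thread's active task assignment from inside the application. A minimal sketch, assuming a running KafkaStreams instance and the 2.x metadata API (localThreadsMetadata(), later superseded by metadataForLocalThreads()):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.processor.TaskMetadata;
import org.apache.kafka.streams.processor.ThreadMetadata;

// Call periodically (e.g. from a scheduled task or a health endpoint) to see
// which partitions each stream thread on this pod currently owns.
static void logAssignments(KafkaStreams streams) {
    for (ThreadMetadata thread : streams.localThreadsMetadata()) {
        for (TaskMetadata task : thread.activeTasks()) {
            System.out.printf("thread=%s task=%s partitions=%s%n",
                thread.threadName(), task.taskId(), task.topicPartitions());
        }
    }
}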
Related Configurations
| Configuration | Default | Description |
| --- | --- | --- |
| session.timeout.ms | 10000 (45000 from Kafka 3.0) | Time before the broker considers a consumer dead |
| heartbeat.interval.ms | 3000 | Frequency of heartbeats to the broker |
| max.poll.interval.ms | 300000 | Max time between poll() calls |
| group.instance.id | null | Static membership identifier |
| num.standby.replicas | 0 | Number of standby replicas for state stores |
| acceptable.recovery.lag | 10000 | Max lag at which a standby is still considered caught up |
Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”
-Satyjeet Shukla
AI Strategist & Solutions Architect
