Tag: #KafkaRebalance

    Kafka Streams Rebalance Troubleshooting

    Confluent Kafka 2.x

    Problem Statement

    Component               Configuration
    Topic Partitions        32
    Consumer Type           Kafka Streams (intermediate topic)
    Deployment              StatefulSet with 8 replicas
    Stream Threads          2 per replica (16 total)
    Expected Distribution   2 partitions per thread

    Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.

    Root Cause Analysis

    Why This Happens

    Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.

    StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.

    State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.

    Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.

    Solutions

    1. Check for Static Group Membership

    If you are using static group membership, the broker remembers the assignment even after pod restart.

    # Check if this is set in your Kafka Streams config

    group.instance.id=<some-static-id>

    Fix: Remove it entirely or make it dynamic.
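
    If the application genuinely needs static membership (for example, to avoid rebalances on routine rolling restarts), one option is to set it only when explicitly requested. A minimal sketch, assuming the Streams properties are built in Java; the GROUP_INSTANCE_ID environment variable is a hypothetical name, and when it is absent membership stays dynamic so a restarted pod rejoins as a new member:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    Properties props = new Properties();
    // Hypothetical env var: only opt in to static membership when it is explicitly set.
    String staticId = System.getenv("GROUP_INSTANCE_ID");
    if (staticId != null && !staticId.isEmpty()) {
        // Kafka Streams forwards this consumer config to its internal consumers.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, staticId);
    }
    // Otherwise group.instance.id stays unset: dynamic membership, normal rebalancing.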

    2. Proper Scale Down/Up with Timeout Wait

    The key is waiting for session.timeout.ms to expire so the broker forgets the old members (the default is 10 seconds in Kafka 2.x; Kafka 3.0 raised it to 45 seconds).

    kubectl scale statefulset <statefulset-name> --replicas=0

    sleep 60

    kubectl scale statefulset <statefulset-name> --replicas=8

    3. Delete the Consumer Group

    ⚠️ Warning: Only do this when ALL consumers are stopped.

    # Scale down to 0

    kubectl scale statefulset <statefulset-name> --replicas=0

    # Verify no active members

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members

    # Delete the consumer group

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete

    # Scale back up

    kubectl scale statefulset <statefulset-name> --replicas=8

    4. Reset Consumer Group Offsets

    Resets the group while preserving current offsets (so nothing is reprocessed); the group must have no active members for the command to succeed:

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute

    5. Force New Client IDs

    Modify your StatefulSet so that the client.id includes a random or timestamp suffix.
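
    A minimal sketch of what this can look like on the application side, assuming the Streams properties are built in Java; relying on the HOSTNAME environment variable (which Kubernetes sets to the pod name) is an assumption about your deployment:

    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // Pod name plus a random suffix gives each (re)start a fresh client.id,
    // so the restarted pod no longer looks like the "same" client returning.
    props.put(StreamsConfig.CLIENT_ID_CONFIG,
              System.getenv("HOSTNAME") + "-" + UUID.randomUUID());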

    6. Change Application ID (Nuclear Option)

    Creates a completely new consumer group:

    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");

    ⚠️ Warning: This will create a new consumer group and reprocess from the beginning.

    7. Enable Cooperative Rebalancing (Kafka 2.4+)

    Kafka Streams 2.4 and later uses the cooperative (incremental) rebalancing protocol by default, so only the partitions that actually need to move are revoked. When upgrading a running application from 2.3 or earlier, set upgrade.from for the first rolling bounce and remove it once all instances are on the new version:

    props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3");

    8. Tune Partition Assignment

    Adjust these StreamsConfig settings for better distribution (the recovery-lag and probing-rebalance options require Kafka Streams 2.6+):

    // Max lag at which an instance still counts as caught up for active-task assignment
    props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);

    // Keep one warm standby copy of each state store on another instance
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

    // How often the assignor checks whether warming-up instances have caught up (10 minutes)
    props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L);

    Diagnostic Commands

    Check Current Consumer Group Status

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe

    Check Member Assignments (Verbose)

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose

    Monitor Lag

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -k5 -n -r

    Recommended Fix Sequence

    1. Check current state with --describe --members --verbose

    2. Scale down completely: kubectl scale statefulset <name> --replicas=0

    3. Wait for session timeout (60+ seconds): sleep 90

    4. Verify group is empty

    5. Delete the consumer group (if it still exists)

    6. Scale back up: kubectl scale statefulset <name> --replicas=8

    7. Verify new distribution after 30 seconds

    Prevention (Long-term Fixes)

    • Do not use static group membership unless you have a specific need
    • Use cooperative rebalancing if on Kafka 2.4+
    • Monitor partition assignment regularly
    • Set appropriate max.poll.interval.ms to detect slow consumers
    • Use standby replicas for stateful applications (see the sketch after this list)
    • Ensure the partition count divides evenly across the expected number of stream threads
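
    A minimal sketch of the poll-interval and standby-replica settings from the list above, assuming Java-built properties; the values shown are illustrative, not recommendations for every workload:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // Warm standby copies of state stores so a failover does not force one pod
    // to rebuild all state while temporarily holding all the lagging partitions.
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    // Cap the time between poll() calls; a thread that exceeds it is removed
    // from the group instead of silently holding on to its partitions.
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);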

    Related Configurations

    Configuration              Default    Description
    session.timeout.ms         10000      Time before the broker considers a consumer dead (45000 from Kafka 3.0)
    heartbeat.interval.ms      3000       Frequency of heartbeats to the broker
    max.poll.interval.ms       300000     Max time between poll() calls
    group.instance.id          null       Static membership identifier
    num.standby.replicas       0          Number of standby replicas for state stores
    acceptable.recovery.lag    10000      Max lag at which a client still counts as caught up

    Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”

    -Satyjeet Shukla

    AI Strategist & Solutions Architect