Kafka Streams Rebalance Troubleshooting

Confluent Kafka 2.x

Problem Statement

Component               Configuration
Topic Partitions        32
Consumer Type           Kafka Streams (intermediate topic)
Deployment              StatefulSet with 8 replicas
Stream Threads          2 per replica (16 total)
Expected Distribution   2 partitions per thread

Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.

Root Cause Analysis

Why This Happens

Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.

StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.

State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.

Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.

Solutions

1. Check for Static Group Membership

If you are using static group membership, the broker remembers the assignment even after pod restart.

# Check if this is set in your Kafka Streams config

group.instance.id=<some-static-id>

Fix: Remove it entirely or make it dynamic.
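If the property has to stay for some reason, one option (a sketch, not from the original guide; the application name is a placeholder) is to make the value unique per pod start, so the broker no longer hands the old assignment back; otherwise simply delete the line.

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();

// Option A (preferred): do not set group.instance.id at all - dynamic membership.

// Option B: keep the property but randomize it on every start, which effectively
// disables static membership because the broker sees a brand-new member each time.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
        "my-streams-app-" + UUID.randomUUID()); // placeholder name + random suffix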

2. Proper Scale Down/Up with Timeout Wait

The key is waiting for session.timeout.ms to expire so the broker forgets the old members (the consumer default is 10 seconds in Kafka 2.x; it was raised to 45 seconds in Kafka 3.0).

kubectl scale statefulset <statefulset-name> --replicas=0

sleep 60

kubectl scale statefulset <statefulset-name> --replicas=8

3. Delete the Consumer Group

⚠️ Warning: Only do this when ALL consumers are stopped.

# Scale down to 0

kubectl scale statefulset <statefulset-name> --replicas=0

# Verify no active members

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members

# Delete the consumer group

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete

# Scale back up

kubectl scale statefulset <statefulset-name> --replicas=8

4. Reset Consumer Group Offsets

Resets assignments while preserving current offsets (the group must have no active members when you run this):

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute

5. Force New Client IDs

Modify your StatefulSet (or Streams configuration) so that each pod start presents a fresh client.id, for example by appending a random or timestamp suffix (see the sketch below).
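A minimal sketch of the same idea expressed directly in the Streams configuration rather than in the StatefulSet spec; the application name is a placeholder, and the trade-off is that client-level metric names change on every restart:

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Each pod start presents a brand-new client.id, so no assignment history is
// associated with it on the broker side.
props.put(StreamsConfig.CLIENT_ID_CONFIG, "my-streams-app-" + UUID.randomUUID());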

6. Change Application ID (Nuclear Option)

Creates a completely new consumer group:

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");

⚠️ Warning: This will create a new consumer group and reprocess from the beginning.

7. Enable Cooperative Rebalancing (Kafka 2.4+)

In Kafka Streams 2.4 and later, cooperative (incremental) rebalancing is enabled by default, so partitions can move without a stop-the-world rebalance. The upgrade.from setting is only needed while rolling-upgrading from 2.3 or earlier: set it for the first rolling bounce, then remove it and perform a second bounce.

props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3");

8. Tune Partition Assignment

Adjust these configurations for better distribution (a sketch applying them follows the list):

ACCEPTABLE_RECOVERY_LAG_CONFIG = 10000L

NUM_STANDBY_REPLICAS_CONFIG = 1

PROBING_REBALANCE_INTERVAL_MS_CONFIG = 600000L
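A sketch of how these settings (values as listed above) could be applied, assuming Kafka Streams 2.6 or later, where the recovery-lag and probing-rebalance options exist:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Let a warming-up instance take over a task once its state store is within
// 10,000 records of the active copy.
props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);
// Keep one standby copy of each state store so tasks can move without a long restore.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
// Every 10 minutes, probe whether warmed-up standbys can take over and rebalance further.
props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L);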

Diagnostic Commands

Check Current Consumer Group Status

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe

Check Member Assignments (Verbose)

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose

Monitor Lag

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -k5 -n -r

Recommended Fix Sequence

1. Check current state with --describe --members --verbose

2. Scale down completely: kubectl scale statefulset <name> --replicas=0

3. Wait for session timeout (60+ seconds): sleep 90

4. Verify the group has no remaining members

5. Delete the consumer group (if it still exists)

6. Scale back up: kubectl scale statefulset <name> --replicas=8

7. Verify new distribution after 30 seconds

Prevention (Long-term Fixes)

  • Do not use static group membership unless you have a specific need
  • Use cooperative rebalancing if on Kafka 2.4+
  • Monitor partition assignment regularly
  • Set an appropriate max.poll.interval.ms so slow or stuck consumers are detected
  • Use standby replicas for stateful applications (see the sketch after this list)
  • Ensure partition count is divisible by expected consumer count
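For the configuration-related items above, a small illustrative sketch (values are examples only; max.poll.interval.ms shown here is the default):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Detect consumers stuck in processing: a member that does not call poll() within
// this interval is removed from the group and its partitions are reassigned.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
// Standby replicas keep warm copies of state stores on other instances.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
// Intentionally NOT set: group.instance.id (static membership) - enable it only when needed.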

Related Configurations

Configuration            Default                          Description
session.timeout.ms       10000 (45000 since Kafka 3.0)    Time before the broker considers a consumer dead
heartbeat.interval.ms    3000                             Frequency of heartbeats sent to the broker
max.poll.interval.ms     300000                           Maximum time allowed between poll() calls
group.instance.id        null                             Static membership identifier
num.standby.replicas     0                                Number of standby replicas for state stores
acceptable.recovery.lag  10000                            Maximum lag at which a replica is considered caught up

Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”

-Satyjeet Shukla

AI Strategist & Solutions Architect
