Tag: #KafkaRebalance

    Kafka Streams Rebalance Troubleshooting

    Confluent Kafka 2.x

    Problem Statement

    Component               Configuration
    Topic Partitions        32
    Consumer Type           Kafka Streams (intermediate topic)
    Deployment              StatefulSet with 8 replicas
    Stream Threads          2 per replica (16 total)
    Expected Distribution   2 partitions per thread

    Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.

    Root Cause Analysis

    Why This Happens

    Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.

    StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.

    State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.

    Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.

    Solutions

    1. Check for Static Group Membership

    If you are using static group membership, the broker remembers the assignment even after pod restart.

    # Check if this is set in your Kafka Streams config

    group.instance.id=<some-static-id>

    Fix: Remove it entirely or make it dynamic.
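
    If the application genuinely needs static membership (for example, to avoid rebalances on routine rolling restarts), one option is to set it only when explicitly requested. A minimal sketch, assuming the Streams properties are built in Java; the GROUP_INSTANCE_ID environment variable is a hypothetical name, and when it is absent membership stays dynamic so a restarted pod rejoins as a new member:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    Properties props = new Properties();
    // Hypothetical env var: only opt in to static membership when it is explicitly set.
    String staticId = System.getenv("GROUP_INSTANCE_ID");
    if (staticId != null && !staticId.isEmpty()) {
        // Kafka Streams forwards this consumer config to its internal consumers.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, staticId);
    }
    // Otherwise group.instance.id stays unset: dynamic membership, normal rebalancing.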

    2. Proper Scale Down/Up with Timeout Wait

    The key is waiting for session.timeout.ms to expire so the broker forgets the old members (the default is 10 seconds in Kafka 2.x; Kafka 3.0 raised it to 45 seconds).

    kubectl scale statefulset <statefulset-name> --replicas=0

    sleep 60

    kubectl scale statefulset <statefulset-name> --replicas=8

    3. Delete the Consumer Group

    ⚠️ Warning: Only do this when ALL consumers are stopped.

    # Scale down to 0

    kubectl scale statefulset <statefulset-name> --replicas=0

    # Verify no active members

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members

    # Delete the consumer group

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete

    # Scale back up

    kubectl scale statefulset <statefulset-name> --replicas=8

    4. Reset Consumer Group Offsets

    Resets the group while preserving current offsets (so nothing is reprocessed); the group must have no active members for the command to succeed:

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute

    5. Force New Client IDs

    Modify your StatefulSet so that the client.id includes a random or timestamp suffix.
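
    A minimal sketch of what this can look like on the application side, assuming the Streams properties are built in Java; relying on the HOSTNAME environment variable (which Kubernetes sets to the pod name) is an assumption about your deployment:

    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // Pod name plus a random suffix gives each (re)start a fresh client.id,
    // so the restarted pod no longer looks like the "same" client returning.
    props.put(StreamsConfig.CLIENT_ID_CONFIG,
              System.getenv("HOSTNAME") + "-" + UUID.randomUUID());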

    6. Change Application ID (Nuclear Option)

    Creates a completely new consumer group:

    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");

    ⚠️ Warning: This will create a new consumer group and reprocess from the beginning.

    7. Enable Cooperative Rebalancing (Kafka 2.4+)

    Kafka Streams 2.4 and later uses the cooperative (incremental) rebalancing protocol by default, so only the partitions that actually need to move are revoked. When upgrading a running application from 2.3 or earlier, set upgrade.from for the first rolling bounce and remove it once all instances are on the new version:

    props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3");

    8. Tune Partition Assignment

    Adjust these StreamsConfig settings for better distribution (the recovery-lag and probing-rebalance options require Kafka Streams 2.6+):

    // Max lag at which an instance still counts as caught up for active-task assignment
    props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);

    // Keep one warm standby copy of each state store on another instance
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

    // How often the assignor checks whether warming-up instances have caught up (10 minutes)
    props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L);

    Diagnostic Commands

    Check Current Consumer Group Status

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe

    Check Member Assignments (Verbose)

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose

    Monitor Lag

    kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -k5 -n -r

    Recommended Fix Sequence

    1. Check current state with --describe --members --verbose

    2. Scale down completely: kubectl scale statefulset <name> --replicas=0

    3. Wait for session timeout (60+ seconds): sleep 90

    4. Verify group is empty

    5. Delete the consumer group (if it still exists)

    6. Scale back up: kubectl scale statefulset <name> --replicas=8

    7. Verify new distribution after 30 seconds

    Prevention (Long-term Fixes)

    • Do not use static group membership unless you have a specific need
    • Use cooperative rebalancing if on Kafka 2.4+
    • Monitor partition assignment regularly
    • Set appropriate max.poll.interval.ms to detect slow consumers
    • Use standby replicas for stateful applications (see the sketch after this list)
    • Ensure the partition count divides evenly across the expected number of stream threads
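
    A minimal sketch of the poll-interval and standby-replica settings from the list above, assuming Java-built properties; the values shown are illustrative, not recommendations for every workload:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // Warm standby copies of state stores so a failover does not force one pod
    // to rebuild all state while temporarily holding all the lagging partitions.
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    // Cap the time between poll() calls; a thread that exceeds it is removed
    // from the group instead of silently holding on to its partitions.
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);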

    Related Configurations

    Configuration              Default    Description
    session.timeout.ms         10000      Time before the broker considers a consumer dead (45000 from Kafka 3.0)
    heartbeat.interval.ms      3000       Frequency of heartbeats to the broker
    max.poll.interval.ms       300000     Max time between poll() calls
    group.instance.id          null       Static membership identifier
    num.standby.replicas       0          Number of standby replicas for state stores
    acceptable.recovery.lag    10000      Max lag at which a client still counts as caught up

    Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”

    -Satyjeet Shukla

    AI Strategist & Solutions Architect