Tag: cloud

  • Beyond the Binary: Monoliths, Event-Driven Systems, and the Hybrid Future

    In software engineering, architectural discussions often devolve into a binary choice: the “legacy” Monolith versus the “modern” Microservices. This dichotomy is not only false but dangerous. It forces teams to choose between the operational simplicity of a single unit and the decoupled scalability of distributed systems, often ignoring a vast middle ground.

Recently, the rise of API-driven, Event-Driven Architectures (EDA) has added a third dimension, promising reactive, real-time systems. But for a technical leader or a systems architect, the question isn’t “which is best?” but “which constraints am I optimising for?”

    This article explores the trade-offs between Monolithic and Event-Driven systems and makes a case for the pragmatic middle ground: the Hybrid approach.

    1. The Monolith: Alive and Kicking

    The term “Monolith” often conjures images of unmaintainable “Big Ball of Mud” codebases. However, a well-designed Modular Monolith is a legitimate architectural choice for 90% of use cases.

    The Strengths

•   Transactional Integrity (ACID): The single biggest advantage of a monolith is the ability to run a complex business process (e.g., “Place Order”) within a single database transaction (see the sketch after this list). If any part fails, the whole operation rolls back. In distributed systems, this simple guarantee is replaced by complex Sagas or two-phase commits.
    •   Operational Simplicity: One deployment pipeline, one monitoring dashboard, one database to back up. The cognitive load on the ops team is significantly lower.
    •   Zero-Latency Communication: Function calls are orders of magnitude faster than network calls. You don’t need to worry about serialization overhead, network partitions, or retries.
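
As a concrete illustration of the first point, here is a minimal sketch of “Place Order” running as one ACID transaction in plain JDBC. The table names, columns, and DataSource wiring are illustrative placeholders, not a prescribed implementation.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class OrderService {
    private final DataSource dataSource;

    public OrderService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Insert the order and decrement stock in ONE transaction:
    // either both changes land, or neither does.
    public void placeOrder(long customerId, String sku, int quantity) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement insertOrder = conn.prepareStatement(
                         "INSERT INTO orders (customer_id, sku, quantity) VALUES (?, ?, ?)");
                 PreparedStatement decrementStock = conn.prepareStatement(
                         "UPDATE inventory SET available = available - ? WHERE sku = ?")) {
                insertOrder.setLong(1, customerId);
                insertOrder.setString(2, sku);
                insertOrder.setInt(3, quantity);
                insertOrder.executeUpdate();

                decrementStock.setInt(1, quantity);
                decrementStock.setString(2, sku);
                decrementStock.executeUpdate();

                conn.commit(); // all-or-nothing
            } catch (SQLException e) {
                conn.rollback(); // any failure undoes the whole business operation
                throw e;
            }
        }
    }
}

The equivalent flow in a distributed system would need a Saga with compensating actions for every step.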

The Limits

The monolith hits a wall when team scale outpaces code modularity. When 50 developers are merging into the same repo, merge conflicts and slow CI/CD pipelines become the bottleneck.

2. API-Driven, Event-Driven Architectures

    In this model, services don’t just “call” each other via HTTP; they emit “events” (facts about what just happened) to a broker (Kafka, RabbitMQ, EventBridge). Other services subscribe to these events and react.
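
As a hedged sketch of what “emitting an event” looks like in code, here is a service publishing an OrderPlaced fact with the plain Kafka producer API. The broker address, topic name, and JSON payload are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderPlacedPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The OrderService records a fact about what happened and moves on;
        // it neither knows nor cares who consumes it.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"type\":\"OrderPlaced\",\"orderId\":\"42\",\"total\":99.50}";
            producer.send(new ProducerRecord<>("orders.events", "42", event));
        }
    }
}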

    The Strengths

    •   True Decoupling: The OrderService doesn’t know the EmailService exists. It just screams “OrderPlaced” into the void. This allows you to plug in new functionality (e.g., a “FraudDetection” service) without touching the core flow.
    •   Asynchronous Resilience: If the InventoryService is down, the OrderService can still accept orders. The events will just sit in the queue until the consumer recovers.
    •   Scale Asymmetry: An image processing service might need 100x more CPU than the user profile service. You can scale them independently without over-provisioning the rest of the system.

    The Tax

The cost of this power is complexity. You now live in a world of eventual consistency. A user might place an order but not see it in their history for 2 seconds. Debugging a flow that hops across 5 services via asynchronous message queues requires sophisticated observability (Distributed Tracing) and mature DevOps practices.

    3. The Hybrid Approach: The “Citadel” and Modular Monoliths

It is rarely an all-or-nothing decision. The most successful systems often employ a hybrid strategy, variously described as the Citadel Pattern or reached incrementally via the Strangler Fig pattern.

    Pattern A: The Modular Monolith (Internal EDA)

    You build a single deployable unit, but internally, you enforce strict boundaries.

•   Internal Events: Instead of Module A calling Module B’s class directly, you can use an in-memory event bus (see the sketch after this list). When a user registers, the User Module publishes a domain event. The Notification Module subscribes to it.
    •   Why?: This gives you the decoupling benefits of EDA (code isolation) without the operational tax of distributed systems (network failures, serialization).
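
A minimal sketch of that idea, assuming a hand-rolled bus (in practice you might reach for Spring application events or Guava’s EventBus instead); the class and event names are illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A tiny in-process event bus: modules communicate through domain events
// instead of calling each other's classes directly.
final class EventBus {
    private final Map<Class<?>, List<Consumer<Object>>> subscribers = new HashMap<>();

    <T> void subscribe(Class<T> eventType, Consumer<T> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>())
                   .add(event -> handler.accept(eventType.cast(event)));
    }

    void publish(Object event) {
        subscribers.getOrDefault(event.getClass(), List.of())
                   .forEach(handler -> handler.accept(event));
    }
}

// Domain event published by the User Module.
record UserRegistered(String userId, String email) {}

public class ModularMonolithDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // The Notification Module subscribes without the User Module knowing it exists.
        bus.subscribe(UserRegistered.class,
                e -> System.out.println("Sending welcome email to " + e.email()));

        // The User Module publishes the fact; still one process, one deployment, no network hop.
        bus.publish(new UserRegistered("u-123", "alice@example.com"));
    }
}

If the Notification Module later becomes a genuine bottleneck, the same event can be republished to an external broker without touching the User Module.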

    Pattern B: The Citadel (Monolith + Satellites)

    Keep your core, complex business domain (e.g., the billing engine or policy ledger) in a Monolith. This domain likely benefits from ACID transactions and complex data joins.

•   Offload peripheral concerns and highly volatile, high-scale workloads to satellite microservices.
    •   Example: A core Banking Monolith handles the ledger. However, the “PDF Statement Generation” is an external microservice because it is CPU intensive and stateless. The “Mobile API Adapter” is a separate service to allow for rapid iteration on UI needs without risking the core bank.

    4. The Cost Dimension: Infrastructure & People

    Cost is often the silent killer in architectural decisions. It’s not just about the AWS bill; it’s about the Total Cost of Ownership (TCO).

    Infrastructure Costs

•   Monolith: Generally cheaper at low-to-medium scale. You pay for fixed compute (e.g., 2 EC2 instances) and save on data transfer costs because communication is in-memory. However, scaling is inefficient: if one module needs more RAM, you have to upgrade the entire server.
•   Event-Driven/Microservices: The “Cloud Tax” is real. You pay for:
    •   Managed Services: Kafka (MSK) or RabbitMQ clusters are expensive to run yourself and expensive to rent as managed services.
    •   Data Transfer: Every event crossing an Availability Zone (AZ) or Region boundary incurs a cost.
    •   Base Overhead: Running 50 containers requires more base CPU/RAM overhead than running 1 container with 50 modules.
•   Savings: Distributed systems only save money at massive scale, where granular scaling (spinning up 1,000 tiny instances for just the billing service) outweighs the overhead tax.

    Organizational Costs (Engineering Salary)

    •   Monolith: Lower. Generalist developers can contribute easily. Operations require fewer specialists.
    •   Event-Driven: Higher. You need strict platform engineering, SREs to manage the service mesh/brokers, and developers who understand distributed tracing and idempotency.

    Decision Framework: When to Prefer Which?

    Don’t follow the hype. Follow the constraints.

Constraint | Prefer Monolith | Prefer Event-Driven/Microservices
Team Size | Small (< 20 engineers), tight communication. | Large, multiple independent squads (2-pizza teams).
Domain Complexity | High complexity, deep coupling, needs strict consistency. | Clearly defined sub-domains (e.g., Shipping is distinct from Billing).
Traffic Patterns | Uniform scale requirement. | Asymmetrical scale (one feature needs massive scale).
Consistency | Strong (ACID) is non-negotiable. | Eventual consistency is acceptable.
Cost Sensitivity | Bootstrapped/Low Budget. Optimizes for low operational overhead. | High Budget/Enterprise. Willing to pay premium for high availability and granular scale.

    Conclusion

    Hybrid approaches allow you to “architect for the team you have, not the team you want.” Start with a Modular Monolith. Use internal events to decouple your code. Only when a specific module needs independent scaling or has a distinct release cycle should you carve it out into a separate service.

    By treating architecture as a dial rather than a switch, you avoid the complexity tax until you actually need the power it buys you.

    -Satyjeet Shukla

    AI Strategist & Solutions Architect

  • Kafka Streams Rebalance Troubleshooting

    Confluent Kafka 2.x

    Problem Statement

Component | Configuration
Topic Partitions | 32
Consumer Type | Kafka Streams (intermediate topic)
Deployment | StatefulSet with 8 replicas
Stream Threads | 2 per replica (16 total)
Expected Distribution | 2 partitions per thread

    Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.

    Root Cause Analysis

    Why This Happens

    Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.

    StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.

    State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.

    Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.

    Solutions

    1. Check for Static Group Membership

    If you are using static group membership, the broker remembers the assignment even after pod restart.

    # Check if this is set in your Kafka Streams config

    group.instance.id=<some-static-id>

    Fix: Remove it entirely or make it dynamic.

    2. Proper Scale Down/Up with Timeout Wait

The key is waiting for session.timeout.ms to expire (10 seconds by default in Kafka 2.x; raised to 45 seconds in Kafka 3.0).

kubectl scale statefulset <statefulset-name> --replicas=0

    sleep 60

kubectl scale statefulset <statefulset-name> --replicas=8

    3. Delete the Consumer Group

    ⚠️ Warning: Only do this when ALL consumers are stopped.

    # Scale down to 0

kubectl scale statefulset <statefulset-name> --replicas=0

    # Verify no active members

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members

    # Delete the consumer group

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete

    # Scale back up

kubectl scale statefulset <statefulset-name> --replicas=8

    4. Reset Consumer Group Offsets

    Resets assignments while preserving current offsets:

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute

    5. Force New Client IDs

    Modify your StatefulSet to include a random/timestamp suffix in client ID.
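
One hedged way to do this from inside the application, assuming the pod name is available via the HOSTNAME environment variable (the application name and suffix scheme below are illustrative):

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsClientIdConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // keep this stable

        // Derive client.id from the pod name plus a random suffix so a restarted pod
        // is not treated as the "same" consumer returning with its old identity.
        String podName = System.getenv().getOrDefault("HOSTNAME", "local");
        String suffix = UUID.randomUUID().toString().substring(0, 8);
        props.put(StreamsConfig.CLIENT_ID_CONFIG, "my-streams-app-" + podName + "-" + suffix);
        return props;
    }
}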

    6. Change Application ID (Nuclear Option)

    Creates a completely new consumer group:

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");

    ⚠️ Warning: This will create a new consumer group and reprocess from the beginning.

    7. Enable Cooperative Rebalancing (Kafka 2.4+)

Kafka Streams 2.4 and later use the cooperative (incremental) rebalancing protocol by default, which moves partitions gradually instead of revoking everything on each rebalance. If you are upgrading a running application from 2.3 or earlier, perform one rolling bounce with upgrade.from set to the old version, then a second rolling bounce with the setting removed:

props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3"); // only for the first rolling bounce; remove it afterwards

    8. Tune Partition Assignment

Adjust these configurations for better distribution (acceptable.recovery.lag and probing.rebalance.interval.ms require Kafka Streams 2.6 or later):

props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);        // max state-store lag at which a client is still considered caught up
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);                // keep a warm standby of each state store on another instance
props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L); // probe for a better assignment every 10 minutes

    Diagnostic Commands

    Check Current Consumer Group Status

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe

    Check Member Assignments (Verbose)

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose

    Monitor Lag

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -t" " -k5 -n -r

    Recommended Fix Sequence

1. Check current state with --describe --members --verbose

2. Scale down completely: kubectl scale statefulset <name> --replicas=0

    3. Wait for session timeout (60+ seconds): sleep 90

    4. Verify group is empty

5. Delete the consumer group (if it still exists)

6. Scale back up: kubectl scale statefulset <name> --replicas=8

    7. Verify new distribution after 30 seconds

    Prevention (Long-term Fixes)

    • Do not use static group membership unless you have a specific need
    • Use cooperative rebalancing if on Kafka 2.4+
    • Monitor partition assignment regularly
    • Set appropriate max.poll.interval.ms to detect slow consumers
    • Use standby replicas for stateful applications
    • Ensure partition count is divisible by expected consumer count

    Related Configurations

Configuration | Default | Description
session.timeout.ms | 10000 (45000 from Kafka 3.0) | Time before the broker considers a consumer dead
heartbeat.interval.ms | 3000 | Frequency of heartbeats to the broker
max.poll.interval.ms | 300000 | Max time between poll() calls
group.instance.id | null | Static membership identifier
num.standby.replicas | 0 | Number of standby replicas for state stores
acceptable.recovery.lag | 10000 | Max lag before a replica is considered caught up

    Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”

    -Satyjeet Shukla

    AI Strategist & Solutions Architect