Tag: cloud

  • Beyond the Binary: Monoliths, Event-Driven Systems, and the Hybrid Future

    In software engineering, architectural discussions often devolve into a binary choice: the “legacy” Monolith versus the “modern” Microservices. This dichotomy is not only false but dangerous. It forces teams to choose between the operational simplicity of a single unit and the decoupled scalability of distributed systems, often ignoring a vast middle ground.

Recently, the rise of API-driven, Event-Driven Architectures (EDA) has added a third dimension, promising reactive, real-time systems. But for a technical leader or a systems architect, the question isn’t “which is best?” but “which constraints am I optimising for?”

    This article explores the trade-offs between Monolithic and Event-Driven systems and makes a case for the pragmatic middle ground: the Hybrid approach.

    1. The Monolith: Alive and Kicking

    The term “Monolith” often conjures images of unmaintainable “Big Ball of Mud” codebases. However, a well-designed Modular Monolith is a legitimate architectural choice for 90% of use cases.

    The Strengths

•   Transactional Integrity (ACID): The single biggest advantage of a monolith is the ability to run a complex business process (e.g., “Place Order”) within a single database transaction (see the sketch after this list). If any part fails, the whole operation rolls back. In distributed systems, this simple guarantee is replaced by complex Sagas or two-phase commits.
    •   Operational Simplicity: One deployment pipeline, one monitoring dashboard, one database to back up. The cognitive load on the ops team is significantly lower.
    •   Zero-Latency Communication: Function calls are orders of magnitude faster than network calls. You don’t need to worry about serialization overhead, network partitions, or retries.
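
As a concrete illustration of the first point, here is a minimal sketch of “Place Order” running as one ACID transaction in plain JDBC. The table names, columns, and DataSource wiring are illustrative placeholders, not a prescribed implementation.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class OrderService {
    private final DataSource dataSource;

    public OrderService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Insert the order and decrement stock in ONE transaction:
    // either both changes land, or neither does.
    public void placeOrder(long customerId, String sku, int quantity) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement insertOrder = conn.prepareStatement(
                         "INSERT INTO orders (customer_id, sku, quantity) VALUES (?, ?, ?)");
                 PreparedStatement decrementStock = conn.prepareStatement(
                         "UPDATE inventory SET available = available - ? WHERE sku = ?")) {
                insertOrder.setLong(1, customerId);
                insertOrder.setString(2, sku);
                insertOrder.setInt(3, quantity);
                insertOrder.executeUpdate();

                decrementStock.setInt(1, quantity);
                decrementStock.setString(2, sku);
                decrementStock.executeUpdate();

                conn.commit(); // all-or-nothing
            } catch (SQLException e) {
                conn.rollback(); // any failure undoes the whole business operation
                throw e;
            }
        }
    }
}

The equivalent flow in a distributed system would need a Saga with compensating actions for every step.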

The Limits

The monolith hits a wall when team scale outpaces code modularity. When 50 developers are merging into the same repo, merge conflicts and slow CI/CD pipelines become the bottleneck.

2. API-Driven, Event-Driven Architectures

    In this model, services don’t just “call” each other via HTTP; they emit “events” (facts about what just happened) to a broker (Kafka, RabbitMQ, EventBridge). Other services subscribe to these events and react.
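
As a hedged sketch of what “emitting an event” looks like in code, here is a service publishing an OrderPlaced fact with the plain Kafka producer API. The broker address, topic name, and JSON payload are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderPlacedPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The OrderService records a fact about what happened and moves on;
        // it neither knows nor cares who consumes it.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"type\":\"OrderPlaced\",\"orderId\":\"42\",\"total\":99.50}";
            producer.send(new ProducerRecord<>("orders.events", "42", event));
        }
    }
}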

    The Strengths

    •   True Decoupling: The OrderService doesn’t know the EmailService exists. It just screams “OrderPlaced” into the void. This allows you to plug in new functionality (e.g., a “FraudDetection” service) without touching the core flow.
    •   Asynchronous Resilience: If the InventoryService is down, the OrderService can still accept orders. The events will just sit in the queue until the consumer recovers.
    •   Scale Asymmetry: An image processing service might need 100x more CPU than the user profile service. You can scale them independently without over-provisioning the rest of the system.

    The Tax

The cost of this power is complexity. You now live in a world of eventual consistency. A user might place an order but not see it in their history for 2 seconds. Debugging a flow that hops across 5 services via asynchronous message queues requires sophisticated observability (Distributed Tracing) and mature DevOps practices.

    3. The Hybrid Approach: The “Citadel” and Modular Monoliths

It is rarely an all-or-nothing decision. The most successful systems often employ a hybrid strategy, variously described as the Citadel Pattern or reached incrementally via the Strangler Fig pattern.

    Pattern A: The Modular Monolith (Internal EDA)

    You build a single deployable unit, but internally, you enforce strict boundaries.

•   Internal Events: Instead of Module A calling Module B’s class directly, you can use an in-memory event bus (see the sketch after this list). When a user registers, the User Module publishes a domain event. The Notification Module subscribes to it.
    •   Why?: This gives you the decoupling benefits of EDA (code isolation) without the operational tax of distributed systems (network failures, serialization).
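
A minimal sketch of that idea, assuming a hand-rolled bus (in practice you might reach for Spring application events or Guava’s EventBus instead); the class and event names are illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A tiny in-process event bus: modules communicate through domain events
// instead of calling each other's classes directly.
final class EventBus {
    private final Map<Class<?>, List<Consumer<Object>>> subscribers = new HashMap<>();

    <T> void subscribe(Class<T> eventType, Consumer<T> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>())
                   .add(event -> handler.accept(eventType.cast(event)));
    }

    void publish(Object event) {
        subscribers.getOrDefault(event.getClass(), List.of())
                   .forEach(handler -> handler.accept(event));
    }
}

// Domain event published by the User Module.
record UserRegistered(String userId, String email) {}

public class ModularMonolithDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // The Notification Module subscribes without the User Module knowing it exists.
        bus.subscribe(UserRegistered.class,
                e -> System.out.println("Sending welcome email to " + e.email()));

        // The User Module publishes the fact; still one process, one deployment, no network hop.
        bus.publish(new UserRegistered("u-123", "alice@example.com"));
    }
}

If the Notification Module later becomes a genuine bottleneck, the same event can be republished to an external broker without touching the User Module.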

    Pattern B: The Citadel (Monolith + Satellites)

    Keep your core, complex business domain (e.g., the billing engine or policy ledger) in a Monolith. This domain likely benefits from ACID transactions and complex data joins.

•   Offload peripheral concerns and highly volatile, high-scale workloads to satellite microservices.
    •   Example: A core Banking Monolith handles the ledger. However, the “PDF Statement Generation” is an external microservice because it is CPU intensive and stateless. The “Mobile API Adapter” is a separate service to allow for rapid iteration on UI needs without risking the core bank.

    4. The Cost Dimension: Infrastructure & People

    Cost is often the silent killer in architectural decisions. It’s not just about the AWS bill; it’s about the Total Cost of Ownership (TCO).

    Infrastructure Costs

•   Monolith: Generally cheaper at low-to-medium scale. You pay for fixed compute (e.g., 2 EC2 instances) and save on data transfer costs because communication is in-memory. However, scaling is inefficient: if one module needs more RAM, you have to upgrade the entire server.
•   Event-Driven/Microservices: The “Cloud Tax” is real. You pay for:
    •   Managed Services: Kafka (MSK) or RabbitMQ clusters are expensive to run yourself and expensive to rent as managed services.
    •   Data Transfer: Every event crossing an Availability Zone (AZ) or Region boundary incurs a cost.
    •   Base Overhead: Running 50 containers requires more base CPU/RAM overhead than running 1 container with 50 modules.
•   Savings: Distributed systems only save money at massive scale, where granular scaling (spinning up 1,000 tiny instances for just the billing service) outweighs the overhead tax.

    Organizational Costs (Engineering Salary)

    •   Monolith: Lower. Generalist developers can contribute easily. Operations require fewer specialists.
    •   Event-Driven: Higher. You need strict platform engineering, SREs to manage the service mesh/brokers, and developers who understand distributed tracing and idempotency.

    Decision Framework: When to Prefer Which?

    Don’t follow the hype. Follow the constraints.

Constraint | Prefer Monolith | Prefer Event-Driven/Microservices
Team Size | Small (< 20 engineers), tight communication. | Large, multiple independent squads (2-pizza teams).
Domain Complexity | High complexity, deep coupling, needs strict consistency. | Clearly defined sub-domains (e.g., Shipping is distinct from Billing).
Traffic Patterns | Uniform scale requirement. | Asymmetrical scale (one feature needs massive scale).
Consistency | Strong (ACID) is non-negotiable. | Eventual consistency is acceptable.
Cost Sensitivity | Bootstrapped/Low Budget. Optimizes for low operational overhead. | High Budget/Enterprise. Willing to pay premium for high availability and granular scale.

    Conclusion

    Hybrid approaches allow you to “architect for the team you have, not the team you want.” Start with a Modular Monolith. Use internal events to decouple your code. Only when a specific module needs independent scaling or has a distinct release cycle should you carve it out into a separate service.

    By treating architecture as a dial rather than a switch, you avoid the complexity tax until you actually need the power it buys you.

    -Satyjeet Shukla

    AI Strategist & Solutions Architect

  • Kafka Streams Rebalance Troubleshooting

    Confluent Kafka 2.x

    Problem Statement

Component | Configuration
Topic Partitions | 32
Consumer Type | Kafka Streams (intermediate topic)
Deployment | StatefulSet with 8 replicas
Stream Threads | 2 per replica (16 total)
Expected Distribution | 2 partitions per thread

    Issue: 10 partitions with lag are all assigned to a single client while 7 other clients sit idle. Deleting pods or scaling down doesn’t trigger proper rebalancing—the same pod keeps picking up the load.

    Root Cause Analysis

    Why This Happens

    Sticky Partition Assignor: Kafka Streams uses StreamsPartitionAssignor which is sticky by design. It tries to maintain partition assignments across rebalances to minimize state migration.

    StatefulSet Predictable Naming: Pod names are predictable (app-0, app-1, etc.). The client.id remains the same after pod restart. Kafka treats it as the “same” consumer returning.

    State Store Affinity: For stateful operations, the assignor prefers keeping partitions with consumers that already have the state.

    Static Group Membership: If group.instance.id is configured, the broker remembers assignments even after pod restart.

    Solutions

    1. Check for Static Group Membership

    If you are using static group membership, the broker remembers the assignment even after pod restart.

    # Check if this is set in your Kafka Streams config

    group.instance.id=<some-static-id>

    Fix: Remove it entirely or make it dynamic.

    2. Proper Scale Down/Up with Timeout Wait

The key is waiting for session.timeout.ms to expire (10 seconds by default in Kafka 2.x; raised to 45 seconds in Kafka 3.0).

kubectl scale statefulset <statefulset-name> --replicas=0

    sleep 60

kubectl scale statefulset <statefulset-name> --replicas=8

    3. Delete the Consumer Group

    ⚠️ Warning: Only do this when ALL consumers are stopped.

    # Scale down to 0

kubectl scale statefulset <statefulset-name> --replicas=0

    # Verify no active members

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members

    # Delete the consumer group

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --delete

    # Scale back up

kubectl scale statefulset <statefulset-name> --replicas=8

    4. Reset Consumer Group Offsets

    Resets assignments while preserving current offsets:

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --reset-offsets --to-current --all-topics --execute

    5. Force New Client IDs

    Modify your StatefulSet to include a random/timestamp suffix in client ID.
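
One hedged way to do this from inside the application, assuming the pod name is available via the HOSTNAME environment variable (the application name and suffix scheme below are illustrative):

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsClientIdConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // keep this stable

        // Derive client.id from the pod name plus a random suffix so a restarted pod
        // is not treated as the "same" consumer returning with its old identity.
        String podName = System.getenv().getOrDefault("HOSTNAME", "local");
        String suffix = UUID.randomUUID().toString().substring(0, 8);
        props.put(StreamsConfig.CLIENT_ID_CONFIG, "my-streams-app-" + podName + "-" + suffix);
        return props;
    }
}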

    6. Change Application ID (Nuclear Option)

    Creates a completely new consumer group:

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app-v2");

    ⚠️ Warning: This will create a new consumer group and reprocess from the beginning.

    7. Enable Cooperative Rebalancing (Kafka 2.4+)

Kafka Streams 2.4 and later use the cooperative (incremental) rebalancing protocol by default, which moves partitions gradually instead of revoking everything on each rebalance. If you are upgrading a running application from 2.3 or earlier, perform one rolling bounce with upgrade.from set to the old version, then a second rolling bounce with the setting removed:

props.put(StreamsConfig.UPGRADE_FROM_CONFIG, "2.3"); // only for the first rolling bounce; remove it afterwards

    8. Tune Partition Assignment

Adjust these configurations for better distribution (acceptable.recovery.lag and probing.rebalance.interval.ms require Kafka Streams 2.6 or later):

props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10000L);        // max state-store lag at which a client is still considered caught up
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);                // keep a warm standby of each state store on another instance
props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600000L); // probe for a better assignment every 10 minutes

    Diagnostic Commands

    Check Current Consumer Group Status

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe

    Check Member Assignments (Verbose)

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe --members --verbose

    Monitor Lag

kafka-consumer-groups --bootstrap-server <broker:port> --group <application.id> --describe | grep -v "^$" | sort -t" " -k5 -n -r

    Recommended Fix Sequence

1. Check current state with --describe --members --verbose

2. Scale down completely: kubectl scale statefulset <name> --replicas=0

    3. Wait for session timeout (60+ seconds): sleep 90

    4. Verify group is empty

5. Delete the consumer group (if it still exists)

6. Scale back up: kubectl scale statefulset <name> --replicas=8

    7. Verify new distribution after 30 seconds

    Prevention (Long-term Fixes)

    • Do not use static group membership unless you have a specific need
    • Use cooperative rebalancing if on Kafka 2.4+
    • Monitor partition assignment regularly
    • Set appropriate max.poll.interval.ms to detect slow consumers
    • Use standby replicas for stateful applications
    • Ensure partition count is divisible by expected consumer count

    Related Configurations

Configuration | Default | Description
session.timeout.ms | 10000 (45000 from Kafka 3.0) | Time before the broker considers a consumer dead
heartbeat.interval.ms | 3000 | Frequency of heartbeats to the broker
max.poll.interval.ms | 300000 | Max time between poll() calls
group.instance.id | null | Static membership identifier
num.standby.replicas | 0 | Number of standby replicas for state stores
acceptable.recovery.lag | 10000 | Max lag before a replica is considered caught up

    Note: “Recently, I helped troubleshoot a specific Kafka issue where partitions were ‘sticking’ to a single client. After sharing a guide with the individual who reported it, I realized this knowledge would be beneficial for the wider community. Here are the steps to resolve it.”

    -Satyjeet Shukla

    AI Strategist & Solutions Architect