Home / System Design / Case Studies

Case Studies

Seven real-world system designs walked end-to-end — from requirements to architecture to the component decisions that separate a passing answer from a strong one.

  • Covers live streaming, inventory, payments, checkout, video streaming, fraud detection, and notifications
  • Every case study includes an architecture diagram, trade-offs table, interviewer Q&A, and a Senior vs Staff comparison
  • The Master Reference at the end has the HLD interview formula, consistency decision rules, and key design patterns in quick-scan tables

Case Study 1: Live Comment Streaming

The problem: Millions of users watching a live video need to receive comments in real time. Two properties make this hard at scale:

  • Read traffic is massive — millions of viewers receiving comments
  • Write traffic is small — far fewer people actually posting

So we separate them. A dedicated layer of Realtime Messaging Servers (RMS) handles pushing comments to viewers, letting us scale reads and writes independently.


The Players

NGINX — Load Balancer
Sits at the front. Routes each user connection to an RMS. Once the connection is handed off, NGINX is out of the picture — it never touches comments.

Realtime Messaging Server (RMS)
Where all the logic lives. Each RMS:

  • Accepts user connections
  • Maintains a local in-memory map of who is watching what
  • Subscribes to a message backend (Redis or Kafka) to receive new comments
  • Pushes comments to the right users over open persistent connections

Redis / Kafka — Message Backend
Carries comments from the comment service to the RMS servers. Choice of backend depends on the approach (see below).


SSE vs WebSocket for Live Comments

The user's browser opens a persistent connection to the RMS. The RMS holds it open and writes to it whenever a new comment arrives. Two technologies can do this:

 | SSE | WebSocket
Direction | Server → client only | Full duplex
Auto reconnect | ✅ Built in | ❌ Write it yourself
Catch-up on reconnect | ✅ Last-Event-ID built in | ❌ Build it yourself
Setup complexity | Low — plain HTTP | Higher — protocol upgrade
Best for | Live feeds, notifications | True bidirectional chat

For live comments, SSE is the right choice. Viewers only receive — posting a comment is a separate regular HTTP POST. No need for the complexity of WebSockets.


How the RMS Pushes Comments Back

The RMS does not dial back to the user's IP. The client always initiates the connection. The RMS holds that open connection as an object and writes to it when a comment arrives:

// When a user connects and says "watching video3"
SseEmitter emitter = new SseEmitter();
emitterMap.computeIfAbsent("video3", k -> new CopyOnWriteArrayList<>()).add(emitter);

// When a comment arrives for video3
for (SseEmitter e : emitterMap.getOrDefault("video3", List.of())) {
    try {
        e.send(comment);            // throws IOException if the connection is gone
    } catch (IOException ex) {
        e.completeWithError(ex);    // triggers the cleanup callbacks
    }
}

Note on X-Forwarded-For: NGINX passes the real client IP via this header, useful for rate limiting and logging. But for pushing messages back, no IP is needed — the RMS already holds the connection object.

X-Forwarded-For: 203.0.113.5, 10.0.0.1
                 ↑ real user   ↑ proxy

Handling Dropped Connections and Catch-Up

Connections drop — bad WiFi, phone going to sleep, network switches. The RMS detects the drop and cleans up:

emitter.onCompletion(() -> emitterMap.get("video3").remove(emitter));
emitter.onTimeout(()    -> emitterMap.get("video3").remove(emitter));
emitter.onError(e       -> emitterMap.get("video3").remove(emitter));

Reconnect: SSE EventSource reconnects automatically. No client-side code needed.

Catch-up on missed comments: SSE assigns an ID to every message:

id: 1001
data: {"comment": "hello"}

id: 1002
data: {"comment": "great video"}

When the browser reconnects, it automatically sends Last-Event-ID: 1002. The RMS fetches all comments after ID 1002 from the database, sends them first, then resumes the live feed.
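
A minimal sketch of that replay-then-live step (Python; the in-memory comment list and function names are illustrative stand-ins for the real storage layer):

```python
# In-memory stand-in for the comment store; in production this read would
# hit the DB or a shared Redis cache.
COMMENTS = [
    {"id": 1001, "comment": "hello"},
    {"id": 1002, "comment": "great video"},
    {"id": 1003, "comment": "posted while you were away"},
]

def fetch_after(last_event_id):
    """Comments with an ID strictly greater than the client's Last-Event-ID."""
    return [c for c in COMMENTS if c["id"] > last_event_id]

def replay_then_live(last_event_id, live_feed):
    """Send missed comments first, then resume the live stream."""
    yield from fetch_after(last_event_id)
    yield from live_feed

# Browser reconnects with Last-Event-ID: 1002; one new live comment arrives
live = iter([{"id": 1004, "comment": "back to live"}])
sent = [c["id"] for c in replay_then_live(1002, live)]
```

The client sees comment 1003 before 1004 — the gap is filled in order before the live feed resumes.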

HTTP catch-up as an alternative: For more control, the client can track the last comment ID locally (localStorage on web, app storage on mobile) and explicitly request catch-up via HTTP:

GET /comments/{videoId}?since={lastCommentId}&limit=100

This lets the client control the UX — instead of dumping missed comments instantly, it can animate them smoothly at 2-3x speed or show "You missed 47 comments" with an option to jump straight to live.

Replay limits: Set a practical cap — replaying the last 5 minutes is reasonable. For longer gaps, degrade gracefully: show "You were away for a while — jump to live?" rather than flooding the user with thousands of stale comments.

Cross-server catch-up: When a user reconnects, load balancing may route them to a different RMS than before. That new server has no local state for this user. Solution: keep recent comments for each video in a shared Redis cache (e.g. comments:video3 as a sorted set by ID). Any RMS can replay recent comments for any video without hitting the DB. Redis TTL of 10-15 minutes covers the practical reconnect window.

Deduplication: If comments arrive via SSE while an HTTP catch-up response is still in flight, the client may receive the same comment twice. The client must deduplicate by comment ID before rendering — merge the two streams and discard duplicates.
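
The merge itself is a few lines; a sketch (Python, names illustrative):

```python
def merge_dedup(catch_up, live):
    """Merge the HTTP catch-up response with comments that arrived over SSE
    while it was in flight, keeping each comment ID at most once."""
    seen, merged = set(), []
    for comment in list(catch_up) + list(live):
        if comment["id"] not in seen:
            seen.add(comment["id"])
            merged.append(comment)
    return merged

catch_up = [{"id": 7, "comment": "a"}, {"id": 8, "comment": "b"}]
live     = [{"id": 8, "comment": "b"}, {"id": 9, "comment": "c"}]  # 8 arrived on both paths
rendered = [c["id"] for c in merge_dedup(catch_up, live)]
```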

Mobile proactive disconnect: Rather than waiting for the OS to kill the SSE connection when the app backgrounds, the client can proactively disconnect when it detects an onPause / willResignActive event and record the last comment ID. On foreground, reconnect and catch up. More efficient than maintaining a connection the OS will throttle anyway.

SSE Last-Event-ID HTTP ?since= catch-up
Automatic ✅ Browser handles it ❌ Client code needed
UX control ❌ Dumps missed comments ✅ Can animate or summarize
Works across servers Needs shared Redis/DB Needs shared Redis/DB

The Full Flow

User B opens browser
        │
        ▼
NGINX routes to an RMS (based on videoId)
        │
        ▼
RMS accepts SSE connection from User B
        │
        ├── adds to local map: { video3: [userB] }
        └── subscribes to message backend for video3 comments
                          │
                          ▼
              Someone posts a comment on video3
                          │
                          ▼
              Comment Service publishes to Redis / Kafka
                          │
                          ▼
              RMS receives comment → checks local map
              → pushes to User B via open SSE connection

The Scaling Problem

At small scale, one RMS handles everything fine. At large scale (millions of viewers, many RMS servers):

When a comment is posted, how does the right RMS know about it?

This is the pub/sub problem. Three approaches solve it with different tradeoffs.


Approach 1 — Naive Pub/Sub

Every RMS subscribes to a single shared channel from the start. When any comment is posted, every RMS receives it, checks its local map, and discards it if none of its users are watching that video.

Comment on Video 3 posted
        │
        ▼
Published to one channel → received by ALL servers

Server A  has video3 viewers ✅  pushes to User B
Server B  no video3 viewers  ❌  discards
Server C  no video3 viewers  ❌  discards

Backend | Role
Kafka | One topic — all RMS servers consume every message
Redis | One channel (comments:all) — all RMS servers subscribe

Tradeoffs:

  • ✅ Simple to build — no routing logic
  • ✅ Easy to scale — just subscribe to the same channel
  • ❌ Wasteful — every server processes every comment regardless of relevance
  • ❌ Doesn't scale at Facebook/YouTube traffic levels

Use when: Early prototype or platforms with very few concurrent live videos.


Approach 2 — Smarter Subscriptions + Co-location

Videos are bucketed into N channels using a hash:

channel = hash(videoId) % N

hash("video3") % 100 = channel_7
hash("video9") % 100 = channel_7   ← same channel, different video
hash("video5") % 100 = channel_42

When User B connects and says "watching Video 3":

  1. RMS computes hash(video3) % N → channel_7
  2. RMS subscribes to channel_7 in Redis (if not already subscribed)
  3. RMS adds User B to its local map

Only servers subscribed to channel_7 receive those comments.
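
The bucketing can be sketched as follows (Python; MD5 stands in for whatever stable hash the real system uses — Python's built-in hash() is salted per process and would give different channels on different servers):

```python
import hashlib

N = 100  # number of channels

def channel_for(video_id: str) -> str:
    # Stable digest so every RMS maps the same video to the same channel
    h = int(hashlib.md5(video_id.encode()).hexdigest(), 16)
    return f"channel_{h % N}"

# Any two servers agree on the channel for a given video, so a comment
# published to channel_for("video3") reaches exactly its subscribers.
ch = channel_for("video3")
```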

The co-location problem: With round-robin load balancing, a single RMS could end up with viewers watching hundreds of different videos, forcing it to subscribe to hundreds of channels — killing the efficiency gain.

Fix — consistent hashing on NGINX: Route all viewers of the same video to the same RMS. Everyone watching Video 3 always lands on Server A.

Everyone watching Video 3 → always Server A → subscribes to channel_7
Everyone watching Video 9 → always Server B → subscribes to channel_7
Everyone watching Video 5 → always Server C → subscribes to channel_42

Backend | Fit | Why
Redis | ✅ Best fit | Subscribe/unsubscribe instantly, low latency, handles users switching videos cleanly
Kafka | ⚠️ Painful | Consumer group rebalancing is slow — not designed for dynamic subscriptions

Redis fire-and-forget is fine here because comments are persisted to the DB the moment they're posted. Anything missed during a brief disconnect is recovered via Last-Event-ID catch-up.

Tradeoffs:

  • ✅ Much more efficient — servers only receive relevant comments
  • ✅ Redis handles dynamic subscriptions cleanly
  • ✅ Dropped Redis messages are fine — DB has them, catch-up handles the rest
  • ❌ NGINX needs consistent hashing configured, not just round-robin
  • ❌ Hash collisions mean servers occasionally receive some irrelevant comments
  • ❌ Kafka is a poor fit due to slow rebalancing

Use when: Medium to large scale. Redis is the right backbone. Best answer in an interview.


Approach 3 — Dispatcher Service

Instead of RMS servers subscribing and pulling, a Dispatcher Service pushes comments directly to only the servers that need them.

When User B connects and says "watching Video 3", the RMS registers with the Dispatcher: "I have a viewer for Video 3." The Dispatcher maintains a live routing map:

{
  "video3": [Server A, Server C],
  "video9": [Server B]
}

When a comment on Video 3 is posted:

Comment Service → asks Dispatcher "who has Video 3 viewers?"
        │
        ▼
Dispatcher looks up map → Server A, Server C
        │
        ▼
Pushes comment ONLY to Server A and Server C
        │
        ▼
Server A pushes to User B via SSE

The map stays accurate via heartbeats — RMS servers ping the Dispatcher regularly to confirm which videos their users are watching. Multiple Dispatcher instances run in parallel, sharing the same routing map via Zookeeper or etcd.
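
A minimal sketch of such a routing map with heartbeat expiry (Python; the TTL, method names, and in-memory map are assumptions for illustration — the real map would live in Zookeeper or etcd):

```python
import time

class Dispatcher:
    """Routing map from video ID to the RMS servers with viewers for it.
    Entries expire unless refreshed by heartbeats."""

    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.routes = {}  # video_id -> {server: last_heartbeat_time}

    def heartbeat(self, server, video_ids):
        """RMS pings regularly to confirm which videos its users are watching."""
        now = self.clock()
        for vid in video_ids:
            self.routes.setdefault(vid, {})[server] = now

    def servers_for(self, video_id):
        """Only servers with a fresh heartbeat receive this video's comments."""
        now = self.clock()
        alive = {s: t for s, t in self.routes.get(video_id, {}).items()
                 if now - t < self.ttl}
        self.routes[video_id] = alive
        return sorted(alive)

d = Dispatcher(ttl_seconds=30, clock=lambda: 100.0)  # fixed clock for the example
d.heartbeat("serverA", ["video3"])
d.heartbeat("serverC", ["video3"])
d.heartbeat("serverB", ["video9"])
targets = d.servers_for("video3")   # only A and C get video3 comments
```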

Tradeoffs:

  • ✅ Zero wasted messages — only the right servers receive each comment
  • ✅ No subscription management on the RMS side
  • ✅ Centralized routing — easy to add smart rules like load-based routing
  • ❌ Dispatcher is a new component that must be highly available
  • ❌ Keeping the map consistent during sudden traffic spikes is hard
  • ❌ More infrastructure — Zookeeper or etcd on top of everything else

Use when: Massive scale with strict efficiency requirements. Mention as an advanced option if the interviewer pushes.


Approach Comparison

 | Approach 1 | Approach 2 | Approach 3
RMS subscribes to | Everything | Only relevant channels | Nothing — Dispatcher pushes
Wasted messages | ❌ Tons | ⚠️ Some (hash collisions) | ✅ Zero
Best backend | Kafka or Redis | Redis | Zookeeper / etcd
Kafka works? | ✅ Yes | ⚠️ Slow rebalancing | ⚠️ Buffer only
Complexity | Low | Medium | High
Use when | Prototype | Medium–large scale | Massive scale

Kafka vs Redis for Live Comments

 | Kafka | Redis Pub/Sub
Durability | ✅ Persisted to disk | ❌ Fire-and-forget
Latency | ⚠️ Higher | ✅ Very low
Dynamic subscriptions | ❌ Slow rebalancing | ✅ Instant
Best for | Durable event streams, audit logs | Real-time fan-out, live messaging
Fits live comments? | ⚠️ Only with stable subscriptions | ✅ Yes

Interview Recommendation

Go with Approach 2 using Redis. It's efficient, handles users switching videos cleanly, and is straightforward to explain. Mention Approach 3 only if the interviewer asks how you'd handle Facebook or YouTube scale.


Case Study 2: Inventory Management System

The problem: Track stock levels across thousands of stores and hundreds of millions of SKUs. Prevent overselling at checkout while keeping catalog display fast under massive read traffic.


Requirements

Functional:

  • Track stock per SKU per store and warehouse
  • Update inventory on: purchase, return, restock, damage/write-off
  • Query: "Is SKU X available at store Y?" and "How many units across region Z?"
  • Reserve inventory at checkout, confirm on payment, release on timeout

Non-functional:

  • Consistency: Strong at checkout (no overselling), eventual for catalog display
  • Scale: 10,000+ stores, 100M+ SKUs, billions of transactions/day
  • Availability: 99.99%
  • Latency: <50ms availability check, <200ms reservation

Clarifying questions to ask:

  • "Is the source of truth per-store or globally aggregated?"
  • "Do we need real-time sync across stores or is eventual ok for display?"
  • "What's the SLA for reservation expiry?" — drives lock design

Architecture

                         ┌─────────────────────────────────────────────┐
                         │              API Gateway                     │
                         └───────────────────┬─────────────────────────┘
                                             │
              ┌──────────────────────────────┼──────────────────────────────┐
              │                             │                              │
    ┌─────────▼────────┐       ┌────────────▼────────┐       ┌────────────▼──────────┐
    │  Inventory Read  │       │  Inventory Write     │       │  Reservation Service  │
    │  Service         │       │  Service             │       │                       │
    │ (high read QPS)  │       │ (strong consistency) │       │  (checkout flow)      │
    └─────────┬────────┘       └────────────┬────────┘       └────────────┬──────────┘
              │                             │                              │
    ┌─────────▼────────┐       ┌────────────▼────────┐       ┌────────────▼──────────┐
    │  Redis Cache      │       │  Inventory DB        │       │  Reservation DB       │
    │  (hot SKUs)       │       │  (sharded by         │       │  (TTL per session)    │
    │                  │       │   store_id)           │       │                       │
    └──────────────────┘       └────────────┬────────┘       └───────────────────────┘
                                            │
                               ┌────────────▼────────┐
                               │  Kafka Event Bus     │
                               │  topic: inv.updated  │
                               └────────────┬────────┘
                         ┌──────────────────┴────────────────┐
              ┌──────────▼──────────┐             ┌──────────▼──────────┐
              │  Search Index Sync  │             │  Analytics Pipeline  │
              │  (Elasticsearch)    │             │  (replenishment,     │
              └─────────────────────┘             │   forecasting)       │
                                                  └─────────────────────┘

Component Design

Inventory DB Schema

-- Sharded by store_id
CREATE TABLE inventory (
    store_id       VARCHAR(36)    NOT NULL,
    sku_id         VARCHAR(36)    NOT NULL,
    quantity       INT            NOT NULL DEFAULT 0,
    reserved_qty   INT            NOT NULL DEFAULT 0,
    version        BIGINT         NOT NULL DEFAULT 0,   -- optimistic locking
    updated_at     TIMESTAMP      NOT NULL,
    PRIMARY KEY (store_id, sku_id)
);

CREATE TABLE reservation (
    reservation_id  VARCHAR(36)   PRIMARY KEY,
    store_id        VARCHAR(36)   NOT NULL,
    sku_id          VARCHAR(36)   NOT NULL,
    quantity        INT           NOT NULL,
    session_id      VARCHAR(36)   NOT NULL,
    expires_at      TIMESTAMP     NOT NULL,
    status          ENUM('ACTIVE','CONFIRMED','EXPIRED','RELEASED')
);

Reservation Service

function reserve(storeId, skuId, qty, sessionId):
    for attempt in range(maxRetries = 3):
        try:
            return attemptReservation(storeId, skuId, qty, sessionId)
        catch OptimisticLockConflict:
            if attempt == maxRetries - 1:
                throw InventoryConflict("Failed after max retries")
            sleep(50ms * (attempt + 1))

function attemptReservation(storeId, skuId, qty, sessionId):
    inv = db.findInventory(storeId, skuId)
    available = inv.quantity - inv.reservedQty
    if available < qty:
        throw InsufficientStock("Only " + available + " available")

    inv.reservedQty += qty
    inv.version += 1
    db.saveWithVersionCheck(inv)   // throws OptimisticLockConflict if version changed

    reservation = { id: newUUID(), storeId, skuId, qty, sessionId,
                    expiresAt: now() + 10min, status: ACTIVE }
    db.save(reservation)
    kafka.send("inventory.reserved", { storeId, skuId, qty })
    return reservation

function confirmReservation(reservationId):
    res = db.findReservation(reservationId)
    if res.status != ACTIVE: throw IllegalState(res.status)

    inv = db.findInventory(res.storeId, res.skuId)
    inv.quantity    -= res.quantity
    inv.reservedQty -= res.quantity
    db.save(inv)

    res.status = CONFIRMED
    db.save(res)
    kafka.send("inventory.updated", { storeId: res.storeId, skuId: res.skuId, event: "SOLD" })

// Scheduled job — runs every 60 seconds
function releaseExpiredReservations():
    expired = db.findAll(status: ACTIVE, expiresAt < now())
    for res in expired:
        inv = db.findInventory(res.storeId, res.skuId)
        inv.reservedQty -= res.quantity
        db.save(inv)
        res.status = EXPIRED
        db.save(res)

Caching Strategy

// Cache-aside for catalog display — NOT used at checkout
function getAvailableQty(storeId, skuId):
    cacheKey = "inv:" + storeId + ":" + skuId
    cached = redis.get(cacheKey)
    if cached != null: return cached

    inv = db.findInventory(storeId, skuId)
    available = inv.quantity - inv.reservedQty
    redis.set(cacheKey, available, ttl: 30s)
    return available

// Called by Kafka consumer after any write
function invalidateCache(storeId, skuId):
    redis.delete("inv:" + storeId + ":" + skuId)

Trade-offs

Decision | Option A | Option B | Winner & Why
Locking at checkout | Optimistic (version field) | Pessimistic (SELECT FOR UPDATE) | Optimistic — low contention for most SKUs; pessimistic locks rows and hurts throughput at scale
Cache invalidation | TTL-based (30s) | Event-driven (on write) | Both — TTL as safety net + Kafka event to invalidate immediately on update
Consistency at checkout | Read from DB | Read from cache | DB always at checkout — stale cache = oversell
Sharding key | store_id | sku_id | store_id — checkout queries are store-scoped; avoids cross-shard joins
Reservation expiry | Scheduled job | DB TTL / triggers | Scheduled job — observable, easy to monitor; DB triggers are hidden and hard to debug
Event bus | Kafka | RabbitMQ | Kafka at scale — durable replay, ordered partitions by store_id, consumer lag visibility

Interviewer Q&A

Q: What happens if a reservation expires but the user already paid?

Payment and reservation services are separate. If payment succeeds after expiry, check at confirmReservation() whether actual stock still exists — not just reserved qty. If it does: confirm the sale. If not: trigger order cancellation and full refund. Minimize this edge case by extending TTL when payment is initiated.

Q: How do you handle 100x traffic on a peak shopping day?

Three layers: (1) Pre-warm Redis with top projected hot SKUs the night before. (2) Pre-scale read replicas and DB connection pools before traffic ramps. (3) Circuit breaker — if DB latency exceeds 200ms, serve from cache for display; still require DB at checkout. Add a virtual queue for checkout if the reservation service hits capacity.

Q: What if two users try to reserve the last item simultaneously?

Optimistic locking handles this exactly. Both read version=5, both try to write version=6. DB rejects the second write (version mismatch). Second request retries, now sees available qty is zero, throws InsufficientStock. User sees "Sorry, this item just sold out."
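
That race can be sketched directly (Python; the in-memory row and function names stand in for the DB and its conditional version-checked UPDATE):

```python
class OptimisticLockConflict(Exception): pass
class InsufficientStock(Exception): pass

# Single inventory row, last unit in stock (stand-in for the DB row)
row = {"quantity": 1, "reserved_qty": 0, "version": 5}

def save_with_version_check(db_row, expected_version, new_values):
    """Conditional write: succeeds only if nobody updated the row since we read it."""
    if db_row["version"] != expected_version:
        raise OptimisticLockConflict()
    db_row.update(new_values)

def reserve(db_row, qty, snapshot):
    available = snapshot["quantity"] - snapshot["reserved_qty"]
    if available < qty:
        raise InsufficientStock()
    save_with_version_check(db_row, snapshot["version"],
                            {"reserved_qty": snapshot["reserved_qty"] + qty,
                             "version": snapshot["version"] + 1})

# Both users read the same snapshot at version 5 ...
snap_a, snap_b = dict(row), dict(row)
reserve(row, 1, snap_a)                 # user A wins: row is now version 6
try:
    reserve(row, 1, snap_b)             # user B's write is rejected by the version check
except OptimisticLockConflict:
    retry_snapshot = dict(row)          # re-read: 0 units available now
    try:
        reserve(row, 1, retry_snapshot)
    except InsufficientStock:
        outcome = "sold out"            # user B sees "this item just sold out"
```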

Q: How would you design inventory updates for returns?

ReturnService publishes an inventory.return event to Kafka. InventoryWriteService consumes and increments quantity. Idempotency: each return has a return_id — check if already processed before incrementing. Cache invalidated immediately so the item shows as available again.


Senior vs Staff Bar

Area | Senior SWE | Staff SWE
Consistency | "Use DB at checkout" | Explains CAP trade-off, discusses CRDT for multi-region eventual consistency
Locking | Uses optimistic locking | Discusses lock contention at scale, proposes inventory partitioning to reduce hotspots
Scale | "Shard by store_id" | Discusses resharding strategy as stores grow, hotspot detection, cross-shard queries
Failure | "Reservation expires" | Proposes full saga compensation chain with dead-letter queue and ops dashboard

Case Study 3: Payment & Settlement System

The problem: Process payments reliably at scale with exactly-once semantics — a double charge or a missed charge has direct financial and legal consequences.


Requirements

Functional:

  • Accept payment (card, PayPal, digital wallet, gift card)
  • Route to appropriate payment gateway
  • Confirm or decline transaction
  • Handle refunds, partial refunds, chargebacks
  • Settlement: reconcile with banks/networks daily

Non-functional:

  • Exactly-once processing — never double-charge
  • Latency: <500ms p99 end-to-end
  • Availability: 99.999% (five nines)
  • Compliance: PCI-DSS Level 1
  • Auditability: Every state change permanently recorded

Clarifying questions to ask:

  • "Are we building the gateway or integrating with existing ones (Stripe/Adyen)?"
  • "Do we need multi-currency support?"
  • "What's the chargeback SLA?" — drives settlement design

Architecture

                           ┌────────────────────────────┐
                           │       Checkout Service      │
                           └────────────┬───────────────┘
                                        │  POST /payments  {idempotency_key, amount, token}
                           ┌────────────▼───────────────┐
                           │      Payment Service        │
                           │  - Idempotency check        │
                           │  - Validation               │
                           │  - Route to gateway         │
                           └────────────┬───────────────┘
                  ┌──────────────────────┼──────────────────────┐
                  │                      │                      │
       ┌──────────▼───────┐  ┌──────────▼───────┐  ┌──────────▼───────┐
       │  Card Gateway    │  │  PayPal Gateway   │  │  Digital Wallet  │
       │  (Visa/MC/Amex)  │  │                  │  │  Gateway         │
       └──────────┬───────┘  └──────────┬───────┘  └──────────┬───────┘
                  └──────────────────────┼──────────────────────┘
                                         │  Response: APPROVED / DECLINED
                           ┌────────────▼───────────────┐
                           │       Ledger DB             │
                           │  (append-only, immutable)   │
                           └────────────┬───────────────┘
                                        │
                           ┌────────────▼───────────────┐
                           │       Kafka                 │
                           │  topic: payment.events      │
                           └──────┬──────────┬──────────┘
                    ┌─────────────▼──┐  ┌────▼──────────────────┐
                    │  Order Service │  │  Settlement Service    │
                    │  (confirm)     │  │  (async reconciliation)│
                    └────────────────┘  └───────────────────────┘

Component Design

Ledger DB Schema

-- Append-only — NEVER UPDATE or DELETE
CREATE TABLE payment_ledger (
    ledger_id        VARCHAR(36)   PRIMARY KEY,
    payment_id       VARCHAR(36)   NOT NULL,
    idempotency_key  VARCHAR(128)  NOT NULL UNIQUE,
    event_type       ENUM('INITIATED','AUTHORIZED','CAPTURED',
                          'DECLINED','PENDING_VERIFICATION',
                          'REFUND_INITIATED','REFUNDED',
                          'CHARGEBACK') NOT NULL,
    amount           DECIMAL(12,2) NOT NULL,
    currency         VARCHAR(3)    NOT NULL DEFAULT 'USD',
    gateway          VARCHAR(50)   NOT NULL,
    gateway_ref      VARCHAR(128),
    user_id          VARCHAR(36)   NOT NULL,
    order_id         VARCHAR(36)   NOT NULL,
    metadata         JSONB,
    created_at       TIMESTAMP     NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_payment_id ON payment_ledger(payment_id);
CREATE INDEX idx_order_id   ON payment_ledger(order_id);
CREATE INDEX idx_idem_key   ON payment_ledger(idempotency_key);

Payment Service

function processPayment(request):
    // 1. Idempotency check — return cached result if already processed
    cached = idempotencyStore.get(request.idempotencyKey)
    if cached exists: return cached

    // 2. Validate: amount > 0, token present, currency supported
    validate(request)

    // 3. Write INITIATED to ledger before calling gateway — always have a record
    paymentId = newUUID()
    ledger.save({ paymentId, idempotencyKey: request.idempotencyKey,
                  eventType: INITIATED, amount: request.amount, ... })

    try:
        gateway = router.route(request.paymentMethod)
        response = gateway.authorize(request.token, request.amount, request.currency)

        if response.approved:
            ledger.save({ paymentId, eventType: AUTHORIZED, gatewayRef: response.txId })
            // Outbox: write event atomically with ledger; background worker publishes to Kafka
            outbox.save({ topic: "payment.events", payload: { paymentId, event: "AUTHORIZED" } })
            result = Result.success(paymentId, response)
        else:
            ledger.save({ paymentId, eventType: DECLINED,
                          metadata: { declineCode: response.declineCode } })
            result = Result.declined(response.declineCode)

        idempotencyStore.put(request.idempotencyKey, result, ttl: 24h)
        return result

    catch GatewayTimeout:
        // Status unknown — do NOT tell client to retry with a new idempotency key
        ledger.save({ paymentId, eventType: PENDING_VERIFICATION,
                      metadata: { reason: "gateway_timeout" } })
        throw PaymentPending(paymentId, "Do not retry with new key")

function refund(paymentId, amount, reason):
    latest = ledger.findLatest(paymentId)
    if latest.eventType not in [AUTHORIZED, CAPTURED]:
        throw IllegalState("Cannot refund in state: " + latest.eventType)

    totalRefunded = ledger.sumRefunds(paymentId)
    if totalRefunded + amount > latest.amount:
        throw IllegalArg("Refund exceeds original payment")

    gateway = router.routeByRef(latest.gatewayRef)
    gateway.refund(latest.gatewayRef, amount)

    ledger.save({ paymentId, eventType: REFUNDED, amount: -amount, metadata: { reason } })
    outbox.save({ topic: "payment.events", payload: { paymentId, event: "REFUNDED" } })

Outbox Worker

// Runs every 5 seconds — guarantees Kafka delivery even if service crashes after DB write
function processOutbox():
    pending = outbox.findUnpublished()
    for event in pending:
        try:
            kafka.send(event.topic, event.key, event.payload)
            event.published = true
            outbox.save(event)
        catch:
            // Retry next run — consumer idempotency handles dedup on the other side
            log.error("Failed to publish outbox event: " + event.id)
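
The at-least-once behavior of the worker above can be sketched with a stand-in broker (Python; FlakyKafka and the outbox list are illustrative, not a real client):

```python
class FlakyKafka:
    """Stand-in broker whose first send fails, simulating a transient outage."""
    def __init__(self):
        self.delivered = []
        self._fail_once = True
    def send(self, topic, payload):
        if self._fail_once:
            self._fail_once = False
            raise ConnectionError("broker unavailable")
        self.delivered.append((topic, payload))

# Outbox row written in the same DB transaction as the ledger entry
outbox = [{"topic": "payment.events", "payload": {"paymentId": "p1"}, "published": False}]
kafka = FlakyKafka()

def process_outbox():
    """One worker pass: publish anything unpublished; failures wait for the next run."""
    for event in outbox:
        if event["published"]:
            continue
        try:
            kafka.send(event["topic"], event["payload"])
            event["published"] = True
        except ConnectionError:
            pass  # retried on the next scheduled run

process_outbox()   # first run: broker down, event stays unpublished
process_outbox()   # second run: delivered — nothing is lost, only delayed
```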

Trade-offs

Decision | Option A | Option B | Winner & Why
Event delivery | Kafka direct publish | Outbox pattern | Outbox — crash after DB write but before Kafka publish loses the event; outbox guarantees at-least-once
Idempotency store | Redis | DB table | Redis for speed + DB as fallback; Redis TTL must exceed the max retry window
Ledger storage | Append-only SQL | Event store | Append-only SQL — simpler to operate; event sourcing adds complexity without proportional benefit for payments
Refund path | Sync (wait for gateway) | Async (queue + notify) | Async — refund SLA is hours/days; queue it, retry on failure, notify when done
Gateway timeout | Fail the payment | Mark PENDING, reconcile async | Pending — failing causes user to retry with new key, potentially double-charging
PCI scope | Store card data yourself | Tokenize via gateway | Tokenize always — never store raw card numbers; gateway handles PCI Level 1

Interviewer Q&A

Q: What if the gateway times out — did the charge go through?

Unknown. Mark as PENDING_VERIFICATION. Do NOT tell the client to retry with a new idempotency key — that could create a second charge. A reconciliation job runs every 5 minutes, queries the gateway by our reference ID to get the true status. If approved: update ledger, confirm order. If declined: update ledger, notify user. This is why we write INITIATED to the ledger before calling the gateway — we always have a record.

Q: How do you guarantee exactly-once processing?

Exactly-once = at-least-once + idempotent consumer. Idempotency key on the way in prevents re-processing. Outbox pattern + Kafka consumer idempotency on the way out ensures downstream services don't double-process even if we publish twice.
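
Consumer-side idempotency is the other half; a sketch (Python; the dedup set stands in for a persistent dedup table keyed by event ID):

```python
processed = set()         # in production: a durable dedup table, not process memory
order_confirmations = 0   # the downstream side effect we must not double-apply

def handle_payment_event(event):
    """Apply each event at most once, even if the broker delivers it twice."""
    global order_confirmations
    if event["event_id"] in processed:
        return                      # duplicate delivery: skip silently
    processed.add(event["event_id"])
    if event["type"] == "AUTHORIZED":
        order_confirmations += 1    # the actual side effect

evt = {"event_id": "evt-1", "type": "AUTHORIZED", "paymentId": "p1"}
handle_payment_event(evt)
handle_payment_event(evt)           # redelivered by the broker
# side effect applied exactly once despite two deliveries
```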

Q: How do you handle a chargeback 60 days later?

Gateway sends a webhook. ChargebackService looks up the original payment_id via gateway reference, writes a CHARGEBACK event to the append-only ledger (original charge is preserved), flags the user for fraud review, and settlement reconciliation adjusts the bank records for that day.

Q: How do you design for PCI-DSS compliance?

Four layers: (1) Tokenization — client sends card directly to gateway SDK, we only receive a token. (2) Network isolation — Payment Service in its own VPC, strict egress to approved gateway IPs only. (3) Encryption — AES-256 at rest, TLS 1.3 in transit. (4) No sensitive data in logs — mask card tokens and sensitive fields before logging.


Senior vs Staff Bar

Area | Senior SWE | Staff SWE
Idempotency | Implements idempotency key | Discusses clock skew, TTL expiry edge cases, cross-datacenter key synchronization
Consistency | Uses outbox pattern | Explains why 2PC fails at scale, proposes saga + compensating transactions
PCI | "Tokenize card data" | Designs full PCI scope reduction: network segmentation, audit logging, key rotation strategy
Failure modes | Handles timeout | Proposes reconciliation architecture, SLA-based alerting, ops runbook
Settlement | "Async settlement" | Designs full pipeline: daily batch, bank file formats (ACH/NACHA), exception handling

Case Study 4: Checkout System

The problem: Coordinate inventory reservation, order creation, and payment as a single logical transaction across three independent services — without distributed transactions.


Requirements

Functional:

  • Cart management (add/remove/update items)
  • Apply coupons, promotions, loyalty discounts
  • Select shipping or pickup option
  • Process payment and confirm order

Non-functional:

  • Must not oversell (inventory) or double-charge (payment)
  • Idempotent end-to-end — user can click Pay multiple times safely
  • <2s total checkout latency

Clarifying questions to ask:

  • "Is cart stored server-side or client-side?"
  • "Do promotions stack? What's the ordering?"
  • "Is split payment supported (gift card + credit card)?"

Architecture — Saga Pattern

Client
  │
  ▼
Cart Service (Redis, session-scoped)
  │
  └──[POST /checkout/confirm]──► Checkout Orchestrator
                                      │
                              SAGA: Sequential with Compensation
                              │
                              ├── Step 1: Reserve Inventory ────► on fail: no-op
                              ├── Step 2: Create Order      ────► on fail: release reservation
                              ├── Step 3: Process Payment   ────► on fail: cancel order +
                              │                                            release reservation
                              └── Step 4: Confirm + Notify  ────► on fail: refund payment +
                                                                           cancel order +
                                                                           release reservation

Component Design

Checkout Saga Orchestrator

function checkout(request):
    checkoutId = request.checkoutId   // server-generated idempotency key, session-scoped
    reservation = null
    order = null

    try:
        // Step 1: Reserve inventory
        reservation = inventoryService.reserve(request.storeId,
                                               request.items, checkoutId)

        // Step 2: Create order in PENDING state
        order = orderService.createOrder(request.userId, request.items,
                                         reservation.id, checkoutId)

        // Step 3: Process payment
        payment = paymentService.processPayment({
            idempotencyKey: checkoutId + ":payment",
            amount: request.totalAmount,
            token: request.paymentToken,
            orderId: order.id
        })

        if not payment.success:
            // Compensate steps 1 and 2
            orderService.cancelOrder(order.id, "PAYMENT_DECLINED")
            inventoryService.releaseReservation(reservation.id)
            return Result.declined(payment.declineReason)

        // Step 4: Confirm everything and notify
        inventoryService.confirmReservation(reservation.id)
        orderService.confirmOrder(order.id, payment.paymentId)
        notificationService.sendOrderConfirmation(request.userId, order.id)
        return Result.success(order.id)

    catch InsufficientStock as e:
        return Result.outOfStock(e.skuId)   // nothing to compensate

    catch PaymentPending as e:
        // Gateway timed out — hold, don't cancel
        orderService.markPendingVerification(order.id, e.paymentId)
        return Result.pending("Payment is being verified. We'll email you.")

    catch Exception as e:
        // Full compensation on unexpected failure
        if order != null: orderService.cancelOrder(order.id, "SYSTEM_ERROR")
        if reservation != null: inventoryService.releaseReservation(reservation.id)
        throw CheckoutException("Checkout failed, please try again")

Cart Service

function addItem(sessionId, skuId, qty):
    cart = redis.get("cart:" + sessionId) or new Cart()
    // Soft check for UX — real lock happens at checkout, not here
    available = inventoryRead.getAvailableQty(cart.storeId, skuId)
    if available < qty: throw InsufficientStock(skuId)
    cart.addItem(skuId, qty)
    redis.set("cart:" + sessionId, cart, ttl: 24h)

function getCheckoutSummary(sessionId):
    cart = redis.get("cart:" + sessionId)
    // Re-validate all items — cart may be stale
    unavailable = cart.items
        .filter(item => inventoryRead.getAvailableQty(cart.storeId, item.skuId) < item.qty)
    if unavailable not empty: throw ItemsUnavailable(unavailable)
    return buildSummary(cart)

Trade-offs

| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Coordination pattern | Orchestration (central orchestrator) | Choreography (event-driven) | Orchestration for checkout — easy to trace, clear compensation logic. Choreography suits fire-and-forget flows like notifications |
| Cart storage | Redis (session-scoped) | SQL DB (persistent) | Redis — cart is ephemeral, needs sub-ms reads, high churn. DB adds unnecessary write pressure |
| Inventory check timing | Only at checkout | At add-to-cart + checkout | Both — soft check at add-to-cart for UX; hard lock only at checkout. Never trust the cart-level check for the actual reservation |
| Checkout idempotency | Server-generated, session-scoped | User-generated UUID | Server-generated — prevents two browser tabs from double-submitting |

Interviewer Q&A

Q: User clicks "Place Order" twice in 1 second — what happens?

The checkout session generates a single checkoutId tied to the session when checkout starts. First click: saga begins. Second click: hits idempotency check in the Payment Service, returns the cached result. Both requests share the same Redis idempotency store. Result: one reservation, one charge, one order — always.
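
A minimal sketch of that guard, with a plain dict standing in for the shared Redis idempotency store (`checkout_once` and `fake_saga` are illustrative names, not from the case study):

```python
# Minimal idempotency guard: the first request runs the saga,
# a duplicate request with the same key gets the cached result.
_results = {}   # stands in for the shared Redis idempotency store

def checkout_once(checkout_id, run_saga):
    if checkout_id in _results:          # second click: short-circuit
        return _results[checkout_id]
    result = run_saga()                  # reserve -> order -> pay -> confirm
    _results[checkout_id] = result
    return result

calls = []
def fake_saga():
    calls.append(1)                      # count real saga executions
    return {"status": "success", "orderId": "ord-1"}

first = checkout_once("chk-42", fake_saga)
second = checkout_once("chk-42", fake_saga)   # duplicate click
assert first == second and len(calls) == 1    # one reservation, one charge
```

In production the check-and-set must be atomic (e.g. Redis `SET` with `NX`) so two truly concurrent clicks can't both pass the lookup before either writes.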

Q: What if the Order Service is down during checkout?

Step 2 fails. The orchestrator catches the exception and releases the inventory reservation. Payment is never called — no charge. User sees "Checkout failed, try again." A circuit breaker surfaces a user-friendly error immediately instead of waiting for timeouts.

Q: How do you handle split payment (gift card + credit card)?

The payment step processes a list of PaymentInstruments. Gift card is applied first (reduces remaining amount). Credit card handles the rest. Idempotency keys are checkoutId:payment:giftcard and checkoutId:payment:card separately. If credit card is declined after the gift card has been charged: compensation credits the gift card balance back.
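
The amount split described here can be sketched as a small helper (the function name and cent-based amounts are illustrative):

```python
def plan_split_payment(total_cents, gift_card_balance_cents):
    """Gift card is applied first; the credit card covers the remainder."""
    gift = min(total_cents, gift_card_balance_cents)
    return {"giftcard": gift, "card": total_cents - gift}

# $50.00 order, $20.00 gift card -> $30.00 goes to the credit card
assert plan_split_payment(5000, 2000) == {"giftcard": 2000, "card": 3000}
# Gift card covers the whole order -> nothing charged to the card
assert plan_split_payment(1500, 2000) == {"giftcard": 1500, "card": 0}
```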


Senior vs Staff Bar

| Area | Senior SWE | Staff SWE |
|---|---|---|
| Saga | Implements sequential saga | Discusses choreography vs orchestration trade-offs for different sub-flows within checkout |
| Idempotency | Session-level | Cross-device idempotency — same user checking out from phone and laptop simultaneously |
| Failure modes | Handles step failures | Designs DLQ strategy, compensation monitoring, alerts when compensation itself fails |
| Scale | Stateless services + load balancer | Checkout as multi-region active-active, conflict resolution, global idempotency |

Case Study 5: Video Streaming (VOD)

The problem: Upload, transcode, and stream video-on-demand to a global audience with adaptive quality and under 2 seconds to first frame.


Requirements

Functional:

  • Upload video content (studio side)
  • Transcode to multiple quality levels
  • Users browse catalog, stream videos
  • Support playback resume ("continue watching")

Non-functional:

  • <2s playback start time globally
  • Adaptive bitrate — auto-adjusts quality to network speed
  • 99.99% availability
  • Massive read-heavy workload (watch >> upload)

Clarifying questions to ask:

  • "Live streaming or VOD only?" → VOD is a different problem from live
  • "Global users?" → yes → CDN strategy becomes critical
  • "DRM (content protection) needed?" → yes for licensed content

Architecture

UPLOAD PATH:
Studio ──► Upload Service ──► Raw Video Storage (S3)
                                     │
                              Transcoding Queue (Kafka/SQS)
                                     │
                         Transcoding Workers (FFmpeg, auto-scaled)
                         Output: 360p / 720p / 1080p / 4K  (HLS segments)
                                     │
                         Processed Video Storage (S3) + CDN Origin

PLAYBACK PATH:
User ──► API Gateway ──► Video Metadata Service ──► Metadata DB
                    │         (manifest URL, title, entitlement check)
                    │
                    ▼
              CDN Edge Node ──[cache hit]──► HLS Segments → User
                    │[miss]
              CDN Origin (S3) ──► Signed URL ──► User
                    │
              [Player requests next segments using ABR — switches quality automatically]

Component Design

HLS Manifest Design

# Master manifest — returned to client first
# Lists all available quality levels and their bandwidth requirements

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=500000,RESOLUTION=640x360
360p/video.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=1280x720
720p/video.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080
1080p/video.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=10000000,RESOLUTION=3840x2160
4k/video.m3u8

# Quality-level manifest (720p/video.m3u8)
# DRM key URI embedded — player fetches decryption key from license server

#EXT-X-TARGETDURATION:6
#EXT-X-KEY:METHOD=AES-128,URI="https://drm.api/key?vid=abc123&token=XYZ"

#EXTINF:6.0,
seg_000.ts        ← 6-second chunk
#EXTINF:6.0,
seg_001.ts
...
#EXT-X-ENDLIST

Transcoding Pipeline

// Triggered when raw video upload completes (S3 event → queue → this)
function onVideoUploaded(event):
    metadataRepo.updateStatus(event.videoId, PROCESSING)

    qualities = [360p, 720p, 1080p, 4k]   // each has width, height, target bitrate
    for quality in qualities:
        job = { jobId: newUUID(), videoId: event.videoId,
                inputS3Key: event.s3Key, quality, status: QUEUED }
        jobRepo.save(job)
        queue.send("transcoding-jobs", job)    // picked up by auto-scaled workers

// Called by a worker when one quality level finishes
function onQualityTranscoded(event):
    jobRepo.markCompleted(event.jobId)
    pendingCount = jobRepo.countPending(event.videoId)
    if pendingCount == 0:
        generateAndUploadMasterManifest(event.videoId)
        metadataRepo.updateStatus(event.videoId, AVAILABLE)
        if isProjectedHighTraffic(event.videoId):
            cdnWarmingService.warmEdgeNodes(event.videoId)

Playback Service

function getPlaybackInfo(videoId, user):
    video = metadataRepo.findById(videoId)

    if not entitlement.canWatch(user, video):
        throw Forbidden("Subscription required")

    // Signed CDN URL — expires in 4 hours, prevents hotlinking
    manifestUrl = cdn.signUrl(
        "https://cdn.example.com/videos/" + videoId + "/master.m3u8",
        ttl: 4h
    )

    drmToken = licenseService.issueToken(user.id, videoId)
    resumePosition = watchHistory.getLastPosition(user.id, videoId) or 0

    return { manifestUrl, drmToken, resumePosition,
             availableQualities: video.availableQualities }

Watch History and Resume

// Client sends a heartbeat every 10 seconds with current playback position
function updatePosition(userId, videoId, positionSeconds):
    redis.set("watch:" + userId + ":" + videoId, positionSeconds, ttl: 30 days)
    kafka.send("watch.events", { userId, videoId, positionSeconds })   // async → durable DB

function getLastPosition(userId, videoId):
    return redis.get("watch:" + userId + ":" + videoId)   // null if never watched

Trade-offs

| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Video format | HLS | DASH | HLS for broadest device support (iOS requires it); support both in production if budget allows |
| Segment duration | 2-second segments | 6-second segments | 6s — fewer manifest requests, better for stable connections. Use 2s for live streaming |
| CDN strategy | Single CDN | Multi-CDN | Multi-CDN at scale — route by geography and performance, avoid single CDN outage |
| Transcoding timing | On first view (lazy) | Pre-transcoded upfront | Pre-transcoded — user experience wins; buffering causes abandonment |
| Watch history | DB only | Redis + DB async | Redis first — sub-ms reads for resume; DB for durability via Kafka |

Interviewer Q&A

Q: How does Adaptive Bitrate (ABR) actually work?

The player downloads the master manifest first — a small text file listing all quality levels and their bandwidth requirements. It measures download speed from the time taken to download the last segment. If speed drops, it requests the next segment at a lower quality. Transition is seamless — segments at all qualities have the same duration, so the player just changes which URL it fetches next. All done client-side — the CDN doesn't know or care which quality the user is watching.
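
A sketch of that client-side selection logic, using the bandwidth values from the master manifest above (the 0.8 safety margin is an assumed tuning value, not from the case study):

```python
# ABR ladder from the master manifest: (BANDWIDTH bits/s, variant URL)
LADDER = [
    (500_000,    "360p/video.m3u8"),
    (1_500_000,  "720p/video.m3u8"),
    (4_000_000,  "1080p/video.m3u8"),
    (10_000_000, "4k/video.m3u8"),
]

def pick_variant(measured_bps, safety=0.8):
    """Highest quality whose declared BANDWIDTH fits within measured
    throughput, discounted by a safety margin to absorb jitter."""
    usable = measured_bps * safety
    best = LADDER[0][1]                 # never go below the lowest rung
    for bandwidth, url in LADDER:
        if bandwidth <= usable:
            best = url
    return best

assert pick_variant(6_000_000) == "1080p/video.m3u8"   # 4.8 Mbps usable
assert pick_variant(300_000) == "360p/video.m3u8"      # floor at lowest
```

Real players also smooth the bandwidth estimate over several segments and factor in buffer occupancy, but the core decision is this simple comparison.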

Q: 50 million users hit play at the same time. How do you handle it?

CDN pre-warming: push the first 5 minutes of segments to all edge nodes 30 minutes before the event starts. Add an origin shield (mid-tier cache layer) between CDN and S3 — prevents 50M cache misses hitting S3 simultaneously. Rate limit manifest requests at the edge. Under extreme load, cap new streams at 720p max — reduces bandwidth by 60% while keeping the experience acceptable.

Q: How do you handle geo-restricted content?

Defence in depth at three layers: (1) API level — entitlement check validates user's country against content's allowed regions. (2) CDN level — geo-restriction rules return 403 for restricted countries before the request reaches the service. (3) DRM level — license token is country-scoped; DRM server re-validates at license request time.


Senior vs Staff Bar

| Area | Senior SWE | Staff SWE |
|---|---|---|
| CDN | "Use CloudFront" | Multi-CDN failover, cost optimisation, regional routing strategy |
| Transcoding | Fan-out per quality | ABR ladder optimisation, codec trade-offs (H.264 vs H.265 vs AV1), cost vs quality |
| Scale | "CDN handles it" | Origin shield design, pre-warming strategy, request coalescing on cache misses |
| DRM | "Encrypt segments" | Full DRM design: key rotation, multi-DRM support (Widevine + FairPlay) |
| Recommendations | Out of scope | Sketches recommendation pipeline: collaborative filtering, feature store, A/B testing |

Case Study 6: Fraud Detection

The problem: Score every transaction in real time for fraud risk and block or flag it before payment is processed — with a total budget of under 100ms.


Requirements

Functional:

  • Score every transaction inline with the payment flow
  • Block high-risk transactions automatically
  • Flag medium-risk for async human review
  • Learn from confirmed fraud via a feedback loop

Non-functional:

  • Latency: <100ms total (user waits for this)
  • Recall: High — better to over-flag than miss fraud
  • Precision: False positive rate <1% — wrongly blocked orders hurt UX
  • Throughput: Handle peak transaction spikes without latency degradation

Clarifying questions to ask:

  • "In-house ML or third-party vendor (Kount, Sift)?" → in-house
  • "What's the acceptable false positive rate?"
  • "Real-time model updates or daily retraining?"

Architecture

Payment Request ──► Fraud Service (inline, <100ms budget)
                         │
              ┌──────────┼─────────────────────────────────┐
              │          │                                 │
    Rule Engine       Feature Store               ML Scoring Engine
    (<5ms)            (Redis, pre-computed)        (<90ms)
    Hard rules:       30d spend, avg order,        Gradient Boosting
    velocity,         device count,                or Logistic Regression
    geo, BIN, etc.    account age, etc.
              │          │                                 │
              └──────────┴─────────────────────────────────┘
                                    │
                          Decision Engine
                          Score ≥ 0.9       → BLOCK
                          0.5 ≤ Score < 0.9 → FLAG (human review, async)
                          Score < 0.5       → ALLOW

ASYNC FEEDBACK LOOP:
Confirmed fraud events → Kafka → Feature Store update + Model retraining

Component Design

Rule Engine

HARD_RULES = [
    VelocityRule,            // card used > 5x in 10 minutes
    GeoImpossibilityRule,    // purchase in NYC then LA 30 minutes later
    HighValueNewAccountRule, // order > $1000 from account < 7 days old
    BinCountryMismatchRule,  // card BIN country != shipping country
    DeviceFingerprintRule    // device fingerprint is on blacklist
]

function evaluateRules(ctx):
    for rule in HARD_RULES:
        result = rule.evaluate(ctx)
        if result.isBlock: return RuleResult.block(result.reason)
        if result.isFlag:  return RuleResult.flag(result.reason)
    return RuleResult.pass()

// Example: velocity rule
function VelocityRule.evaluate(ctx):
    key = "velocity:card:" + ctx.cardToken
    count = redis.incr(key)
    if count == 1: redis.expire(key, 10 minutes)   // set TTL on first use
    if count > 5: return block("Card used " + count + "x in 10 min")
    if count > 3: return flag("High velocity")
    return pass()

Feature Store

// Pre-computed user features — must read in <5ms, cache only
function getFeatures(userId):
    cached = redis.get("features:user:" + userId)
    if cached: return cached

    features = featureRepo.findByUserId(userId) or defaultForNewUser()
    redis.set("features:user:" + userId, features, ttl: 1h)
    return features

// Features pre-computed by a nightly batch job:
UserFeatures = {
    avgOrderValueLast30Days,
    orderCountLast30Days,
    distinctDevicesLast30Days,
    distinctShippingAddressesLast30Days,
    chargebackRateLast90Days,
    accountAgeInDays,
    hasVerifiedPhone,
    failedPaymentAttemptsLast24h
}

Full Fraud Evaluation

BLOCK_THRESHOLD = 0.90
FLAG_THRESHOLD  = 0.50

function evaluate(ctx):
    // Step 1: Rule engine — fast, short-circuit obvious fraud (<5ms)
    ruleResult = evaluateRules(ctx)
    if ruleResult.isBlock: return Decision.block(ruleResult.reason, score: 1.0)

    // Step 2: Load pre-computed features from Redis (<5ms)
    features = featureStore.getFeatures(ctx.userId)

    // Step 3: ML inference (<90ms)
    score = mlEngine.score(ctx, features)

    // Step 4: Decision
    if score >= BLOCK_THRESHOLD:
        return Decision.block("ML score: " + score, score)

    if score >= FLAG_THRESHOLD or ruleResult.isFlag:
        reviewQueue.enqueue(ctx, score)
        return Decision.allow(score)   // allow but send to human review queue

    return Decision.allow(score)

Trade-offs

| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Fraud check placement | Inline (blocks payment) | Async (payment goes through, reviewed after) | Inline for block decisions — can't charge and then dispute. Flag decisions are async |
| ML model type | Deep learning | Gradient Boosting (XGBoost/LightGBM) | Gradient Boosting — explainable, fast inference (<10ms), excellent on tabular data |
| Feature freshness | Real-time computation | Pre-computed store (1h stale) | Pre-computed — real-time DB queries in the fraud hot path would blow the 100ms budget |
| False positive handling | Block and explain | Flag for review + allow | Flag + allow for medium risk — better to review async than lose the sale |
| Model updates | Online learning | Batch retraining (daily) | Batch daily — simpler, stable, easier to validate; online learning risks model poisoning |

Interviewer Q&A

Q: Why use a rule engine at all if you have ML?

They're complementary. Rules are deterministic, fast, and auditable — a card used in three countries in one hour doesn't need a model. Rules are also explainable to customers who dispute a declined transaction. ML catches subtle patterns rules miss — unusual combinations of signals that each look normal individually. Rules also act as a safety net if the ML model drifts or has a bug. In practice: rule engine short-circuits first to save ML inference cost on clear-cut cases; ML handles the ambiguous middle.

Q: What if fraudsters learn your thresholds and stay just under them?

Thresholds are internal and slightly randomised. ML adapts — even if fraudsters evade one rule, their overall behaviour pattern still looks anomalous across 50+ features. Dynamic thresholds adjust based on recent fraud patterns. The feedback loop from human review continuously retrains the model.

Q: How do you handle a legitimate user incorrectly blocked?

Score 0.5–0.9 is flagged for async human review, not blocked — so the transaction still goes through. For hard blocks, customer service has a verification flow (OTP/ID check → manual approval). After review clears: update the user's feature store to reduce future false positives. Track false positive rate per rule — if any rule exceeds 2%, tune the threshold.


Senior vs Staff Bar

| Area | Senior SWE | Staff SWE |
|---|---|---|
| Rule engine | Implements rules | Rule versioning, A/B testing new rules, safe rollback strategy |
| ML | "Use a model" | Model drift detection, champion/challenger deployment, explainability requirements |
| Features | Lists features | Feature leakage prevention, train/serve skew, feature monitoring in production |
| Scale | "Pre-compute features" | Stream processing pipeline (Flink), feature freshness SLAs, backfill on schema change |
| Feedback loop | "Retrain daily" | Full MLOps: labelling, validation, shadow mode deployment, gradual rollout |

Case Study 7: Notification System

The problem: Deliver transactional notifications (OTP, order confirmation) in under 5 seconds and promotional campaigns at scale across SMS, push, and email — without spamming users or losing critical messages.


Requirements

Functional:

  • Send order confirmations, shipping updates, promo campaigns
  • Channels: SMS, push notification, email
  • User preferences: opt-out per channel per notification type
  • Retry on failure, track delivery status

Non-functional:

  • Transactional (OTP, order confirm): <5s delivery
  • Promotional campaigns: best-effort, delay is acceptable
  • At-least-once delivery (idempotent send)
  • Rate limiting — no notification spam per user

Clarifying questions to ask:

  • "Do we need read receipts / delivery receipts?"
  • "Multi-language / localisation support?"
  • "What's the peak marketing campaign size?" — drives queue sizing

Architecture

Event Sources:
  Order Service ──────────────┐
  Shipping Service ───────────┤
  Marketing Campaign ─────────┤
                              ▼
                  Kafka topic: notification.requested
                  (partitioned by user_id for ordering)
                              │
                  Notification Service
                  1. Deduplication check
                  2. Load user preferences
                  3. Rate limit check
                  4. Route by channel and priority
                              │
          ┌───────────────────┼──────────────────────┐
          │                   │                      │
    SMS Worker          Push Worker           Email Worker
    (Twilio/SNS)        (FCM/APNs)            (SES/SendGrid)
          │                   │                      │
          └───────────────────┼──────────────────────┘
                              │
                  Delivery Status DB + Dead-Letter Queue

Component Design

Notification Service

function processNotification(request):
    dedupKey = request.userId + ":" + request.eventId + ":" + request.channel

    // 1. Deduplication — skip if already processed
    if dedup.exists(dedupKey): return

    // 2. User preferences — skip if opted out
    prefs = userPrefs.get(request.userId)
    if not prefs.isEnabled(request.channel, request.notificationType): return

    // 3. Rate limiting — skip for CRITICAL priority (OTP, order confirm)
    if request.priority != CRITICAL:
        if not rateLimiter.tryAcquire(request.userId, request.channel):
            requeueWithDelay(request, delay: 30min)
            return

    // 4. Mark as processing (dedup window = 24h)
    dedup.mark(dedupKey, ttl: 24h)
    record = statusRepo.save({ ...request, status: PENDING })

    // 5. Route to channel worker and track result
    try:
        channelRouter.send(request, record.id)
        statusRepo.update(record.id, SENT)
    catch ChannelError as e:
        statusRepo.update(record.id, FAILED)
        dlq.enqueue(request, record.id, e.message)   // retry up to 3x with backoff

Rate Limiter (Fixed-Window Counter in Redis)

HOURLY_LIMITS = { EMAIL: 3, SMS: 2, PUSH: 10 }

function tryAcquire(userId, channel):
    key = "ratelimit:" + userId + ":" + channel + ":" + currentHour()
    count = redis.incr(key)
    if count == 1: redis.expire(key, 2h)   // 2h window covers the hour boundary
    return count <= HOURLY_LIMITS[channel]
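
Note that INCR + EXPIRE gives a fixed-window counter, not a true token bucket. A token bucket refills continuously and allows controlled bursts; a minimal in-process sketch (class name and parameters are illustrative — in production the refill math would run inside a Redis Lua script for atomicity):

```python
import time

class TokenBucket:
    """Continuous-refill bucket: `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate, self.capacity, self.now = rate, capacity, now
        self.tokens = capacity
        self.last = now()

    def try_acquire(self, n=1):
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Deterministic clock for the example: 2 tokens/sec, burst of 3
clock = [0.0]
bucket = TokenBucket(rate=2, capacity=3, now=lambda: clock[0])
assert [bucket.try_acquire() for _ in range(4)] == [True, True, True, False]
clock[0] = 1.0                           # one second later: +2 tokens refilled
assert bucket.try_acquire() and bucket.try_acquire()
assert not bucket.try_acquire()
```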

Priority Topic Separation

// Critical — OTP, order confirmations
// Dedicated consumer group, not shared with promotional traffic
topic: notifications.critical
  partitions: 20    ← high parallelism for <5s delivery
  retention: 24h

// Promotional — lower priority, processed off-peak
topic: notifications.promotional
  partitions: 5
  retention: 7 days

Trade-offs

| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Delivery guarantee | At-most-once | At-least-once | At-least-once + dedup — OTP must be delivered; dedup prevents double-send |
| Priority handling | Single topic, field-based routing | Separate topics per priority | Separate topics — dedicated consumer groups; promo traffic never delays a critical OTP |
| Rate limiting | Per-user per-hour | Global per-channel | Both — per-user prevents spam; global protects SMS gateway API rate limits |
| Failed notifications | Drop after N retries | Dead-letter queue + alert | DLQ always — OTP must never be silently dropped; alert ops when DLQ depth grows |

Interviewer Q&A

Q: A marketing campaign sends 50 emails to one user in an hour. How do you prevent this?

Rate limiter enforces max 3 emails/user/hour via a Redis counter. Campaign Service pre-computes a delivery schedule that respects rate limits before publishing to Kafka — instead of pushing 100M events simultaneously, it spreads them over hours. If a rate limit is hit, the notification is requeued with a delay, not dropped.

Q: How do you guarantee OTP delivery in under 5 seconds?

Dedicated notifications.critical topic with 20 partitions and its own consumer group, completely separate from promotional traffic. Rate limiting is bypassed for CRITICAL priority. Retry at 2 seconds on failure, max 3 attempts before DLQ + ops alert. Multi-gateway SMS fallback: if the primary provider is slow, auto-route to a backup.
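
The retry-then-DLQ behaviour described above can be sketched as follows (function names are illustrative; the schedule assumes the 2s first retry and max 3 attempts stated in the answer):

```python
def send_with_retry(send, request, max_attempts=3, base_delay_s=2,
                    sleep=lambda s: None):
    """At-least-once send: exponential backoff, dead-letter after max attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request)
        except Exception as exc:
            if attempt == max_attempts:
                return {"status": "DLQ", "error": str(exc)}   # alert ops here
            sleep(base_delay_s * 2 ** (attempt - 1))          # 2s, then 4s

attempts = []
def flaky_gateway(req):
    attempts.append(req)
    if len(attempts) < 3:
        raise RuntimeError("gateway timeout")
    return {"status": "SENT"}

assert send_with_retry(flaky_gateway, {"otp": "123456"}) == {"status": "SENT"}
assert len(attempts) == 3        # succeeded on the third attempt
```

The multi-gateway fallback mentioned above would slot in where `send` is chosen: swap to the backup provider's client after the first failure rather than retrying the slow one.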

Q: How do you track whether a push notification was opened?

When the user taps the notification, the app fires a notification.opened event with the notificationId embedded in the push payload. We correlate and update the DeliveryRecord status from SENT to OPENED. FCM/APNs also return receipt callbacks — DELIVERED, and UNREGISTERED for dead device tokens — which we consume via webhooks and record in the Delivery Status DB.


Senior vs Staff Bar

| Area | Senior SWE | Staff SWE |
|---|---|---|
| Reliability | Rate limit + retry | Full reliability budget: per-channel SLA, alert thresholds, DLQ runbook |
| Scale | Kafka consumers | Consumer lag monitoring, partition rebalancing, back-pressure handling |
| Personalisation | User prefs check | ML-based send-time optimisation — send when the user is most likely to engage |
| Compliance | Opt-out check | TCPA for SMS, CAN-SPAM for email, GDPR right-to-erasure |
| Observability | Status DB | Metrics: delivery rate per channel, p99 latency, bounce rate, DLQ depth |

Master Reference

The HLD Interview Formula

1. Requirements (2 min)
   → Functional: what it does
   → Non-functional: consistency model, scale numbers, latency, availability
   → Ask 2–3 clarifying questions before drawing anything

2. High-level architecture (5 min)
   → Boxes: client, API gateway, services, DBs, cache, event bus
   → Label the data flows

3. Deep dive on the hardest part (15 min)
   → Pick the most interesting component and go deep
   → Show pseudocode if it clarifies the logic

4. Trade-offs (5 min)
   → "I chose X over Y because..."
   → Acknowledge the downside of your choice

5. Failure modes (5 min)
   → "What happens when [service] goes down?"
   → "What happens at 10x load?"

Consistency Decisions — Never Get This Wrong

| Scenario | Model | Rationale |
|---|---|---|
| Checkout: reserve inventory | Strong | No overselling — ever |
| Checkout: payment processing | Strong | No double-charge — ever |
| Checkout: order confirmation | Strong | Order must exist before taking money |
| Product catalog / search | Eventual | 30s stale is fine for browsing |
| Cart contents | Eventual | Session-scoped; user can refresh |
| Fraud feature store | Eventual (1h) | Behaviour signals don't change in seconds |
| Notification delivery | At-least-once | Dedup at consumer; never drop an OTP |
| Watch history / resume | Eventual | 10s stale resume position is imperceptible |

Key Design Patterns

| Pattern | When to Use | Core Idea |
|---|---|---|
| Optimistic Locking | Concurrent writes to the same row | version field in entity; DB rejects stale writes |
| Outbox Pattern | Guarantee Kafka publish after DB write | Write to outbox table atomically; background worker publishes |
| Idempotency Key | Payment, checkout, notifications | Cache result keyed by UUID; return same response on retry |
| Circuit Breaker | Calls to external gateways | Fail fast when downstream is slow; restore automatically |
| Saga (Orchestration) | Multi-service transaction | Central orchestrator; compensating actions on each failure |
| Fixed-Window Counter | Rate limiting | Redis INCR + EXPIRE; atomic, distributed, sub-ms |
| Cache-Aside | Read-heavy, occasionally updated data | Check cache → miss → read DB → populate cache |
| Scheduled Job | TTL cleanup, reconciliation | Beats DB triggers — observable, testable, easy to monitor |
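
As a concrete instance of the Cache-Aside row, a minimal sketch with a dict standing in for Redis and a stub DB read (`db_load` is an illustrative placeholder):

```python
cache, db_reads = {}, []

def db_load(key):
    db_reads.append(key)              # track how often we hit the DB
    return {"sku": key, "qty": 7}

def get_with_cache(key):
    """Cache-aside: check cache -> miss -> read DB -> populate cache."""
    if key in cache:
        return cache[key]
    value = db_load(key)
    cache[key] = value                # in Redis this write would carry a TTL
    return value

first = get_with_cache("sku-1")       # miss: hits the DB
second = get_with_cache("sku-1")      # hit: served from cache
assert first == second and db_reads == ["sku-1"]
```

The application owns the cache population, which is what distinguishes cache-aside from read-through (where the cache layer loads from the DB itself).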

High Traffic Event Checklist

When the interviewer asks "How do you handle 10x or 100x peak load?":

  • Pre-warm cache the night before — hot SKUs, user features, popular content segments
  • Pre-scale read replicas, gateway connections, and Kafka consumer groups before the window opens
  • Circuit breakers on all downstream calls — fail fast, don't cascade
  • Load shedding — 503 + Retry-After header beats a cascading failure
  • Feature flags — kill non-critical features (recommendations, personalisation) under extreme load
  • Freeze deploys 48 hours before — canary deploy any changes well in advance
  • War room — dedicated on-call team, pre-staged runbooks, clear escalation paths