Case Studies
Seven real-world system designs walked end-to-end — from requirements to architecture to the component decisions that separate a passing answer from a strong one.
- Covers live streaming, inventory, payments, checkout, video streaming, fraud detection, and notifications
- Every case study includes an architecture diagram, trade-offs table, interviewer Q&A, and a Senior vs Staff comparison
- The Master Reference at the end has the HLD interview formula, consistency decision rules, and key design patterns in quick-scan tables
Case studies in this guide
| # | Title |
|---|---|
| 1 | Live Comment Streaming |
| 2 | Inventory Management System |
| 3 | Payment Settlement System |
| 4 | Checkout System |
| 5 | Video Streaming (VOD) |
| 6 | Fraud Detection |
| 7 | Notification System |
Case Study 1: Live Comment Streaming
The problem: Millions of users watching a live video need to receive comments in real time. Two properties make this hard at scale:
- Read traffic is massive — millions of viewers receiving comments
- Write traffic is small — far fewer people actually posting
So we separate them. A dedicated layer of Realtime Messaging Servers (RMS) handles pushing comments to viewers, letting us scale reads and writes independently.
The Players
NGINX — Load Balancer
Sits at the front. Routes each user connection to an RMS. Once the connection is handed off, NGINX is out of the picture — it never touches comments.
Realtime Messaging Server (RMS)
Where all the logic lives. Each RMS:
- Accepts user connections
- Maintains a local in-memory map of who is watching what
- Subscribes to a message backend (Redis or Kafka) to receive new comments
- Pushes comments to the right users over open persistent connections
Redis / Kafka — Message Backend
Carries comments from the comment service to the RMS servers. The choice of backend depends on the approach (see below).
SSE vs WebSocket for Live Comments
The user's browser opens a persistent connection to the RMS. The RMS holds it open and writes to it whenever a new comment arrives. Two technologies can do this:
| | SSE | WebSocket |
|---|---|---|
| Direction | Server → client only | Full duplex |
| Auto reconnect | ✅ Built in | ❌ Write it yourself |
| Catch-up on reconnect | ✅ Last-Event-ID built in | ❌ Build it yourself |
| Setup complexity | Low — plain HTTP | Higher — protocol upgrade |
| Best for | Live feeds, notifications | True bidirectional chat |
For live comments, SSE is the right choice. Viewers only receive — posting a comment is a separate regular HTTP POST. No need for the complexity of WebSockets.
How the RMS Pushes Comments Back
The RMS does not dial back to the user's IP. The client always initiates the connection. The RMS holds that open connection as an object and writes to it when a comment arrives:
// When a user connects and says "watching video3"
SseEmitter emitter = new SseEmitter();
emitterMap.computeIfAbsent("video3", k -> new CopyOnWriteArrayList<>()).add(emitter);
// When a comment arrives for video3, write it to every open connection
for (SseEmitter e : emitterMap.getOrDefault("video3", List.of())) {
e.send(comment);
}
Note on X-Forwarded-For: NGINX passes the real client IP via this header, useful for rate limiting and logging. But for pushing messages back, no IP is needed — the RMS already holds the connection object.
X-Forwarded-For: 203.0.113.5, 10.0.0.1
                 ↑ real user  ↑ proxy
Handling Dropped Connections and Catch-Up
Connections drop — bad WiFi, phone going to sleep, network switches. The RMS detects the drop and cleans up:
emitter.onCompletion(() -> emitterMap.get("video3").remove(emitter));
emitter.onTimeout(() -> emitterMap.get("video3").remove(emitter));
emitter.onError(e -> emitterMap.get("video3").remove(emitter));
Reconnect: SSE EventSource reconnects automatically. No client-side code needed.
Catch-up on missed comments: SSE assigns an ID to every message:
id: 1001
data: {"comment": "hello"}
id: 1002
data: {"comment": "great video"}
When the browser reconnects, it automatically sends Last-Event-ID: 1002. The RMS fetches all comments after ID 1002 from the database, sends them first, then resumes the live feed.
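A minimal sketch of this in Python — the `id:`/`data:` field names and blank-line terminator come from the SSE spec, while the helper names and the `fetch_comments_after` callback are illustrative:

```python
def sse_event(event_id: int, data: str) -> str:
    """Format one SSE frame: an id line, a data line, and a blank-line terminator.
    The browser remembers the last id and resends it as Last-Event-ID on reconnect."""
    return f"id: {event_id}\ndata: {data}\n\n"

def replay(last_event_id: int, fetch_comments_after) -> str:
    """On reconnect, replay everything after the client's Last-Event-ID, then
    resume the live feed. fetch_comments_after is a hypothetical DB lookup that
    returns (comment_id, payload) pairs in ID order."""
    frames = [sse_event(c_id, body) for c_id, body in fetch_comments_after(last_event_id)]
    return "".join(frames)
```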
HTTP catch-up as an alternative: For more control, the client can track the last comment ID locally (localStorage on web, app storage on mobile) and explicitly request catch-up via HTTP:
GET /comments/{videoId}?since={lastCommentId}&limit=100
This lets the client control the UX — instead of dumping missed comments instantly, it can animate them smoothly at 2-3x speed or show "You missed 47 comments" with an option to jump straight to live.
Replay limits: Set a practical cap — replaying the last 5 minutes is reasonable. For longer gaps, degrade gracefully: show "You were away for a while — jump to live?" rather than flooding the user with thousands of stale comments.
Cross-server catch-up: When a user reconnects, load balancing may route them to a different RMS than before. That new server has no local state for this user. Solution: keep recent comments for each video in a shared Redis cache (e.g. comments:video3 as a sorted set by ID). Any RMS can replay recent comments for any video without hitting the DB. Redis TTL of 10-15 minutes covers the practical reconnect window.
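The sorted-set replay can be sketched with an in-memory stand-in for Redis — with real Redis this would be `ZADD` / `ZRANGEBYSCORE` on keys like `comments:video3`; the class and method names here are illustrative:

```python
import bisect

class RecentComments:
    """In-memory stand-in for the per-video Redis sorted set (score = comment ID)."""
    def __init__(self):
        self.by_video = {}  # video_id -> sorted list of (comment_id, payload)

    def add(self, video_id: str, comment_id: int, payload: str) -> None:
        # Comment IDs are monotonically increasing, so the list stays sorted
        bisect.insort(self.by_video.setdefault(video_id, []), (comment_id, payload))

    def replay_after(self, video_id: str, last_id: int):
        # Everything strictly after last_id — what ZRANGEBYSCORE (last_id +inf returns
        return [(cid, p) for cid, p in self.by_video.get(video_id, []) if cid > last_id]
```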
Deduplication: If comments arrive via SSE while an HTTP catch-up response is still in flight, the client may receive the same comment twice. The client must deduplicate by comment ID before rendering — merge the two streams and discard duplicates.
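The merge-and-dedup step can be sketched as follows — the comment shape (`{"id": ..., "text": ...}`) is an assumption for illustration:

```python
def merge_streams(catchup: list, live: list) -> list:
    """Merge HTTP catch-up and SSE live comments, dropping duplicates by
    comment ID and rendering in ID order."""
    seen = set()
    merged = []
    for comment in sorted(catchup + live, key=lambda c: c["id"]):
        if comment["id"] in seen:
            continue  # already received via the other stream
        seen.add(comment["id"])
        merged.append(comment)
    return merged
```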
Mobile proactive disconnect: Rather than waiting for the OS to kill the SSE connection when the app backgrounds, the client can proactively disconnect when it detects an onPause / willResignActive event and record the last comment ID. On foreground, reconnect and catch up. More efficient than maintaining a connection the OS will throttle anyway.
| | SSE Last-Event-ID | HTTP ?since= catch-up |
|---|---|---|
| Automatic | ✅ Browser handles it | ❌ Client code needed |
| UX control | ❌ Dumps missed comments | ✅ Can animate or summarize |
| Works across servers | Needs shared Redis/DB | Needs shared Redis/DB |
The Full Flow
User B opens browser
│
▼
NGINX routes to an RMS (based on videoId)
│
▼
RMS accepts SSE connection from User B
│
├── adds to local map: { video3: [userB] }
└── subscribes to message backend for video3 comments
│
▼
Someone posts a comment on video3
│
▼
Comment Service publishes to Redis / Kafka
│
▼
RMS receives comment → checks local map
→ pushes to User B via open SSE connection
The Scaling Problem
At small scale, one RMS handles everything fine. At large scale (millions of viewers, many RMS servers):
When a comment is posted, how does the right RMS know about it?
This is the pub/sub problem. Three approaches solve it with different tradeoffs.
Approach 1 — Naive Pub/Sub
Every RMS subscribes to a single shared channel from the start. When any comment is posted, every RMS receives it, checks its local map, and discards it if none of its users are watching that video.
Comment on Video 3 posted
│
▼
Published to one channel → received by ALL servers
Server A has video3 viewers ✅ pushes to User B
Server B no video3 viewers ❌ discards
Server C no video3 viewers ❌ discards
| Backend | Role |
|---|---|
| Kafka | One topic — all RMS servers consume every message |
| Redis | One channel (comments:all) — all RMS servers subscribe |
Tradeoffs:
- ✅ Simple to build — no routing logic
- ✅ Easy to scale — just subscribe to the same channel
- ❌ Wasteful — every server processes every comment regardless of relevance
- ❌ Doesn't scale at Facebook/YouTube traffic levels
Use when: Early prototype or platforms with very few concurrent live videos.
Approach 2 — Smarter Subscriptions + Co-location
Videos are bucketed into N channels using a hash:
channel = hash(videoId) % N
hash("video3") % 100 = channel_7
hash("video9") % 100 = channel_7 ← same channel, different video
hash("video5") % 100 = channel_42
When User B connects and says "watching Video 3":
- RMS computes hash("video3") % N → channel_7
- RMS subscribes to channel_7 in Redis (if not already subscribed)
- RMS adds User B to its local map
Only servers subscribed to channel_7 receive those comments.
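One subtlety worth knowing: the hash must be stable across servers, or two RMS instances would compute different channels for the same video. A minimal sketch (MD5 here is an illustrative choice; Python's built-in `hash()` is salted per process and would not work):

```python
import hashlib

def channel_for(video_id: str, n_channels: int = 100) -> str:
    """Map a video to one of N comment channels using a stable hash, so every
    RMS — and the comment service publishing — agree on the channel."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return f"channel_{int(digest, 16) % n_channels}"
```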
The co-location problem: With round-robin load balancing, a single RMS could end up with viewers watching hundreds of different videos, forcing it to subscribe to hundreds of channels — killing the efficiency gain.
Fix — consistent hashing on NGINX: Route all viewers of the same video to the same RMS. Everyone watching Video 3 always lands on Server A.
Everyone watching Video 3 → always Server A → subscribes to channel_7
Everyone watching Video 9 → always Server B → subscribes to channel_7
Everyone watching Video 5 → always Server C → subscribes to channel_42
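In NGINX this is the `hash … consistent` upstream directive — a sketch, where the server names and the assumption that the client passes the video ID as a `video_id` query parameter are illustrative:

```nginx
upstream rms_backend {
    # Ketama consistent hashing on the video ID: all viewers of one video
    # land on the same RMS, with minimal remapping when servers are added/removed.
    hash $arg_video_id consistent;
    server rms-a.internal:8080;
    server rms-b.internal:8080;
    server rms-c.internal:8080;
}
```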
| Backend | Fit | Why |
|---|---|---|
| Redis | ✅ Best fit | Subscribe/unsubscribe instantly, low latency, handles users switching videos cleanly |
| Kafka | ⚠️ Painful | Consumer group rebalancing is slow — not designed for dynamic subscriptions |
Redis fire-and-forget is fine here because comments are persisted to the DB the moment they're posted. Anything missed during a brief disconnect is recovered via Last-Event-ID catch-up.
Tradeoffs:
- ✅ Much more efficient — servers only receive relevant comments
- ✅ Redis handles dynamic subscriptions cleanly
- ✅ Dropped Redis messages are fine — DB has them, catch-up handles the rest
- ❌ NGINX needs consistent hashing configured, not just round-robin
- ❌ Hash collisions mean servers occasionally receive some irrelevant comments
- ❌ Kafka is a poor fit due to slow rebalancing
Use when: Medium to large scale. Redis is the right backbone. Best answer in an interview.
Approach 3 — Dispatcher Service
Instead of RMS servers subscribing and pulling, a Dispatcher Service pushes comments directly to only the servers that need them.
When User B connects and says "watching Video 3", the RMS registers with the Dispatcher: "I have a viewer for Video 3." The Dispatcher maintains a live routing map:
{
"video3": [Server A, Server C],
"video9": [Server B]
}
When a comment on Video 3 is posted:
Comment Service → asks Dispatcher "who has Video 3 viewers?"
│
▼
Dispatcher looks up map → Server A, Server C
│
▼
Pushes comment ONLY to Server A and Server C
│
▼
Server A pushes to User B via SSE
The map stays accurate via heartbeats — RMS servers ping the Dispatcher regularly to confirm which videos their users are watching. Multiple Dispatcher instances run in parallel, sharing the same routing map via Zookeeper or etcd.
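The heartbeat-driven routing map can be sketched in memory — in production this state would live in Zookeeper or etcd, and the 15-second liveness window is an illustrative choice:

```python
class DispatcherMap:
    """Sketch of the Dispatcher's routing map, kept fresh by RMS heartbeats.
    A server is routed to only while its heartbeat is recent."""
    def __init__(self, liveness_window: float = 15.0):
        self.liveness_window = liveness_window
        self.last_seen = {}  # (video_id, server) -> last heartbeat timestamp

    def heartbeat(self, server: str, video_ids: list, now: float) -> None:
        for vid in video_ids:
            self.last_seen[(vid, server)] = now

    def servers_for(self, video_id: str, now: float) -> list:
        # Entries whose heartbeat has lapsed are treated as gone
        return sorted(server for (vid, server), seen in self.last_seen.items()
                      if vid == video_id and now - seen < self.liveness_window)
```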
Tradeoffs:
- ✅ Zero wasted messages — only the right servers receive each comment
- ✅ No subscription management on the RMS side
- ✅ Centralized routing — easy to add smart rules like load-based routing
- ❌ Dispatcher is a new component that must be highly available
- ❌ Keeping the map consistent during sudden traffic spikes is hard
- ❌ More infrastructure — Zookeeper or etcd on top of everything else
Use when: Massive scale with strict efficiency requirements. Mention as an advanced option if the interviewer pushes.
Approach Comparison
| | Approach 1 | Approach 2 | Approach 3 |
|---|---|---|---|
| RMS subscribes to | Everything | Only relevant channels | Nothing — Dispatcher pushes |
| Wasted messages | ❌ Tons | ⚠️ Some (hash collisions) | ✅ Zero |
| Best backend | Kafka or Redis | Redis | Zookeeper / etcd |
| Kafka works? | ✅ Yes | ⚠️ Slow rebalancing | ⚠️ Buffer only |
| Complexity | Low | Medium | High |
| Use when | Prototype | Medium–large scale | Massive scale |
Kafka vs Redis for Live Comments
| | Kafka | Redis Pub/Sub |
|---|---|---|
| Durability | ✅ Persisted to disk | ❌ Fire-and-forget |
| Latency | ⚠️ Higher | ✅ Very low |
| Dynamic subscriptions | ❌ Slow rebalancing | ✅ Instant |
| Best for | Durable event streams, audit logs | Real-time fan-out, live messaging |
| Fits live comments? | ⚠️ Only with stable subscriptions | ✅ Yes |
Interview Recommendation
Go with Approach 2 using Redis. It's efficient, handles users switching videos cleanly, and is straightforward to explain. Mention Approach 3 only if the interviewer asks how you'd handle Facebook or YouTube scale.
Case Study 2: Inventory Management System
The problem: Track stock levels across thousands of stores and hundreds of millions of SKUs. Prevent overselling at checkout while keeping catalog display fast under massive read traffic.
Requirements
Functional:
- Track stock per SKU per store and warehouse
- Update inventory on: purchase, return, restock, damage/write-off
- Query: "Is SKU X available at store Y?" and "How many units across region Z?"
- Reserve inventory at checkout, confirm on payment, release on timeout
Non-functional:
- Consistency: Strong at checkout (no overselling), eventual for catalog display
- Scale: 10,000+ stores, 100M+ SKUs, billions of transactions/day
- Availability: 99.99%
- Latency: <50ms availability check, <200ms reservation
Clarifying questions to ask:
- "Is the source of truth per-store or globally aggregated?"
- "Do we need real-time sync across stores or is eventual ok for display?"
- "What's the SLA for reservation expiry?" — drives lock design
Architecture
┌─────────────────────────────────────────────┐
│ API Gateway │
└───────────────────┬─────────────────────────┘
│
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
┌─────────▼────────┐ ┌────────────▼────────┐ ┌────────────▼──────────┐
│ Inventory Read │ │ Inventory Write │ │ Reservation Service │
│ Service │ │ Service │ │ │
│ (high read QPS) │ │ (strong consistency) │ │ (checkout flow) │
└─────────┬────────┘ └────────────┬────────┘ └────────────┬──────────┘
│ │ │
┌─────────▼────────┐ ┌────────────▼────────┐ ┌────────────▼──────────┐
│ Redis Cache │ │ Inventory DB │ │ Reservation DB │
│ (hot SKUs) │ │ (sharded by │ │ (TTL per session) │
│ │ │ store_id) │ │ │
└──────────────────┘ └────────────┬────────┘ └───────────────────────┘
│
┌────────────▼────────┐
│ Kafka Event Bus │
│ topic: inv.updated │
└────────────┬────────┘
┌──────────────────┴────────────────┐
┌──────────▼──────────┐ ┌──────────▼──────────┐
│ Search Index Sync │ │ Analytics Pipeline │
│ (Elasticsearch) │ │ (replenishment, │
└─────────────────────┘ │ forecasting) │
└─────────────────────┘
Component Design
Inventory DB Schema
-- Sharded by store_id
CREATE TABLE inventory (
store_id VARCHAR(36) NOT NULL,
sku_id VARCHAR(36) NOT NULL,
quantity INT NOT NULL DEFAULT 0,
reserved_qty INT NOT NULL DEFAULT 0,
version BIGINT NOT NULL DEFAULT 0, -- optimistic locking
updated_at TIMESTAMP NOT NULL,
PRIMARY KEY (store_id, sku_id)
);
CREATE TABLE reservation (
reservation_id VARCHAR(36) PRIMARY KEY,
store_id VARCHAR(36) NOT NULL,
sku_id VARCHAR(36) NOT NULL,
quantity INT NOT NULL,
session_id VARCHAR(36) NOT NULL,
expires_at TIMESTAMP NOT NULL,
status ENUM('ACTIVE','CONFIRMED','EXPIRED','RELEASED')
);
Reservation Service
function reserve(storeId, skuId, qty, sessionId):
for attempt in range(maxRetries = 3):
try:
return attemptReservation(storeId, skuId, qty, sessionId)
catch OptimisticLockConflict:
if attempt == maxRetries - 1:
throw InventoryConflict("Failed after max retries")
sleep(50ms * (attempt + 1))
function attemptReservation(storeId, skuId, qty, sessionId):
inv = db.findInventory(storeId, skuId)
available = inv.quantity - inv.reservedQty
if available < qty:
throw InsufficientStock("Only " + available + " available")
inv.reservedQty += qty
inv.version += 1
db.saveWithVersionCheck(inv) // throws OptimisticLockConflict if version changed
reservation = { id: newUUID(), storeId, skuId, qty, sessionId,
expiresAt: now() + 10min, status: ACTIVE }
db.save(reservation)
kafka.send("inventory.reserved", { storeId, skuId, qty })
return reservation
function confirmReservation(reservationId):
res = db.findReservation(reservationId)
if res.status != ACTIVE: throw IllegalState(res.status)
inv = db.findInventory(res.storeId, res.skuId)
inv.quantity -= res.quantity
inv.reservedQty -= res.quantity
db.save(inv)
res.status = CONFIRMED
db.save(res)
kafka.send("inventory.updated", { storeId: res.storeId, skuId: res.skuId, event: "SOLD" })
// Scheduled job — runs every 60 seconds
function releaseExpiredReservations():
expired = db.findAll(status: ACTIVE, expiresAt < now())
for res in expired:
inv = db.findInventory(res.storeId, res.skuId)
inv.reservedQty -= res.quantity
db.save(inv)
res.status = EXPIRED
db.save(res)
Caching Strategy
// Cache-aside for catalog display — NOT used at checkout
function getAvailableQty(storeId, skuId):
cacheKey = "inv:" + storeId + ":" + skuId
cached = redis.get(cacheKey)
if cached != null: return cached
inv = db.findInventory(storeId, skuId)
available = inv.quantity - inv.reservedQty
redis.set(cacheKey, available, ttl: 30s)
return available
// Called by Kafka consumer after any write
function invalidateCache(storeId, skuId):
redis.delete("inv:" + storeId + ":" + skuId)
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Locking at checkout | Optimistic (version field) | Pessimistic (SELECT FOR UPDATE) | Optimistic — low contention for most SKUs; pessimistic locks rows and hurts throughput at scale |
| Cache invalidation | TTL-based (30s) | Event-driven (on write) | Both — TTL as safety net + Kafka event to invalidate immediately on update |
| Consistency at checkout | Read from DB | Read from cache | DB always at checkout — stale cache = oversell |
| Sharding key | store_id | sku_id | store_id — checkout queries are store-scoped; avoids cross-shard joins |
| Reservation expiry | Scheduled job | DB TTL / triggers | Scheduled job — observable, easy to monitor; DB triggers are hidden and hard to debug |
| Event bus | Kafka | RabbitMQ | Kafka at scale — durable replay, ordered partitions by store_id, consumer lag visibility |
Interviewer Q&A
Q: What happens if a reservation expires but the user already paid?
Payment and reservation services are separate. If payment succeeds after expiry, check at `confirmReservation()` whether actual stock still exists — not just reserved qty. If it does: confirm the sale. If not: trigger order cancellation and full refund. Minimize this edge case by extending the TTL when payment is initiated.
Q: How do you handle 100x traffic on a peak shopping day?
Three layers: (1) Pre-warm Redis with top projected hot SKUs the night before. (2) Pre-scale read replicas and DB connection pools before traffic ramps. (3) Circuit breaker — if DB latency exceeds 200ms, serve from cache for display; still require DB at checkout. Add a virtual queue for checkout if the reservation service hits capacity.
Q: What if two users try to reserve the last item simultaneously?
Optimistic locking handles this exactly. Both read `version=5`; both try to write `version=6`. The DB rejects the second write (version mismatch). The second request retries, now sees available qty is zero, and throws `InsufficientStock`. The user sees "Sorry, this item just sold out."
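The race can be simulated in a few lines — a minimal model of the version check, where the class and helper names are illustrative:

```python
class VersionConflict(Exception):
    pass

class InventoryRow:
    """Minimal model of the inventory row's optimistic-locking behavior."""
    def __init__(self, quantity: int):
        self.quantity, self.reserved_qty, self.version = quantity, 0, 5

    def save_with_version_check(self, read_version: int, new_reserved: int) -> None:
        # The DB-level compare-and-set: UPDATE ... WHERE version = read_version
        if self.version != read_version:
            raise VersionConflict()
        self.reserved_qty, self.version = new_reserved, read_version + 1

def reserve_last_item(row: InventoryRow, read_version: int, read_reserved: int) -> str:
    if row.quantity - read_reserved < 1:
        return "sold_out"
    row.save_with_version_check(read_version, read_reserved + 1)
    return "reserved"
```

Two users read the same snapshot (version 5, one unit left): the first write wins, the second raises `VersionConflict`, and its retry against the fresh row sees zero availability.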
Q: How would you design inventory updates for returns?
`ReturnService` publishes an `inventory.return` event to Kafka. `InventoryWriteService` consumes it and increments quantity. Idempotency: each return has a `return_id` — check whether it was already processed before incrementing. The cache is invalidated immediately so the item shows as available again.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| Consistency | "Use DB at checkout" | Explains CAP trade-off, discusses CRDT for multi-region eventual consistency |
| Locking | Uses optimistic locking | Discusses lock contention at scale, proposes inventory partitioning to reduce hotspots |
| Scale | "Shard by store_id" | Discusses resharding strategy as stores grow, hotspot detection, cross-shard queries |
| Failure | "Reservation expires" | Proposes full saga compensation chain with dead-letter queue and ops dashboard |
Case Study 3: Payment & Settlement System
The problem: Process payments reliably at scale with exactly-once semantics — a double charge or a missed charge has direct financial and legal consequences.
Requirements
Functional:
- Accept payment (card, PayPal, digital wallet, gift card)
- Route to appropriate payment gateway
- Confirm or decline transaction
- Handle refunds, partial refunds, chargebacks
- Settlement: reconcile with banks/networks daily
Non-functional:
- Exactly-once processing — never double-charge
- Latency: <500ms p99 end-to-end
- Availability: 99.999% (five nines)
- Compliance: PCI-DSS Level 1
- Auditability: Every state change permanently recorded
Clarifying questions to ask:
- "Are we building the gateway or integrating with existing ones (Stripe/Adyen)?"
- "Do we need multi-currency support?"
- "What's the chargeback SLA?" — drives settlement design
Architecture
┌────────────────────────────┐
│ Checkout Service │
└────────────┬───────────────┘
│ POST /payments {idempotency_key, amount, token}
┌────────────▼───────────────┐
│ Payment Service │
│ - Idempotency check │
│ - Validation │
│ - Route to gateway │
└────────────┬───────────────┘
┌──────────────────────┼──────────────────────┐
│ │ │
┌──────────▼───────┐ ┌──────────▼───────┐ ┌──────────▼───────┐
│ Card Gateway │ │ PayPal Gateway │ │ Digital Wallet │
│ (Visa/MC/Amex) │ │ │ │ Gateway │
└──────────┬───────┘ └──────────┬───────┘ └──────────┬───────┘
└──────────────────────┼──────────────────────┘
│ Response: APPROVED / DECLINED
┌────────────▼───────────────┐
│ Ledger DB │
│ (append-only, immutable) │
└────────────┬───────────────┘
│
┌────────────▼───────────────┐
│ Kafka │
│ topic: payment.events │
└──────┬──────────┬──────────┘
┌─────────────▼──┐ ┌────▼──────────────────┐
│ Order Service │ │ Settlement Service │
│ (confirm) │ │ (async reconciliation)│
└────────────────┘ └───────────────────────┘
Component Design
Ledger DB Schema
-- Append-only — NEVER UPDATE or DELETE
CREATE TABLE payment_ledger (
ledger_id VARCHAR(36) PRIMARY KEY,
payment_id VARCHAR(36) NOT NULL,
idempotency_key VARCHAR(128) NOT NULL UNIQUE,
event_type ENUM('INITIATED','AUTHORIZED','CAPTURED',
'DECLINED','PENDING_VERIFICATION',
'REFUND_INITIATED','REFUNDED',
'CHARGEBACK') NOT NULL,
amount DECIMAL(12,2) NOT NULL,
currency VARCHAR(3) NOT NULL DEFAULT 'USD',
gateway VARCHAR(50) NOT NULL,
gateway_ref VARCHAR(128),
user_id VARCHAR(36) NOT NULL,
order_id VARCHAR(36) NOT NULL,
metadata JSONB,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_payment_id ON payment_ledger(payment_id);
CREATE INDEX idx_order_id ON payment_ledger(order_id);
CREATE INDEX idx_idem_key ON payment_ledger(idempotency_key);
Payment Service
function processPayment(request):
// 1. Idempotency check — return cached result if already processed
cached = idempotencyStore.get(request.idempotencyKey)
if cached exists: return cached
// 2. Validate: amount > 0, token present, currency supported
validate(request)
// 3. Write INITIATED to ledger before calling gateway — always have a record
paymentId = newUUID()
ledger.save({ paymentId, idempotencyKey: request.idempotencyKey,
eventType: INITIATED, amount: request.amount, ... })
try:
gateway = router.route(request.paymentMethod)
response = gateway.authorize(request.token, request.amount, request.currency)
if response.approved:
ledger.save({ paymentId, eventType: AUTHORIZED, gatewayRef: response.txId })
// Outbox: write event atomically with ledger; background worker publishes to Kafka
outbox.save({ topic: "payment.events", payload: { paymentId, event: "AUTHORIZED" } })
result = Result.success(paymentId, response)
else:
ledger.save({ paymentId, eventType: DECLINED,
metadata: { declineCode: response.declineCode } })
result = Result.declined(response.declineCode)
idempotencyStore.put(request.idempotencyKey, result, ttl: 24h)
return result
catch GatewayTimeout:
// Status unknown — do NOT tell client to retry with a new idempotency key
ledger.save({ paymentId, eventType: PENDING_VERIFICATION,
metadata: { reason: "gateway_timeout" } })
throw PaymentPending(paymentId, "Do not retry with new key")
function refund(paymentId, amount, reason):
latest = ledger.findLatest(paymentId)
if latest.eventType not in [AUTHORIZED, CAPTURED]:
throw IllegalState("Cannot refund in state: " + latest.eventType)
totalRefunded = ledger.sumRefunds(paymentId)
if totalRefunded + amount > latest.amount:
throw IllegalArg("Refund exceeds original payment")
gateway = router.routeByRef(latest.gatewayRef)
gateway.refund(latest.gatewayRef, amount)
ledger.save({ paymentId, eventType: REFUNDED, amount: -amount, metadata: { reason } })
outbox.save({ topic: "payment.events", payload: { paymentId, event: "REFUNDED" } })
Outbox Worker
// Runs every 5 seconds — guarantees Kafka delivery even if service crashes after DB write
function processOutbox():
pending = outbox.findUnpublished()
for event in pending:
try:
kafka.send(event.topic, event.key, event.payload)
event.published = true
outbox.save(event)
catch:
// Retry next run — consumer idempotency handles dedup on the other side
log.error("Failed to publish outbox event: " + event.id)
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Event delivery | Kafka direct publish | Outbox pattern | Outbox — crash after DB write but before Kafka publish loses the event; outbox guarantees at-least-once |
| Idempotency store | Redis | DB table | Redis for speed + DB as fallback; Redis TTL must exceed the max retry window |
| Ledger storage | Append-only SQL | Event store | Append-only SQL — simpler to operate; event sourcing adds complexity without proportional benefit for payments |
| Refund path | Sync (wait for gateway) | Async (queue + notify) | Async — refund SLA is hours/days; queue it, retry on failure, notify when done |
| Gateway timeout | Fail the payment | Mark PENDING, reconcile async | Pending — failing causes user to retry with new key, potentially double-charging |
| PCI scope | Store card data yourself | Tokenize via gateway | Tokenize always — never store raw card numbers; gateway handles PCI Level 1 |
Interviewer Q&A
Q: What if the gateway times out — did the charge go through?
Unknown. Mark it `PENDING_VERIFICATION`. Do NOT tell the client to retry with a new idempotency key — that could create a second charge. A reconciliation job runs every 5 minutes and queries the gateway by our reference ID to get the true status. If approved: update the ledger, confirm the order. If declined: update the ledger, notify the user. This is why we write `INITIATED` to the ledger before calling the gateway — we always have a record.
Q: How do you guarantee exactly-once processing?
Exactly-once = at-least-once + idempotent consumer. Idempotency key on the way in prevents re-processing. Outbox pattern + Kafka consumer idempotency on the way out ensures downstream services don't double-process even if we publish twice.
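The consumer half of that equation can be sketched as follows — in production the processed-key set would be a DB table or Redis with an appropriate retention window, not process memory; names are illustrative:

```python
class OrderConfirmer:
    """Idempotent consumer sketch: at-least-once delivery plus dedup by
    payment_id yields effectively exactly-once order confirmation."""
    def __init__(self):
        self.processed = set()
        self.confirmations = 0

    def on_payment_event(self, event: dict) -> bool:
        key = event["payment_id"]
        if key in self.processed:
            return False  # duplicate delivery from Kafka — already handled
        self.processed.add(key)
        self.confirmations += 1  # stand-in for confirming the order
        return True
```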
Q: How do you handle a chargeback 60 days later?
The gateway sends a webhook. `ChargebackService` looks up the original `payment_id` via the gateway reference, writes a `CHARGEBACK` event to the append-only ledger (the original charge is preserved), flags the user for fraud review, and settlement reconciliation adjusts the bank records for that day.
Q: How do you design for PCI-DSS compliance?
Four layers: (1) Tokenization — client sends card directly to gateway SDK, we only receive a token. (2) Network isolation — Payment Service in its own VPC, strict egress to approved gateway IPs only. (3) Encryption — AES-256 at rest, TLS 1.3 in transit. (4) No sensitive data in logs — mask card tokens and sensitive fields before logging.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| Idempotency | Implements idempotency key | Discusses clock skew, TTL expiry edge cases, cross-datacenter key synchronisation |
| Consistency | Uses outbox pattern | Explains why 2PC fails at scale, proposes saga + compensating transactions |
| PCI | "Tokenize card data" | Designs full PCI scope reduction: network segmentation, audit logging, key rotation strategy |
| Failure modes | Handles timeout | Proposes reconciliation architecture, SLA-based alerting, ops runbook |
| Settlement | "Async settlement" | Designs full pipeline: daily batch, bank file formats (ACH/NACHA), exception handling |
Case Study 4: Checkout System
The problem: Coordinate inventory reservation, order creation, and payment as a single logical transaction across three independent services — without distributed transactions.
Requirements
Functional:
- Cart management (add/remove/update items)
- Apply coupons, promotions, loyalty discounts
- Select shipping or pickup option
- Process payment and confirm order
Non-functional:
- Must not oversell (inventory) or double-charge (payment)
- Idempotent end-to-end — user can click Pay multiple times safely
- <2s total checkout latency
Clarifying questions to ask:
- "Is cart stored server-side or client-side?"
- "Do promotions stack? What's the ordering?"
- "Is split payment supported (gift card + credit card)?"
Architecture — Saga Pattern
Client
│
▼
Cart Service (Redis, session-scoped)
│
└──[POST /checkout/confirm]──► Checkout Orchestrator
│
SAGA: Sequential with Compensation
│
├── Step 1: Reserve Inventory ────► on fail: no-op
├── Step 2: Create Order ────► on fail: release reservation
├── Step 3: Process Payment ────► on fail: cancel order +
│ release reservation
└── Step 4: Confirm + Notify ────► on fail: refund payment +
cancel order +
release reservation
Component Design
Checkout Saga Orchestrator
function checkout(request):
checkoutId = request.checkoutId // server-generated idempotency key, session-scoped
reservation = null
order = null
try:
// Step 1: Reserve inventory
reservation = inventoryService.reserve(request.storeId,
request.items, checkoutId)
// Step 2: Create order in PENDING state
order = orderService.createOrder(request.userId, request.items,
reservation.id, checkoutId)
// Step 3: Process payment
payment = paymentService.processPayment({
idempotencyKey: checkoutId + ":payment",
amount: request.totalAmount,
token: request.paymentToken,
orderId: order.id
})
if not payment.success:
// Compensate steps 1 and 2
orderService.cancelOrder(order.id, "PAYMENT_DECLINED")
inventoryService.releaseReservation(reservation.id)
return Result.declined(payment.declineReason)
// Step 4: Confirm everything and notify
inventoryService.confirmReservation(reservation.id)
orderService.confirmOrder(order.id, payment.paymentId)
notificationService.sendOrderConfirmation(request.userId, order.id)
return Result.success(order.id)
catch InsufficientStock as e:
return Result.outOfStock(e.skuId) // nothing to compensate
catch PaymentPending as e:
// Gateway timed out — hold, don't cancel
orderService.markPendingVerification(order.id, e.paymentId)
return Result.pending("Payment is being verified. We'll email you.")
catch Exception as e:
// Full compensation on unexpected failure
if order != null: orderService.cancelOrder(order.id, "SYSTEM_ERROR")
if reservation != null: inventoryService.releaseReservation(reservation.id)
throw CheckoutException("Checkout failed, please try again")
Cart Service
function addItem(sessionId, skuId, qty):
cart = redis.get("cart:" + sessionId) or new Cart()
// Soft check for UX — real lock happens at checkout, not here
available = inventoryRead.getAvailableQty(cart.storeId, skuId)
if available < qty: throw InsufficientStock(skuId)
cart.addItem(skuId, qty)
redis.set("cart:" + sessionId, cart, ttl: 24h)
function getCheckoutSummary(sessionId):
cart = redis.get("cart:" + sessionId)
// Re-validate all items — cart may be stale
unavailable = cart.items
.filter(item => inventoryRead.getAvailableQty(cart.storeId, item.skuId) < item.qty)
if unavailable not empty: throw ItemsUnavailable(unavailable)
return buildSummary(cart)
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Coordination pattern | Orchestration (central orchestrator) | Choreography (event-driven) | Orchestration for checkout — easy to trace, clear compensation logic. Choreography suits fire-and-forget flows like notifications |
| Cart storage | Redis (session-scoped) | SQL DB (persistent) | Redis — cart is ephemeral, needs sub-ms reads, high churn. DB adds unnecessary write pressure |
| Inventory check timing | Only at checkout | At add-to-cart + checkout | Both — soft check at add-to-cart for UX; hard lock only at checkout. Never trust the cart-level check for the actual reservation |
| Checkout idempotency | Server-generated, session-scoped | User-generated UUID | Server-generated — prevents two browser tabs from double-submitting |
Interviewer Q&A
Q: User clicks "Place Order" twice in 1 second — what happens?
The checkout session generates a single `checkoutId` tied to the session when checkout starts. First click: saga begins. Second click: hits the idempotency check in the Payment Service, returns the cached result. Both requests share the same Redis idempotency store. Result: one reservation, one charge, one order — always.
Q: What if the Order Service is down during checkout?
Step 2 fails. The orchestrator catches the exception and releases the inventory reservation. Payment is never called — no charge. User sees "Checkout failed, try again." A circuit breaker surfaces a user-friendly error immediately instead of waiting for timeouts.
Q: How do you handle split payment (gift card + credit card)?
The payment step processes a list of `PaymentInstruments`. Gift card is applied first (reduces remaining amount). Credit card handles the rest. Idempotency keys are `checkoutId:payment:giftcard` and `checkoutId:payment:card` separately. If credit card is declined after the gift card has been charged: compensation credits the gift card balance back.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| Saga | Implements sequential saga | Discusses choreography vs orchestration trade-offs for different sub-flows within checkout |
| Idempotency | Session-level | Cross-device idempotency — same user checking out from phone and laptop simultaneously |
| Failure modes | Handles step failures | Designs DLQ strategy, compensation monitoring, alerts when compensation itself fails |
| Scale | Stateless services + load balancer | Checkout as multi-region active-active, conflict resolution, global idempotency |
Case Study 5: Video Streaming (VOD)
The problem: Upload, transcode, and stream video-on-demand to a global audience with adaptive quality and under 2 seconds to first frame.
Requirements
Functional:
- Upload video content (studio side)
- Transcode to multiple quality levels
- Users browse catalog, stream videos
- Support playback resume ("continue watching")
Non-functional:
- <2s playback start time globally
- Adaptive bitrate — auto-adjusts quality to network speed
- 99.99% availability
- Massive read-heavy workload (watch >> upload)
Clarifying questions to ask:
- "Live streaming or VOD only?" → VOD is a different problem from live
- "Global users?" → yes → CDN strategy becomes critical
- "DRM (content protection) needed?" → yes for licensed content
Architecture
UPLOAD PATH:
Studio ──► Upload Service ──► Raw Video Storage (S3)
│
Transcoding Queue (Kafka/SQS)
│
Transcoding Workers (FFmpeg, auto-scaled)
Output: 360p / 720p / 1080p / 4K (HLS segments)
│
Processed Video Storage (S3) + CDN Origin
PLAYBACK PATH:
User ──► API Gateway ──► Video Metadata Service ──► Metadata DB
│ (manifest URL, title, entitlement check)
│
▼
CDN Edge Node ──[cache hit]──► HLS Segments → User
│[miss]
CDN Origin (S3) ──► Signed URL ──► User
│
[Player requests next segments using ABR — switches quality automatically]
Component Design
HLS Manifest Design
# Master manifest — returned to client first
# Lists all available quality levels and their bandwidth requirements
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=500000,RESOLUTION=640x360
360p/video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=1280x720
720p/video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080
1080p/video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=10000000,RESOLUTION=3840x2160
4k/video.m3u8
# Quality-level manifest (720p/video.m3u8)
# DRM key URI embedded — player fetches decryption key from license server
#EXT-X-TARGETDURATION:6
#EXT-X-KEY:METHOD=AES-128,URI="https://drm.api/key?vid=abc123&token=XYZ"
#EXTINF:6.0,
seg_000.ts ← 6-second chunk
#EXTINF:6.0,
seg_001.ts
...
#EXT-X-ENDLIST
Transcoding Pipeline
// Triggered when raw video upload completes (S3 event → queue → this)
function onVideoUploaded(event):
metadataRepo.updateStatus(event.videoId, PROCESSING)
qualities = [360p, 720p, 1080p, 4k] // each has width, height, target bitrate
for quality in qualities:
job = { jobId: newUUID(), videoId: event.videoId,
inputS3Key: event.s3Key, quality, status: QUEUED }
jobRepo.save(job)
queue.send("transcoding-jobs", job) // picked up by auto-scaled workers
// Called by a worker when one quality level finishes
function onQualityTranscoded(event):
jobRepo.markCompleted(event.jobId)
pendingCount = jobRepo.countPending(event.videoId)
if pendingCount == 0:
generateAndUploadMasterManifest(event.videoId)
metadataRepo.updateStatus(event.videoId, AVAILABLE)
if isProjectedHighTraffic(event.videoId):
cdnWarmingService.warmEdgeNodes(event.videoId)
Playback Service
function getPlaybackInfo(videoId, user):
video = metadataRepo.findById(videoId)
if not entitlement.canWatch(user, video):
throw Forbidden("Subscription required")
// Signed CDN URL — expires in 4 hours, prevents hotlinking
manifestUrl = cdn.signUrl(
"https://cdn.example.com/videos/" + videoId + "/master.m3u8",
ttl: 4h
)
drmToken = licenseService.issueToken(user.id, videoId)
resumePosition = watchHistory.getLastPosition(user.id, videoId) or 0
return { manifestUrl, drmToken, resumePosition,
availableQualities: video.availableQualities }
Watch History and Resume
// Client sends a heartbeat every 10 seconds with current playback position
function updatePosition(userId, videoId, positionSeconds):
redis.set("watch:" + userId + ":" + videoId, positionSeconds, ttl: 30 days)
kafka.send("watch.events", { userId, videoId, positionSeconds }) // async → durable DB
function getLastPosition(userId, videoId):
return redis.get("watch:" + userId + ":" + videoId) // null if never watched
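The cache-first resume path can be made concrete with a minimal sketch. A dict with explicit expiry timestamps stands in for Redis TTLs, and the Kafka publish is omitted; the `now` parameter is an illustrative addition that makes expiry testable.

```python
import time

# Dict + expiry timestamp stands in for Redis with TTL (illustrative only)
_watch: dict = {}                  # (userId, videoId) -> (position_s, expires_at)
TTL_S = 30 * 24 * 3600             # 30 days, matching the pseudocode above

def update_position(user_id: str, video_id: str, position_s: int,
                    now: float = None) -> None:
    now = time.time() if now is None else now
    _watch[(user_id, video_id)] = (position_s, now + TTL_S)
    # real system: also emit the event to Kafka for durable DB storage

def get_last_position(user_id: str, video_id: str, now: float = None) -> int:
    now = time.time() if now is None else now
    entry = _watch.get((user_id, video_id))
    if entry is None or entry[1] < now:   # never watched, or TTL expired
        return 0
    return entry[0]
```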
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Video format | HLS | DASH | HLS for broadest device support (iOS requires it); support both in production if budget allows |
| Segment duration | 2-second segments | 6-second segments | 6s — fewer manifest requests, better for stable connections. Use 2s for live streaming |
| CDN strategy | Single CDN | Multi-CDN | Multi-CDN at scale — route by geography and performance, avoid single CDN outage |
| Transcoding timing | On first view (lazy) | Pre-transcoded upfront | Pre-transcoded — user experience wins; buffering causes abandonment |
| Watch history | DB only | Redis + DB async | Redis first — sub-ms reads for resume; DB for durability via Kafka |
Interviewer Q&A
Q: How does Adaptive Bitrate (ABR) actually work?
The player downloads the master manifest first — a small text file listing all quality levels and their bandwidth requirements. It measures download speed from the time taken to download the last segment. If speed drops, it requests the next segment at a lower quality. Transition is seamless — segments at all qualities have the same duration, so the player just changes which URL it fetches next. All done client-side — the CDN doesn't know or care which quality the user is watching.
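The client-side selection logic is simple enough to sketch. The ladder below mirrors the master manifest earlier in this case study; the 0.8 headroom factor is an illustrative choice (real players use more sophisticated throughput estimators and buffer-based heuristics).

```python
# ABR ladder matching the master manifest above: (required bits/s, variant URL)
LADDER = [
    (500_000,    "360p/video.m3u8"),
    (1_500_000,  "720p/video.m3u8"),
    (4_000_000,  "1080p/video.m3u8"),
    (10_000_000, "4k/video.m3u8"),
]

def pick_variant(measured_bps: float, headroom: float = 0.8) -> str:
    """Highest quality whose bandwidth fits within measured throughput * headroom."""
    usable = measured_bps * headroom
    choice = LADDER[0][1]                  # always fall back to the lowest rung
    for required, url in LADDER:
        if required <= usable:
            choice = url                   # ladder is sorted, so keep the best fit
    return choice
```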
Q: 50 million users hit play at the same time. How do you handle it?
CDN pre-warming: push the first 5 minutes of segments to all edge nodes 30 minutes before the event starts. Add an origin shield (mid-tier cache layer) between CDN and S3 — prevents 50M cache misses hitting S3 simultaneously. Rate limit manifest requests at the edge. Under extreme load, cap new streams at 720p max — reduces bandwidth by 60% while keeping the experience acceptable.
Q: How do you handle geo-restricted content?
Defence in depth at three layers: (1) API level — entitlement check validates user's country against content's allowed regions. (2) CDN level — geo-restriction rules return 403 for restricted countries before the request reaches the service. (3) DRM level — license token is country-scoped; DRM server re-validates at license request time.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| CDN | "Use CloudFront" | Multi-CDN failover, cost optimisation, regional routing strategy |
| Transcoding | Fan-out per quality | ABR ladder optimisation, codec trade-offs (H.264 vs H.265 vs AV1), cost vs quality |
| Scale | "CDN handles it" | Origin shield design, pre-warming strategy, request coalescing on cache misses |
| DRM | "Encrypt segments" | Full DRM design: key rotation, multi-DRM support (Widevine + FairPlay) |
| Recommendations | Out of scope | Sketches recommendation pipeline: collaborative filtering, feature store, A/B testing |
Case Study 6: Fraud Detection
The problem: Score every transaction in real time for fraud risk and block or flag it before payment is processed — with a total budget of under 100ms.
Requirements
Functional:
- Score every transaction inline with the payment flow
- Block high-risk transactions automatically
- Flag medium-risk for async human review
- Learn from confirmed fraud via a feedback loop
Non-functional:
- Latency: <100ms total (user waits for this)
- Recall: High — better to over-flag than miss fraud
- Precision: False positive rate <1% — wrongly blocked orders hurt UX
- Throughput: Handle peak transaction spikes without latency degradation
Clarifying questions to ask:
- "In-house ML or third-party vendor (Kount, Sift)?" → in-house
- "What's the acceptable false positive rate?"
- "Real-time model updates or daily retraining?"
Architecture
Payment Request ──► Fraud Service (inline, <100ms budget)
│
┌──────────┼─────────────────────────────────┐
│ │ │
Rule Engine Feature Store ML Scoring Engine
(<5ms) (Redis, pre-computed) (<90ms)
Hard rules: 30d spend, avg order, Gradient Boosting
velocity, device count, or Logistic Regression
geo, BIN, etc. account age, etc.
│ │ │
└──────────┴─────────────────────────────────┘
│
Decision Engine
Score > 0.9 → BLOCK
Score 0.5–0.9 → FLAG (human review, async)
Score < 0.5 → ALLOW
ASYNC FEEDBACK LOOP:
Confirmed fraud events → Kafka → Feature Store update + Model retraining
Component Design
Rule Engine
HARD_RULES = [
VelocityRule, // card used > 5x in 10 minutes
GeoImpossibilityRule, // purchase in NYC then LA 30 minutes later
HighValueNewAccountRule, // order > $1000 from account < 7 days old
BinCountryMismatchRule, // card BIN country != shipping country
DeviceFingerprintRule // device fingerprint is on blacklist
]
function evaluateRules(ctx):
for rule in HARD_RULES:
result = rule.evaluate(ctx)
if result.isBlock: return RuleResult.block(result.reason)
if result.isFlag: return RuleResult.flag(result.reason)
return RuleResult.pass()
// Example: velocity rule
function VelocityRule.evaluate(ctx):
key = "velocity:card:" + ctx.cardToken
count = redis.incr(key)
if count == 1: redis.expire(key, 10 minutes) // set TTL on first use
if count > 5: return block("Card used " + count + "x in 10 min")
if count > 3: return flag("High velocity")
return pass()
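The velocity rule is runnable with a dict standing in for Redis `INCR`/`EXPIRE`. This is a sketch only: the explicit `now` argument replaces the wall clock to make window expiry testable, and the thresholds match the pseudocode above.

```python
# Fixed-window counter; dict with expiry stands in for Redis INCR + EXPIRE
_counts: dict = {}   # key -> [count, window_expires_at]
WINDOW_S = 600       # 10 minutes

def velocity_check(card_token: str, now: float) -> str:
    key = f"velocity:card:{card_token}"
    entry = _counts.get(key)
    if entry is None or entry[1] < now:     # first use, or window expired
        entry = [0, now + WINDOW_S]         # emulates EXPIRE on first INCR
    entry[0] += 1
    _counts[key] = entry
    if entry[0] > 5:
        return "BLOCK"
    if entry[0] > 3:
        return "FLAG"
    return "PASS"
```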
Feature Store
// Pre-computed user features — must read in <5ms, cache only
function getFeatures(userId):
cached = redis.get("features:user:" + userId)
if cached: return cached
features = featureRepo.findByUserId(userId) or defaultForNewUser()
redis.set("features:user:" + userId, features, ttl: 1h)
return features
// Features pre-computed by a nightly batch job:
UserFeatures = {
avgOrderValueLast30Days,
orderCountLast30Days,
distinctDevicesLast30Days,
distinctShippingAddressesLast30Days,
chargebackRateLast90Days,
accountAgeInDays,
hasVerifiedPhone,
failedPaymentAttemptsLast24h
}
Full Fraud Evaluation
BLOCK_THRESHOLD = 0.90
FLAG_THRESHOLD = 0.50
function evaluate(ctx):
// Step 1: Rule engine — fast, short-circuit obvious fraud (<5ms)
ruleResult = evaluateRules(ctx)
if ruleResult.isBlock: return Decision.block(ruleResult.reason, score: 1.0)
// Step 2: Load pre-computed features from Redis (<5ms)
features = featureStore.getFeatures(ctx.userId)
// Step 3: ML inference (<90ms)
score = mlEngine.score(ctx, features)
// Step 4: Decision
if score >= BLOCK_THRESHOLD:
return Decision.block("ML score: " + score, score)
if score >= FLAG_THRESHOLD or ruleResult.isFlag:
reviewQueue.enqueue(ctx, score)
return Decision.allow(score) // allow but send to human review queue
return Decision.allow(score)
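The decision step is worth isolating, because the subtle part is that a flagged transaction is still *allowed*. A minimal sketch of just that step, using the thresholds from the pseudocode above (the `(decision, sent_to_review)` return shape is illustrative):

```python
BLOCK_THRESHOLD = 0.90
FLAG_THRESHOLD = 0.50

def decide(score: float, rule_flagged: bool = False) -> tuple:
    """Returns (decision, sent_to_review)."""
    if score >= BLOCK_THRESHOLD:
        return ("BLOCK", False)
    if score >= FLAG_THRESHOLD or rule_flagged:
        return ("ALLOW", True)      # allow the sale, review it asynchronously
    return ("ALLOW", False)
```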
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Fraud check placement | Inline (blocks payment) | Async (payment goes through, reviewed after) | Inline for block decisions — can't charge and then dispute. Flag decisions are async |
| ML model type | Deep learning | Gradient Boosting (XGBoost/LightGBM) | Gradient Boosting — explainable, fast inference (<10ms), excellent on tabular data |
| Feature freshness | Real-time computation | Pre-computed store (1h stale) | Pre-computed — real-time DB queries in the fraud hot path would blow the 100ms budget |
| False positive handling | Block and explain | Flag for review + allow | Flag + allow for medium risk — better to review async than lose the sale |
| Model updates | Online learning | Batch retraining (daily) | Batch daily — simpler, stable, easier to validate; online learning risks model poisoning |
Interviewer Q&A
Q: Why use a rule engine at all if you have ML?
They're complementary. Rules are deterministic, fast, and auditable — a card used in three countries in one hour doesn't need a model. Rules are also explainable to customers who dispute a declined transaction. ML catches subtle patterns rules miss — unusual combinations of signals that each look normal individually. Rules also act as a safety net if the ML model drifts or has a bug. In practice: rule engine short-circuits first to save ML inference cost on clear-cut cases; ML handles the ambiguous middle.
Q: What if fraudsters learn your thresholds and stay just under them?
Thresholds are internal and slightly randomised. ML adapts — even if fraudsters evade one rule, their overall behaviour pattern still looks anomalous across 50+ features. Dynamic thresholds adjust based on recent fraud patterns. The feedback loop from human review continuously retrains the model.
Q: How do you handle a legitimate user incorrectly blocked?
Score 0.5–0.9 is flagged for async human review, not blocked — so the transaction still goes through. For hard blocks, customer service has a verification flow (OTP/ID check → manual approval). After review clears: update the user's feature store to reduce future false positives. Track false positive rate per rule — if any rule exceeds 2%, tune the threshold.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| Rule engine | Implements rules | Rule versioning, A/B testing new rules, safe rollback strategy |
| ML | "Use a model" | Model drift detection, champion/challenger deployment, explainability requirements |
| Features | Lists features | Feature leakage prevention, train/serve skew, feature monitoring in production |
| Scale | "Pre-compute features" | Stream processing pipeline (Flink), feature freshness SLAs, backfill on schema change |
| Feedback loop | "Retrain daily" | Full MLOps: labelling, validation, shadow mode deployment, gradual rollout |
Case Study 7: Notification System
The problem: Deliver transactional notifications (OTP, order confirmation) in under 5 seconds and promotional campaigns at scale across SMS, push, and email — without spamming users or losing critical messages.
Requirements
Functional:
- Send order confirmations, shipping updates, promo campaigns
- Channels: SMS, push notification, email
- User preferences: opt-out per channel per notification type
- Retry on failure, track delivery status
Non-functional:
- Transactional (OTP, order confirm): <5s delivery
- Promotional campaigns: best-effort, delay is acceptable
- At-least-once delivery (idempotent send)
- Rate limiting — no notification spam per user
Clarifying questions to ask:
- "Do we need read receipts / delivery receipts?"
- "Multi-language / localisation support?"
- "What's the peak marketing campaign size?" — drives queue sizing
Architecture
Event Sources:
Order Service ──────────────┐
Shipping Service ───────────┤
Marketing Campaign ─────────┤
▼
Kafka topic: notification.requested
(partitioned by user_id for ordering)
│
Notification Service
1. Deduplication check
2. Load user preferences
3. Rate limit check
4. Route by channel and priority
│
┌───────────────────┼──────────────────────┐
│ │ │
SMS Worker Push Worker Email Worker
(Twilio/SNS) (FCM/APNs) (SES/SendGrid)
│ │ │
└───────────────────┼──────────────────────┘
│
Delivery Status DB + Dead-Letter Queue
Component Design
Notification Service
function processNotification(request):
dedupKey = request.userId + ":" + request.eventId + ":" + request.channel
// 1. Deduplication — skip if already processed
if dedup.exists(dedupKey): return
// 2. User preferences — skip if opted out
prefs = userPrefs.get(request.userId)
if not prefs.isEnabled(request.channel, request.notificationType): return
// 3. Rate limiting — skip for CRITICAL priority (OTP, order confirm)
if request.priority != CRITICAL:
if not rateLimiter.tryAcquire(request.userId, request.channel):
requeueWithDelay(request, delay: 30min)
return
// 4. Mark as processing (dedup window = 24h)
dedup.mark(dedupKey, ttl: 24h)
record = statusRepo.save({ ...request, status: PENDING })
// 5. Route to channel worker and track result
try:
channelRouter.send(request, record.id)
statusRepo.update(record.id, SENT)
catch ChannelError as e:
statusRepo.update(record.id, FAILED)
dlq.enqueue(request, record.id, e.message) // retry up to 3x with backoff
Rate Limiter (Fixed-Window Counter in Redis)
HOURLY_LIMITS = { EMAIL: 3, SMS: 2, PUSH: 10 }
function tryAcquire(userId, channel):
key = "ratelimit:" + userId + ":" + channel + ":" + currentHour()
count = redis.incr(key)
if count == 1: redis.expire(key, 2h) // TTL is cleanup only; the hour in the key defines the window
return count <= HOURLY_LIMITS[channel]
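The pseudocode above is an `INCR` + `EXPIRE` counter, i.e. a fixed-window limiter. A runnable sketch, with a dict standing in for Redis and `currentHour()` made an explicit argument so window rollover is testable (both are illustrative simplifications):

```python
# Fixed-window limiter; dict stands in for Redis INCR + EXPIRE
HOURLY_LIMITS = {"EMAIL": 3, "SMS": 2, "PUSH": 10}
_buckets: dict = {}   # (userId, channel, hour) -> count

def try_acquire(user_id: str, channel: str, hour: int) -> bool:
    key = (user_id, channel, hour)
    _buckets[key] = _buckets.get(key, 0) + 1   # count this attempt
    return _buckets[key] <= HOURLY_LIMITS[channel]
```

The known weakness of fixed windows is the boundary burst (up to 2x the limit straddling an hour rollover); a sliding-window or true token-bucket variant fixes that at the cost of a Lua script or sorted set in Redis.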
Priority Topic Separation
// Critical — OTP, order confirmations
// Dedicated consumer group, not shared with promotional traffic
topic: notifications.critical
partitions: 20 ← high parallelism for <5s delivery
retention: 24h
// Promotional — lower priority, processed off-peak
topic: notifications.promotional
partitions: 5
retention: 7 days
Trade-offs
| Decision | Option A | Option B | Winner & Why |
|---|---|---|---|
| Delivery guarantee | At-most-once | At-least-once | At-least-once + dedup — OTP must be delivered; dedup prevents double-send |
| Priority handling | Single topic, field-based routing | Separate topics per priority | Separate topics — dedicated consumer groups; promo traffic never delays a critical OTP |
| Rate limiting | Per-user per-hour | Global per-channel | Both — per-user prevents spam; global protects SMS gateway API rate limits |
| Failed notifications | Drop after N retries | Dead-letter queue + alert | DLQ always — OTP must never be silently dropped; alert ops when DLQ depth grows |
Interviewer Q&A
Q: A marketing campaign sends 50 emails to one user in an hour. How do you prevent this?
Rate limiter enforces max 3 emails/user/hour via a Redis fixed-window counter. Campaign Service pre-computes a delivery schedule respecting rate limits before publishing to Kafka — instead of pushing 100M events simultaneously, it spreads them over hours. If a rate limit is hit, the notification is requeued with a delay, not dropped.
Q: How do you guarantee OTP delivery in under 5 seconds?
Dedicated `notifications.critical` topic with 20 partitions and its own consumer group, completely separate from promotional traffic. Rate limiting is bypassed for `CRITICAL` priority. Retry at 2 seconds on failure, max 3 attempts before DLQ + ops alert. Multi-gateway SMS fallback: if the primary provider is slow, auto-route to a backup.
Q: How do you track whether a push notification was opened?
When the user taps the notification, the app fires a `notification.opened` event with the `notificationId` embedded in the push payload. We correlate and update the `DeliveryRecord` status from `SENT` → `OPENED`. FCM/APNs also return receipt callbacks — `DELIVERED` and `UNREGISTERED` for dead device tokens — which we consume via webhooks and record in status.
Senior vs Staff Bar
| Area | Senior SWE | Staff SWE |
|---|---|---|
| Reliability | Rate limit + retry | Full reliability budget: per-channel SLA, alert thresholds, DLQ runbook |
| Scale | Kafka consumers | Consumer lag monitoring, partition rebalancing, back-pressure handling |
| Personalisation | User prefs check | ML-based send-time optimisation — send when the user is most likely to engage |
| Compliance | Opt-out check | TCPA for SMS, CAN-SPAM for email, GDPR right-to-erasure |
| Observability | Status DB | Metrics: delivery rate per channel, p99 latency, bounce rate, DLQ depth |
Master Reference
The HLD Interview Formula
1. Requirements (2 min)
→ Functional: what it does
→ Non-functional: consistency model, scale numbers, latency, availability
→ Ask 2–3 clarifying questions before drawing anything
2. High-level architecture (5 min)
→ Boxes: client, API gateway, services, DBs, cache, event bus
→ Label the data flows
3. Deep dive on the hardest part (15 min)
→ Pick the most interesting component and go deep
→ Show pseudocode if it clarifies the logic
4. Trade-offs (5 min)
→ "I chose X over Y because..."
→ Acknowledge the downside of your choice
5. Failure modes (5 min)
→ "What happens when [service] goes down?"
→ "What happens at 10x load?"
Consistency Decisions — Never Get This Wrong
| Scenario | Model | Rationale |
|---|---|---|
| Checkout: reserve inventory | Strong | No overselling — ever |
| Checkout: payment processing | Strong | No double-charge — ever |
| Checkout: order confirmation | Strong | Order must exist before taking money |
| Product catalog / search | Eventual | 30s stale is fine for browsing |
| Cart contents | Eventual | Session-scoped; user can refresh |
| Fraud feature store | Eventual (1h) | Behaviour signals don't change in seconds |
| Notification delivery | At-least-once | Dedup at consumer; never drop an OTP |
| Watch history / resume | Eventual | 10s stale resume position is imperceptible |
Key Design Patterns
| Pattern | When to Use | Core Idea |
|---|---|---|
| Optimistic Locking | Concurrent writes to the same row | version field in entity; DB rejects stale writes |
| Outbox Pattern | Guarantee Kafka publish after DB write | Write to outbox table atomically; background worker publishes |
| Idempotency Key | Payment, checkout, notifications | Cache result keyed by UUID; return same response on retry |
| Circuit Breaker | Calls to external gateways | Fail fast when downstream is slow; restore automatically |
| Saga (Orchestration) | Multi-service transaction | Central orchestrator; compensating actions on each failure |
| Token Bucket / Fixed Window | Rate limiting | Redis INCR + EXPIRE gives a fixed window; a Lua script gives a true token bucket |
| Cache-Aside | Read-heavy, occasionally updated data | Check cache → miss → read DB → populate cache |
| Scheduled Job | TTL cleanup, reconciliation | Beats DB triggers — observable, testable, easy to monitor |
High Traffic Event Checklist
When the interviewer asks "How do you handle 10x or 100x peak load?":
- Pre-warm cache the night before — hot SKUs, user features, popular content segments
- Pre-scale read replicas, gateway connections, and Kafka consumer groups before the window opens
- Circuit breakers on all downstream calls — fail fast, don't cascade
- Load shedding — a `503` + `Retry-After` header beats a cascading failure
- Feature flags — kill non-critical features (recommendations, personalisation) under extreme load
- Freeze deploys 48 hours before — canary deploy any changes well in advance
- War room — dedicated on-call team, pre-staged runbooks, clear escalation paths