Building Blocks
The components you reach for once NFRs tell you what the system needs. Each entry covers what it is, the problem it solves, which NFR triggers it, and a real system example.
Caching
Reduce latency and DB load by serving frequently read data from faster storage layers closer to the application.
Redis Cache
Stores hot data in memory for sub-millisecond reads. Eliminates repeated DB queries for the same data.
NFR triggers: Low latency, read-heavy scale.
Cache patterns — pick one per use case:
| Pattern | How It Works | Use When |
|---|---|---|
| Cache-aside (lazy) | App checks cache first. On miss, reads DB, writes to cache. | General reads. Cache only what's actually requested. |
| Write-through | On every write, app writes to cache AND DB together. | Read-heavy data that must stay fresh (user profiles). |
| Write-behind (write-back) | App writes to cache only. Cache flushes to DB asynchronously. | Write-heavy. Risk: data loss if cache crashes before flush. |
| Read-through | Cache sits in front of DB. Cache fetches from DB on miss automatically. | Simplifies app code. Cache layer handles DB reads. |
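The cache-aside and write-through patterns from the table can be sketched in a few lines. This is a minimal illustration, not a production client — `db` is a plain dict standing in for any slow backing store, and the class name is hypothetical:

```python
import time

class CacheAside:
    """Cache-aside sketch: check cache first; on miss, read DB and
    populate. write_through shows the write-through variant: DB and
    cache updated in the same code path."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db                      # dict stand-in for the database
        self.ttl = ttl_seconds
        self.cache = {}                   # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]               # cache hit
        value = self.db[key]              # cache miss: read from DB
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def write_through(self, key, value):
        # Write-through: update DB AND cache together
        self.db[key] = value
        self.cache[key] = (value, time.time() + self.ttl)

db = {"user:1": {"name": "Ada"}}
c = CacheAside(db)
assert c.get("user:1")["name"] == "Ada"   # miss -> DB read, now cached
db["user:1"] = {"name": "Grace"}          # DB changed behind the cache
assert c.get("user:1")["name"] == "Ada"   # stale until TTL or invalidation
c.write_through("user:1", {"name": "Grace"})
assert c.get("user:1")["name"] == "Grace"
```

The stale read in the middle is the core tradeoff of lazy caching: without active invalidation, the cache serves old data until the TTL expires.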
Cache eviction policies:
- LRU (Least Recently Used) — evict the item not accessed for the longest time. Default for most use cases.
- LFU (Least Frequently Used) — evict the item accessed least often. Better for skewed access patterns.
- TTL (Time To Live) — expire entries after a fixed time. Good for data that goes stale (search results, session tokens).
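LRU is simple enough to sketch directly — an ordered map tracks access recency, and the oldest entry is evicted when capacity is exceeded (a minimal illustration; Redis uses an approximated LRU internally):

```python
from collections import OrderedDict

class LRUCache:
    """LRU eviction sketch: OrderedDict keeps access order; evict the
    entry not accessed for the longest time when over capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")            # "a" is now most recently used
cache.put("c", 3)         # evicts "b", the least recently used
assert cache.get("b") is None
assert cache.get("a") == 1 and cache.get("c") == 3
```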
Cache invalidation strategies:
Most production systems combine approaches — short TTL as a safety net plus active invalidation for critical data.
| Strategy | How It Works | Best For |
|---|---|---|
| TTL expiry | Entry automatically expires after a fixed time. Simple, no coordination needed. | Data that can tolerate some staleness (search results, recommendation scores) |
| Write-through | App deletes or updates cache immediately when it writes to DB, in the same code path. | User profiles, inventory counts — data that must stay fresh |
| Event-driven | CDC (Debezium) reads DB WAL, publishes change events to a queue, cache invalidation consumer acts on them. Clean separation of concerns. Better than DB triggers (which are hidden, hard to debug, tightly coupled to DB). | Cross-service invalidation, complex dependency graphs |
| Tagged invalidation | Cache entries are tagged (e.g. user:123:posts). On update, invalidate all entries sharing that tag. | Complex dependency patterns — a post update that should invalidate feed caches, search caches, and profile caches |
| Versioned cache keys | Include a version number in the key (event:123:v42). On update, increment version in DB → new key (event:123:v43). Old entries become unreachable, not deleted. No race conditions — a late writer can't overwrite new data because the DB forces a new version. Old entries expire naturally via TTL. | Entity-level data (product pages, event details) where you need immediate consistency without invalidation broadcasts |
| Deleted items cache | Maintain a small fast cache of recently deleted/modified item IDs. When serving feeds or lists, filter against this set before returning results. Lets you serve mostly-correct cached data while the larger cache structures are invalidated in the background. | Content moderation, privacy changes, soft deletes |
TTL tiering — driven by staleness requirements: Let your NFRs dictate TTLs. If requirements say "search results can be 30 seconds stale," that's your TTL. Static data (venue info, product descriptions) → long TTL (hours). Volatile data (seat availability, inventory counts) → short TTL (seconds). User profiles → medium TTL (5 minutes). Don't pick arbitrary TTLs — derive them from consistency requirements.
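The versioned-cache-key strategy from the table can be sketched as follows. Assumptions worth stating: the DB row carries a version column, and in this toy version the lookup reads the version from the "DB" dict on every request — in practice the version itself would live in a small hot cache or be embedded in the reference that points at the entity:

```python
class VersionedCache:
    """Versioned-key sketch: a write bumps the version in the DB, so
    readers build keys pointing at fresh data (event:123:v43). Old
    entries are unreachable, not deleted — they age out via TTL.
    Class and key names are illustrative."""

    def __init__(self):
        self.db = {}      # id -> (version, value); DB forces the version
        self.cache = {}   # "id:vN" -> value

    def write(self, item_id, value):
        version = self.db.get(item_id, (0, None))[0] + 1
        self.db[item_id] = (version, value)

    def read(self, item_id):
        version, value = self.db[item_id]   # version lookup (hot path)
        key = f"{item_id}:v{version}"
        if key not in self.cache:
            self.cache[key] = value         # populate under the new key
        return self.cache[key]

vc = VersionedCache()
vc.write("event:123", "doors 7pm")
assert vc.read("event:123") == "doors 7pm"
vc.write("event:123", "doors 8pm")          # bump -> new key, no broadcast
assert vc.read("event:123") == "doors 8pm"  # old entry simply unreachable
```

Note there is no invalidation step anywhere — the write never touches the cache, which is exactly why a late writer can't clobber fresh data.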
When NOT to cache:
- Real-time collaborative systems — Google Docs, multiplayer games. Every change must be immediately visible to all users. Caching actively hurts here.
- Strongly consistent financial data — account balances, inventory when overselling is catastrophic. Use caching carefully with aggressive invalidation or avoid entirely.
- Write-heavy systems — if your read/write ratio is 1:1 or 2:1 (e.g. Uber driver location updates), cache hit rates will be too low to justify the complexity.
- User-specific private data — private messages, personal preferences. Only one user ever requests them, so there's no cache hit rate benefit. Don't put these in a CDN.
Read/write ratio context: Content-heavy apps typically see 10:1 to 100:1 reads-to-writes. Instagram: millions read a post for every one posted. YouTube: billions of views vs millions of uploads. The higher the ratio, the more aggressively you should cache. Write-heavy systems (location tracking, IoT telemetry) don't benefit from the same strategies.
Single-threaded design: Commands execute one at a time, in order. This makes Redis easy to reason about — no locks, no concurrent writes to the same key, and commands like INCR are atomic by default (increment happens in a single step, no two clients can interleave). Redis 6+ added I/O threads for network reads/writes, but command execution remains sequential. The bottleneck is almost never the single thread — it's network round trips, which pipelining solves.
Persistence modes — pick based on durability requirements:
- No persistence (pure cache): DB is the source of truth. Redis crash = cache gone, app rebuilds on demand from DB. Simplest, fastest.
- RDB snapshots: Redis dumps memory to disk every N seconds. Fast recovery on restart ("warm" data), but writes since last snapshot are lost.
- AOF (Append-Only File): Every write logged to disk. appendfsync everysec (default) limits data loss to ~1 second. appendfsync always (every command) = near-zero data loss but slower throughput.
- Replicas: Primary handles writes, replicas handle reads. If primary fails, promote a replica. Data loss = at most milliseconds of lag.
Scaling:
- Reads: Add replicas. Primary takes all writes; replicas serve read traffic.
- Writes — client-side sharding: App hashes the key and routes to one of several independent Redis nodes. Tools like Ketama use consistent hashing so adding/removing a node only remaps a fraction of keys (not all of them).
- Writes — Redis Cluster: Automatic sharding and failover across nodes. More operational complexity but no changes needed in the app.
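The client-side sharding bullet can be made concrete with a small consistent-hash ring in the Ketama style: each node is placed at many virtual points on a ring, and a key routes to the first node clockwise from its hash. A sketch, with illustrative node names:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hashing sketch: adding a node only remaps the keys
    that fall near its ring points, not all keys."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                        # sorted (point, node) pairs
        for node in nodes:
            for i in range(vnodes):           # vnodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring3 = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
before = {k: ring3.node_for(k) for k in (f"user:{i}" for i in range(1000))}

# Add a fourth node: only ~1/4 of keys should move, not all of them
ring4 = ConsistentHashRing(["redis-1", "redis-2", "redis-3", "redis-4"])
moved = sum(1 for k, n in before.items() if ring4.node_for(k) != n)
assert 0 < moved < 500
```

With naive modulo hashing (`hash(key) % node_count`), adding a node would remap nearly every key and wipe out the cache; the ring keeps the blast radius to roughly `keys / node_count`.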
Limits:
- Single instance: up to ~100GB RAM (practical sweet spot: 10–64GB)
- Throughput: ~100K–1M ops/sec
- Max key or value size: 512MB
- Max keys per instance: ~4 billion
Real example: Twitter precomputes home timelines into Redis. A tweet read is a Redis list lookup, not a DB join across followers.
Cache Stampede (Thundering Herd on Expiry)
When a popular cache entry expires, all concurrent requests see a cache miss simultaneously and race to rebuild it from the DB. At 100K reads/sec, that's 100K identical DB queries in the same instant — the DB collapses under a self-inflicted DDoS from your own application. This is distinct from a Redis crash (Scenario 1 in What-If) — the cache is healthy, but one entry's TTL expired.
Three mitigations:
1. Probabilistic early refresh — serve cached data while background-refreshing before TTL expires. As the entry ages, each request has a small increasing probability of triggering a background refresh. At 50 minutes into a 60-minute TTL: 1% chance of refresh. At 58 minutes: 10%. At 59 minutes: 20%. By expiry, the entry has already been refreshed. No stampede ever occurs. This is the cleanest solution for most cases.
2. Background refresh process — a dedicated job continuously refreshes critical cache entries before they expire. Homepage cache refreshes every 50 minutes on a 60-minute TTL. Guarantees the entry never goes cold. Cost: infrastructure complexity and wasted refreshes for entries nobody requested. Worth it for your highest-traffic keys.
3. Request coalescing — when multiple requests see the same cache miss, only one fetches from the DB; the rest wait for that result. The key insight: even with millions of users hitting the same key, your DB only receives N requests — one per app server doing the coalescing. Not a full solution (the rebuilding request still hits DB) but dramatically reduces the blast radius.
Cache miss for key "celebrity:123"
→ App server 1: acquires in-flight lock, fetches from DB, populates cache
→ App server 2: sees in-flight lock, waits for app server 1's result
→ Result shared to all waiters — DB hit exactly once per app server
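The in-flight-lock flow above can be sketched with a per-key event: the first caller (the leader) fetches; concurrent callers wait on the event and share the leader's result. A minimal single-process sketch — `slow_fetch` is a stand-in for the DB query:

```python
import threading
import time

class Coalescer:
    """Request-coalescing sketch: one fetch per key in flight;
    concurrent callers for the same key wait and share the result."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.lock = threading.Lock()
        self.inflight = {}                # key -> (Event, result holder)

    def get(self, key):
        with self.lock:
            entry = self.inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self.inflight[key] = entry
        event, holder = entry
        if leader:
            holder["value"] = self.fetch(key)   # the one DB hit
            with self.lock:
                del self.inflight[key]
            event.set()
        else:
            event.wait()                        # share the leader's result
        return holder["value"]

db_hits = []
def slow_fetch(key):
    db_hits.append(key)
    time.sleep(0.2)                       # simulate a slow DB query
    return f"value-for-{key}"

c = Coalescer(slow_fetch)
results = []
threads = [threading.Thread(target=lambda: results.append(c.get("celebrity:123")))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
assert all(r == "value-for-celebrity:123" for r in results)
assert len(db_hits) < 10                  # ~1 DB hit instead of 50
```

Run across a fleet, each app server coalesces its own requests — hence "one DB hit per app server" rather than one per user.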
Hot key fanout — for extreme loads where even request coalescing isn't enough: store identical copies under multiple keys (feed:taylor-swift:1, feed:taylor-swift:2, … feed:taylor-swift:10). Readers randomly pick one. Spreads 500K req/sec across 10 keys at 50K each. Tradeoff: memory × N copies, and invalidation must clear all replicas.
In an interview: Mention probabilistic early refresh or background refresh for write-through caches on critical data. Mention request coalescing when discussing how you'd handle a celebrity post going viral.
Redis Sorted Sets (Leaderboards, Rankings)
A set where every member has a numeric score. Range queries by score are O(log n).
NFR triggers: Low latency ranked queries.
Operations: ZADD, ZRANGE, ZRANK, ZREVRANGE
Limits:
- Max members per sorted set: ~4.3 billion
- Memory per member: ~50–100 bytes overhead + key + value size
- At 10M members: ~500MB–1GB RAM — still comfortable on a typical instance
- All operations O(log n) — fast even at millions of members
Real example: Gaming leaderboard — score is points, member is user ID. ZREVRANGE leaderboard 0 9 returns top 10 instantly.
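The sorted-set operations map cleanly onto a toy leaderboard. Redis keeps these O(log n) with a skip list; this sketch just sorts on read to show the API shape (names are illustrative):

```python
class Leaderboard:
    """Sorted-set sketch: score per member, ranked reads mirroring
    ZADD / ZREVRANGE / (descending) ZRANK."""

    def __init__(self):
        self.scores = {}                  # member -> score

    def zadd(self, member, score):
        self.scores[member] = score       # add or update the score

    def zrevrange(self, start, stop):
        # Highest score first, like ZREVRANGE leaderboard start stop
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        return [m for m, _ in ranked[start:stop + 1]]

    def zrank_desc(self, member):
        # Position in descending order ("you are ranked #N")
        return self.zrevrange(0, len(self.scores) - 1).index(member)

lb = Leaderboard()
for user, points in [("u1", 120), ("u2", 340), ("u3", 95), ("u4", 340)]:
    lb.zadd(user, points)
assert set(lb.zrevrange(0, 1)) == {"u2", "u4"}   # top 2 by score
assert lb.zrank_desc("u3") == 3                  # lowest score, last place
```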
Redis Pub/Sub
Publisher sends a message to a channel. All subscribers to that channel receive it instantly. Fire-and-forget — no persistence, no queue.
NFR triggers: Real-time notifications, fan-out to multiple consumers.
How it works:
Publisher → PUBLISH channel "message"
Subscriber → SUBSCRIBE channel → receives "message"
Limitation: Messages are lost if no subscriber is listening at that moment. Not durable. Use Kafka if you need delivery guarantees.
Limits:
- Throughput: millions of messages/sec — very fast, no disk I/O
- Channels: unlimited (memory bound — each active channel with subscribers uses RAM)
- No message history: a subscriber joining late gets nothing from before they subscribed
- No persistence: Redis crash = all in-flight messages lost
- Not suitable for > 10K active channels with high-volume subscribers without careful memory planning
Real example: Chat app — when a user sends a message, publish to a channel named after the chat room. All connected users subscribed to that channel receive it instantly.
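The fire-and-forget semantics are easy to demonstrate in-process: subscribers are callbacks keyed by channel, and a message published with no one listening is simply gone — exactly the durability gap that pushes you toward Kafka. A sketch:

```python
from collections import defaultdict

class PubSub:
    """Pub/Sub sketch: instant fan-out to current subscribers only.
    No history, no persistence — late subscribers get nothing."""

    def __init__(self):
        self.channels = defaultdict(list)  # channel -> [callback, ...]

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        subs = self.channels.get(channel, [])
        for cb in subs:
            cb(message)                    # delivered to listeners *now*
        return len(subs)                   # PUBLISH returns receiver count

bus = PubSub()
assert bus.publish("room:42", "early") == 0     # nobody listening: lost
inbox_a, inbox_b = [], []
bus.subscribe("room:42", inbox_a.append)
bus.subscribe("room:42", inbox_b.append)
assert bus.publish("room:42", "hello") == 2     # fan-out to both
assert inbox_a == ["hello"] and inbox_b == ["hello"]  # "early" never arrived
```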
CDN (Content Delivery Network)
Caches static and cacheable content at edge nodes geographically close to users.
NFR triggers: Low latency for global users, high scale reads of static content.
What it caches: Images, JS/CSS, video segments, API responses with cache headers.
Cache hit: Served from edge node ~5–20ms from user. No origin request. Cache miss: Fetches from origin, caches for next time.
Real example: Netflix stores video chunks in CDN edge nodes. 95%+ of video traffic never hits Netflix origin servers.
Messaging & Queues
Decouple producers from consumers and absorb write bursts so a slow downstream never blocks or overwhelms the rest of the system.
Kafka (Message Queue / Event Stream)
Durable, distributed log of messages. Producers append to a topic. Consumers read at their own pace. Messages are retained (not deleted on consume).
NFR triggers: Write-heavy scale, async processing, fault tolerance, event replay.
Key concepts:
- Topic — named stream of messages (e.g. user-events)
- Partition — topics are split into partitions for parallelism. Messages in a partition are ordered.
- Consumer group — multiple consumers share a topic, each partition assigned to one consumer. Scale consumers independently.
- Offset — each message has an offset. Consumers track their position. Can replay from any offset.
- Retention — messages kept for N days regardless of consumption. Enables replay and audit.
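Offsets and retention are the concepts that most distinguish Kafka from a traditional queue, and they fall out naturally if you model a partition as an append-only list. A sketch (a single partition; names are illustrative):

```python
class Partition:
    """Kafka-partition sketch: an append-only log where each consumer
    group tracks its own offset. Messages are retained, not deleted on
    consume, so any group can replay from any offset."""

    def __init__(self):
        self.log = []                     # offset = index in the list
        self.offsets = {}                 # group_id -> next offset to read

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1          # offset of the appended message

    def consume(self, group_id, max_messages=10):
        start = self.offsets.get(group_id, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group_id] = start + len(batch)  # commit new position
        return batch

    def seek(self, group_id, offset):
        self.offsets[group_id] = offset   # replay from any retained offset

p = Partition()
for evt in ["pay:1", "pay:2", "pay:3"]:
    p.produce(evt)
assert p.consume("webhooks") == ["pay:1", "pay:2", "pay:3"]
assert p.consume("analytics") == ["pay:1", "pay:2", "pay:3"]  # independent offset
p.seek("webhooks", 1)
assert p.consume("webhooks") == ["pay:2", "pay:3"]            # replay
```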
acks setting and durability:
- acks=0 — fire and forget. Fastest. No guarantee.
- acks=1 — leader confirms. Fast. Leader crash = message lost.
- acks=all — all in-sync replicas confirm. Slowest (+2–5ms). No data loss.
Limits:
- Throughput: millions of messages/sec per broker. LinkedIn processes 7 trillion messages/day across their cluster.
- Message size: default 1MB, configurable up to 10MB+ (large messages reduce throughput significantly)
- Partitions: practical limit ~4K per broker — more partitions = more memory and slower failover
- Consumer lag: consumers can fall hours or days behind and catch up without message loss (within retention window)
- Retention: configurable by time (days/weeks) or size (GB per partition)
- Latency: typically 5–15ms end-to-end at acks=1. Up to 10–30ms at acks=all.
Exactly-once semantics (critical for financial systems):
Without exactly-once, a consumer crash after processing but before committing its offset causes reprocessing on restart — double payments. Kafka solves this with:
- Idempotent producers: Producer assigns a sequence number to each message. Broker deduplicates retries. No duplicate writes even on retry.
- Transactional API: Atomic read-process-write. Offset commit and output write happen in one transaction. Either both succeed or neither does.
- Result: Each payment event processed exactly once — not duplicated, not lost.
Consumer groups per service (independent offsets):
Each service (reconciliation, webhook delivery, analytics) has its own consumer group with a unique group_id. This means every service receives all events independently — they don't compete. Each maintains its own offset (position in the log). Reconciliation service being slow doesn't affect webhook delivery.
Within a consumer group, each partition is assigned to exactly one consumer instance. Max parallelism = partition count. To scale: add consumer instances up to partition count. If a consumer fails, Kafka auto-rebalances partitions to remaining instances.
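The partition-to-consumer rule can be shown with a tiny assignment function. This is a simplified round-robin sketch, not Kafka's actual assignor protocol (which also handles sticky rebalancing):

```python
def assign_partitions(partitions, consumers):
    """Assignment sketch: each partition goes to exactly one consumer
    in the group, so max parallelism = partition count."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        owner = consumers[p % len(consumers)]   # spread round-robin
        assignment[owner].append(p)
    return assignment

# 6 partitions, 2 consumers -> 3 partitions each
assert assign_partitions(6, ["c1", "c2"]) == {"c1": [0, 2, 4], "c2": [1, 3, 5]}

# A consumer fails: rebalance over the remaining instance
assert assign_partitions(6, ["c1"]) == {"c1": [0, 1, 2, 3, 4, 5]}

# More consumers than partitions: the extras sit idle
assert assign_partitions(2, ["c1", "c2", "c3"])["c3"] == []
```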
Real example: Uber publishes every GPS location update to Kafka. Multiple consumers (routing, surge pricing, analytics) read independently at their own pace. Stripe uses Kafka with consumer groups per service — webhook delivery, reconciliation, and analytics each consume the same payment events independently.
Write-Ahead Log (WAL)
Before any data change is applied, the change is written to an append-only log on disk. On crash, replay the log to recover.
NFR triggers: Durability (RPO = seconds), crash recovery.
How it works: Every DB write first appends to WAL. If the server crashes mid-write, the WAL is replayed on restart. No data loss beyond what's in the log.
Also used for replication: PostgreSQL streams its WAL to replicas. Replicas replay the log to stay in sync. This is how read replicas work internally.
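The append-then-apply discipline and crash recovery can be sketched in a dozen lines. `wal` here is a list standing in for the on-disk log; a real system would fsync each append:

```python
class WALStore:
    """Write-ahead-log sketch: every change is appended to the log
    before being applied in memory; recovery replays the log."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []  # durable, append-only
        self.data = {}                             # in-memory state
        for key, value in self.wal:                # crash recovery: replay
            self.data[key] = value

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to WAL first (fsync here)
        self.data[key] = value          # 2. then apply the change

store = WALStore()
store.put("balance:alice", 100)
store.put("balance:alice", 70)

# Simulate a crash: memory is gone, only the WAL survives on disk
recovered = WALStore(wal=store.wal)
assert recovered.data["balance:alice"] == 70   # replay restores state
```

The same replay mechanism is what a PostgreSQL replica does continuously with the streamed WAL — recovery and replication are the same operation applied at different times.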
Real example: PostgreSQL, MySQL InnoDB, Kafka itself — all use WAL for durability.
Real-time Communication
When the client needs to receive data pushed from the server (chat, notifications, live scores), you have four options. Pick based on latency needs, direction, and infrastructure cost.
| Technique | Direction | Connection | Latency | Use When |
|---|---|---|---|---|
| Short Polling | Client → Server (repeated) | New HTTP request every N seconds | High (N seconds) | Simple. OK for non-urgent updates (email inbox refresh). |
| Long Polling | Client → Server (held open) | Client sends request, server holds it open until data is ready, then responds | Medium (~seconds) | Better than short polling. No persistent connection. |
| Server-Sent Events (SSE) | Server → Client (one-way) | Single persistent HTTP connection, server pushes events | Low | Notifications, live feeds. Simple to implement. No bidirectional needed. |
| WebSockets | Bidirectional | Persistent TCP connection, both sides can send anytime | Very low (~ms) | Chat, multiplayer gaming, collaborative editing, live trading. |
Short Polling: Client polls every 5 seconds. Simple but wasteful — most requests return nothing.
Long Polling: Client sends request. Server holds it open (up to 30s). When data arrives, server responds. Client immediately sends the next request. Simulates push without a persistent connection.
SSE: HTTP connection stays open. Server pushes data: ... events whenever it wants. Client uses EventSource API. One-way only (server → client). Simpler than WebSockets for notifications.
SSE wire format — each event is plain text:
data: {"seat": "A12", "status": "reserved"}\n\n ← basic event
id: 42\n\n ← lets client send Last-Event-ID on reconnect
event: seat-update\n\n ← named event type
retry: 3000\n\n ← tell client to wait 3s before reconnecting
Two newlines (\n\n) end each event. EventSource auto-reconnects and sends Last-Event-ID so the server can resume from where the client left off.
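A small formatter makes the wire format concrete — note that id/event/retry and data belong to a single event block, sharing one terminating blank line (the function name is illustrative):

```python
def sse_event(data, event_id=None, event_type=None, retry_ms=None):
    """Formats one Server-Sent Event: optional id/event/retry fields,
    then the data line, terminated by a blank line (\n\n)."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event_type is not None:
        lines.append(f"event: {event_type}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    lines.append(f"data: {data}")
    return "\n".join(lines) + "\n\n"

msg = sse_event('{"seat": "A12", "status": "reserved"}',
                event_id=42, event_type="seat-update", retry_ms=3000)
assert msg == ('id: 42\n'
               'event: seat-update\n'
               'retry: 3000\n'
               'data: {"seat": "A12", "status": "reserved"}\n\n')
```

On reconnect, the browser sends the last seen `id` as the `Last-Event-ID` header, which is what lets the server resume the stream.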
SSE production considerations (when you control the stack):
- Idle timeouts: Cloud load balancers (GCP ~30s, AWS ALB ~60s default) will kill quiet SSE connections. Fix: send a periodic heartbeat comment from the server every ~20s: ": keepalive\n\n". These are ignored by EventSource but keep the connection alive.
- Proxy buffering: Proxies like Nginx buffer responses by default — they wait for the "full" response before forwarding, which breaks infinite SSE streams. Fix: proxy_buffering off in the Nginx config. A one-liner, not an ongoing concern once set. Only a real problem if you don't control the proxy (corporate proxies, CDN edge nodes on lower-tier plans).
- HTTP/2: Removes the HTTP/1.1 browser limit of 6 connections per domain. Multiple SSE streams now multiplex over one connection — the "can't have multiple live streams" concern is gone on HTTP/2.
- Mobile clients: The OS may suspend the network connection when the app goes to background. Network switches (WiFi → LTE) also drop connections. EventSource auto-reconnects, and Last-Event-ID handles resumability. Design your server to replay missed events by ID.
- Pod restarts / scaling: Long-lived connections mean all clients on a restarting pod reconnect at once. Stagger restarts and implement reconnect jitter on the client to avoid a thundering herd.
WebSockets: Full-duplex persistent TCP connection. Both sides send messages anytime. Higher infrastructure cost (stateful connections — load balancer must route same client to same server, or use Redis Pub/Sub to broadcast across servers).
SSE vs WebSocket — when SSE wins: SSE is plain HTTP — no protocol upgrade, works through most firewalls and proxies, EventSource handles reconnect automatically, simpler to implement and monitor. If you only need server → client push (notifications, live feeds, progress updates), SSE is the better default. Only reach for WebSockets when you need bidirectional messaging.
Limits:
- Connections per server: ~10K–100K out of the box, but with OS tuning easily 1M+ on a single server
- Memory per connection: ~2–8KB kernel overhead → 100K connections ≈ 200MB–800MB RAM — this is the real bottleneck, not port numbers
- Scale-out: requires sticky sessions (same client always hits same server) or a fan-out layer (Redis Pub/Sub) so any server can push to any client
The 65,535 connection limit is a myth. That number is the range of TCP port numbers — not the number of connections a server port can accept. Each TCP connection is identified by a 4-tuple: source IP + source port + destination IP + destination port. Your server listens on one port (e.g. 443), but every client brings its own source IP and source port — so the combination is unique per client. A single listening port can theoretically accept connections from billions of clients. The real limits are:
- File descriptors — Linux gives each process 1,024 open files by default. Every connection is one fd. Tune with ulimit -n 1000000 and fs.file-max in /etc/sysctl.conf.
- RAM — ~2–8KB kernel overhead per connection. 1M connections ≈ 2–8GB just for kernel state, before your app's per-connection memory.
- CPU — context switching and event loop pressure as connection count grows.
With tuning (ulimit, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog), servers routinely handle 1M+ concurrent connections. This is what the C10M problem (10 million connections) research is about.
Real examples:
- Slack: WebSockets for real-time messages.
- Twitter feed: SSE for new tweet notifications.
- GitHub CI status page: SSE or long polling.
- Stock ticker: WebSockets for live prices.
- Ticketmaster seat map: SSE to push seat reservations in real-time — when any user reserves a seat, push an event to all viewers of that event's seat map so it goes grey instantly without a page refresh.
Mobile Push Notifications — APNs and FCM
The mandatory platform gateways for delivering notifications to iOS and Android devices. You cannot push directly to a phone — Apple and Google each run a persistent connection infrastructure to every device on their platform. Your server talks to their gateway; the gateway handles delivery.
NFR triggers: Real-time notifications, user engagement, background alerts.
The two gateways:
| | APNs | FCM |
|---|---|---|
| Run by | Apple | Google (Firebase) |
| Covers | iOS, macOS, watchOS | Android, web push (Chrome, Firefox) |
| Auth | Device token + JWT or certificate | Registration token + service account key |
| Offline behaviour | Stores last notification, delivers on reconnect (TTL configurable) | Same — queues and delivers when device is reachable |
| Fan-out | One message per device | Topics: one message fans out to all subscribers |
Flow:
Mobile app launches for first time
→ registers with APNs / FCM
→ receives device token / registration token
→ app sends token to your backend → stored against user
Your Notification Service receives trigger (ticket confirmed, seat available):
→ look up user's device token + platform
→ iOS: POST to api.push.apple.com with token + payload
→ Android: POST to fcm.googleapis.com with token + payload
→ gateway delivers to device (immediately if online, queued if offline)
You never talk to the device directly. The token is the address.
Payload:
{
"title": "Your tickets are confirmed",
"body": "Taylor Swift · Madison Square Garden · June 14",
"data": { "eventId": "123", "screen": "booking_confirmation" }
}
data is a silent background payload — app acts on it (update badge count, prefetch content) without showing a visible notification.
Scaling — decouple from request path:
At high volume (millions of notifications per event), never call APNs/FCM synchronously inline. Publish notification jobs to a queue and have worker pools call the gateways asynchronously:
Ticket confirmed → Notification Service publishes to Kafka "notifications" topic
Worker pool reads from topic → calls APNs / FCM in parallel
→ log delivery status per notification
This decouples your app from gateway latency and handles burst fan-outs gracefully. APNs supports HTTP/2 connection pooling — one persistent connection multiplexes thousands of notifications rather than opening a new connection per message.
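The queue-plus-worker-pool decoupling can be sketched with a standard in-process queue — `fake_gateway_send` is a hypothetical stand-in for the HTTP/2 POST to APNs or FCM:

```python
import queue
import threading

delivered = []

def fake_gateway_send(token, payload):
    """Stand-in for a POST to APNs/FCM — just records the delivery."""
    delivered.append((token, payload["title"]))

def worker(jobs):
    while True:
        job = jobs.get()
        if job is None:                 # poison pill: shut the worker down
            return
        fake_gateway_send(*job)
        jobs.task_done()

# The Kafka "notifications" topic stand-in: a queue off the request path
jobs = queue.Queue()
pool = [threading.Thread(target=worker, args=(jobs,)) for _ in range(4)]
for t in pool:
    t.start()

# The request path only enqueues — it never blocks on gateway latency
for i in range(100):
    jobs.put((f"device-token-{i}", {"title": "Your tickets are confirmed"}))

jobs.join()                             # wait for the worker pool to drain
for _ in pool:
    jobs.put(None)
for t in pool:
    t.join()
assert len(delivered) == 100
```

The request handler's cost is one `put()`; all gateway latency, retries, and burst absorption live in the worker pool.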
Abstraction options — you rarely call the raw APIs directly:
- firebase-admin SDK — official Google SDK, handles FCM for any backend language
- AWS SNS — single API that routes to APNs, FCM, or SMS based on platform. Removes the need to manage two gateways separately.
- OneSignal / Twilio Notify — managed services that abstract both gateways, add analytics, segmentation, and retry logic
In an interview: "Device tokens are registered on first app launch and stored in our User Service. Our Notification Service publishes jobs to Kafka; worker pools consume and call APNs for iOS and FCM for Android. We never block the booking request path on notification delivery — it's fully async."
Virtual Waiting Queue
A controlled admission system placed in front of a high-demand service. Users queue before they can access the booking page. The queue meters user flow to prevent system overload and, crucially, to improve UX — during a Taylor Swift on-sale, the seat map fills in milliseconds. Without a queue, users see a chaotic, constantly-shifting map and repeatedly click seats that are already gone. With a queue, they enter when there are seats available for them, or they get a clear wait time.
The senior vs staff distinction: The technically harder solution (real-time seat map via SSE) handles moderately popular events well but breaks down at extreme scale. The virtual queue is architecturally simpler and solves the UX problem more completely. Knowing when to trade technical complexity for a better user experience is a staff-level signal in interviews.
Architecture:
User requests booking page
→ placed in virtual queue (not shown the seat map yet)
→ SSE connection established for position updates
→ Redis sorted set: ZADD queue:{eventId} <timestamp> <userId>
Queue manager (periodic job or event-driven):
→ ZPOPMIN queue:{eventId} N ← dequeue N users
→ SADD admitted:{eventId} <userId> (with TTL, e.g. 10 min)
→ push "you're admitted" via SSE connection
User's browser receives SSE event → redirects to booking page
Booking Service:
→ on every reservation request, check SISMEMBER admitted:{eventId} <userId>
→ reject if not admitted — even if request is forged or replayed
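The architecture above can be sketched end to end in-process: a heap stands in for the Redis sorted set (score = arrival time), and a dict with expiry timestamps stands in for the admitted set with TTL. Names are illustrative:

```python
import heapq
import itertools
import time

class VirtualQueue:
    """Virtual-waiting-queue sketch: FIFO queue + TTL-gated admitted set."""

    def __init__(self, admit_ttl=600):
        self.heap = []                    # (arrival_order, user_id) — FIFO
        self.counter = itertools.count()  # monotonic stand-in for timestamp
        self.admitted = {}                # user_id -> admission expiry time
        self.admit_ttl = admit_ttl

    def join(self, user_id):              # ZADD queue:{eventId} <ts> <userId>
        heapq.heappush(self.heap, (next(self.counter), user_id))

    def admit(self, n):                   # ZPOPMIN N + mark admitted with TTL
        admitted = []
        for _ in range(min(n, len(self.heap))):
            _, user = heapq.heappop(self.heap)
            self.admitted[user] = time.time() + self.admit_ttl
            admitted.append(user)         # here: push "you're admitted" via SSE
        return admitted

    def is_admitted(self, user_id):       # the Booking Service's gate check
        return self.admitted.get(user_id, 0) > time.time()

q = VirtualQueue()
for u in ["alice", "bob", "carol"]:
    q.join(u)
assert q.admit(2) == ["alice", "bob"]     # FIFO: oldest entries first
assert q.is_admitted("alice")
assert not q.is_admitted("carol")         # still waiting — booking rejected
```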
Key design decisions:
- Redis Sorted Set (ZADD queue:{eventId} <timestamp> <userId>) — timestamp as score gives FIFO ordering. ZPOPMIN dequeues the oldest entries atomically. ZRANK returns a user's position for the "you're #4,231 in line" display.
- SSE over WebSocket — the queue only needs server → client communication (position updates, admission notification). SSE is simpler: no protocol upgrade, works through HTTP proxies, auto-reconnects natively. Only use WebSocket if you need bidirectional messaging (e.g. user can send a "cancel my place" message).
- Admitted set with TTL — mark the user admitted with a 600-second expiry (10 minute window). Note SADD has no per-member TTL, so in practice use a per-user key: SET admitted:{eventId}:{userId} 1 EX 600. The Booking Service checks this before allowing any reservation. If the user's admission expires (they walked away), the slot opens for the next person in queue. TTL should match your checkout window.
- Dequeue rate — tune based on how fast seats are being sold, not a fixed rate. If 100 seats remain and each user takes ~5 minutes to book, admit 20 users every minute. Admitting too many causes the seat map to fill before they book; too few means slow throughput.
What to tell the interviewer: "For a Taylor Swift on-sale, I'd enable the virtual queue as an admin-controlled feature. Normal events don't need it — SSE seat map updates are enough. The queue sits in front of the Booking Service, not the Event Service, so browsing and discovery still work normally. I'd use Redis Sorted Set for ordering, SSE for updates, and an admitted set with TTL as the gate the Booking Service checks."
Geospatial
Techniques for storing and querying location data at scale — proximity search, radius queries, and indexing millions of moving entities.
Why you can't use a regular database for high-frequency location updates:
Two problems compound each other at scale:
Write volume: 10M drivers sending location every 5 seconds = 2M writes/sec. A well-tuned PostgreSQL handles ~5–10K writes/sec before sharding. DynamoDB can absorb the throughput but at on-demand pricing ($1.25/million WRUs), 2M writes/sec of 100-byte payloads costs ~$200K/day. Neither is viable as a naive solution.
B-tree indexes don't work for 2D geo queries: A standard B-tree index works on one dimension — it can find all rows where lat BETWEEN 40.0 AND 41.0 efficiently, but combining that with lng BETWEEN -74.0 AND -73.0 requires two separate index scans merged by the query planner. For a true radius query (find all points within 5km of this coordinate), there's no efficient B-tree representation. The DB falls back to a full table scan calculating distance for every row. At millions of rows this is a non-starter.
Three approaches in order of quality:
| Approach | How | Problem |
|---|---|---|
| Direct DB writes | Write every update to DB, query with lat/lng columns | Write overload + B-tree can't do radius queries efficiently |
| Batch processing | Aggregate updates over N seconds, write to DB in batches | Reduces write load but location data is always N seconds stale — suboptimal matches |
| Redis Geo (best) | Write to Redis in-memory geo store, query with GEOSEARCH | Real-time, handles 2M writes/sec, sub-ms proximity queries |
Geohashing
Encodes a (latitude, longitude) pair into a short string. Nearby locations share a common prefix.
NFR triggers: Proximity search, location-based queries.
How it works: The world is recursively divided into a grid. Each cell gets a string like u4pruydqqv. The longer the string, the smaller the cell. Nearby points share prefixes: u4pru and u4prv are neighbors.
Use in a system: Store geohash as a DB column. To find nearby items, query for all rows whose geohash starts with the same prefix. Fast with a string index.
Limitation: Edge cases at grid boundaries — two points physically close can have different prefixes if they're on opposite sides of a cell boundary. Fix by also querying the 8 neighboring cells.
Precision levels:
| Length | Cell size |
|---|---|
| 4 | ~20 km |
| 5 | ~2.4 km |
| 6 | ~610 m |
| 7 | ~76 m |
| 8 | ~19 m |
Real example: Uber stores driver locations as geohashes. Finding nearby drivers = prefix match on geohash string.
Redis Geo
Redis sorted set where members are stored by a 52-bit geohash score. Enables radius and distance queries in O(n + log m). The best solution for high-frequency location updates at scale.
Commands:
- GEOADD drivers 13.361 38.115 "driver:42" — add or update location (overwrites the previous position for the same member — you always have the latest)
- GEODIST drivers driver:42 driver:99 km — distance between two members
- GEOSEARCH drivers FROMMEMBER driver:42 BYRADIUS 5 km ASC — find members within a radius, sorted by distance (Redis 6.2+ — preferred over the deprecated GEORADIUS)
- GEOSEARCH drivers FROMLONLAT 15.0 37.0 BYRADIUS 5 km ASC COUNT 10 — find the nearest 10 members from a coordinate
- GEOSEARCH drivers FROMLONLAT 15.0 37.0 BYBOX 10 10 km ASC — find members within a bounding box
GEOSEARCH (Redis 6.2+) replaces the older GEORADIUS and GEORADIUSBYMEMBER commands. It supports both radius and bounding box queries, and can search from either a fixed coordinate or an existing member.
Precision levels (geohash string length):
| Precision | Cell size | Use for |
|---|---|---|
| 4 chars | ~20 km | City-level search |
| 5 chars | ~2.4 km | Neighborhood |
| 6 chars | ~610 m | Street-level |
| 8 chars | ~19 m | Building-level |
Redis as the live location store — the full pattern:
For a ride-sharing system with 10M drivers updating every 5 seconds, Redis is the primary store for live positions. The DB stores historical trip data, not current locations.
Driver app sends GPS update every 5 seconds
→ Location Service: GEOADD drivers <lng> <lat> "driver:42"
→ Also: ZADD driver_timestamps <unix_timestamp> "driver:42"
Rider requests a match:
→ GEOSEARCH drivers FROMLONLAT <rider_lng> <rider_lat> BYRADIUS 5 km ASC COUNT 20
→ Returns nearest 20 active drivers instantly
Stale driver cleanup (runs every 30 seconds):
→ ZRANGEBYSCORE driver_timestamps 0 <now - 30s> ← drivers not updated in 30s
→ ZREM driver_timestamps <stale_drivers>
→ ZREM drivers <stale_drivers>
The companion driver_timestamps sorted set (score = last update timestamp) lets you efficiently find and remove offline drivers. Without it, offline drivers stay in the geo set indefinitely and appear available to riders.
Durability tradeoff — location data is ephemeral by nature:
Redis is in-memory. If it crashes, live location data is lost. For most data this is unacceptable — but driver locations are special: every driver re-reports their position within 5 seconds. A Redis crash means 5 seconds of stale data before the geo set is fully rebuilt. This is an acceptable tradeoff explicitly worth stating in an interview.
Mitigation options if you need stronger guarantees:
- RDB snapshots — Redis dumps memory to disk every N seconds. Fast recovery, small data loss window.
- AOF (Append-Only File) — every write logged to disk. Near-zero data loss. Higher I/O overhead.
- Redis Sentinel — automatic failover for a primary + replica setup. If primary dies, Sentinel promotes a replica to primary in seconds. Location data on the replica is at most milliseconds stale.
- Redis Cluster — shards data across nodes. Each shard has its own primary + replicas. For 10M drivers you'd shard by city or geohash prefix to distribute load.
Redis Sentinel vs Redis Cluster:
| | Redis Sentinel | Redis Cluster |
|---|---|---|
| Purpose | High availability — automatic failover | Horizontal scale — sharding across nodes |
| Data distribution | All data on one primary (replicated) | Data split across multiple primaries |
| Use when | Dataset fits on one node, need HA | Dataset too large for one node OR write throughput exceeds single node |
| Failover | Sentinel promotes replica automatically (~seconds) | Built-in, per shard |
Limits:
- Built on sorted sets — same ~4.3 billion member limit per key
- Coordinate precision: ~0.6mm at equator (52-bit geohash internally)
- Radius query: O(n + log m) — fast up to millions of members
- For 10M drivers: split into multiple geo keys by city/region to keep each key manageable
Write optimization patterns for high-frequency location updates:
At 2M writes/sec you need to be deliberate about how updates reach Redis. Several techniques reduce pressure:
1. Multi-member GEOADD (batch in one command): Redis supports adding multiple members in a single GEOADD call. Instead of one network round trip per driver, batch several together:
GEOADD drivers
-74.006 40.713 "driver:42"
-73.991 40.722 "driver:99"
-74.012 40.708 "driver:17"
One round trip for N updates. Simple and always worth doing.
2. Redis pipelining: Send multiple commands to Redis in one network round trip without waiting for each response. The Location Service buffers updates for 50–100ms and flushes as a pipeline. Throughput improvement is proportional to the number of commands batched — 100 commands in one pipeline vs 100 individual round trips.
3. Write coalescing: If a driver sends 3 updates in 200ms, only the last position matters. Coalesce in the Location Service before writing — deduplicate by driver ID, keep the latest. Reduces Redis write volume without any staleness tradeoff (you always write the most recent position).
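Write coalescing falls out of a dict keyed by driver ID: later writes overwrite earlier ones, and a periodic flush emits one batch. A minimal sketch (illustrative class name; in production a background task would call `flush()` every 50–100ms and send the batch as a single pipelined multi-member GEOADD):

```python
class CoalescingWriter:
    """Sketch of write coalescing: buffer updates, keep only the latest
    position per driver, flush everything as one batch."""

    def __init__(self):
        self.buffer = {}  # driver_id -> (lon, lat); later writes overwrite

    def write(self, driver_id, lon, lat):
        self.buffer[driver_id] = (lon, lat)

    def flush(self):
        # one multi-member GEOADD / pipeline instead of N round trips
        batch = [(d, lon, lat) for d, (lon, lat) in self.buffer.items()]
        self.buffer.clear()
        return batch
```

Three updates for the same driver inside one flush window become one write, and the batch always carries the most recent position, so there is no staleness tradeoff.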
4. Kafka as write buffer:
Driver app → Location Service → Kafka topic "location-updates"
→ Location Consumer → Redis GEOADD (batched)
Kafka absorbs traffic spikes. The consumer reads in micro-batches (every 100ms), coalesces by driver ID, and flushes to Redis. Decouples ingestion rate from Redis write rate. Especially useful during rush hour spikes when update volume surges 5–10× baseline.
5. Region-based sharding: Split the geo set by city or geohash prefix:
GEOADD drivers:nyc ...
GEOADD drivers:la ...
GEOADD drivers:chi ...
Each shard lives on a different Redis node. Distributes write load and keeps each geo set small enough for fast queries. Proximity searches only need to hit the relevant city shard — a rider in NYC never queries drivers:la.
6. Client-side batching (last resort): Buffer updates on the driver's device and send every 2–3 seconds instead of every 1 second. Reduces network calls and server load but introduces positional staleness. Acceptable for ride-sharing (position accuracy of a few seconds is fine for matching) but not for turn-by-turn navigation.
In an interview: "I'd start with multi-member GEOADD and write coalescing on the Location Service — both are zero-latency tradeoffs. For the rush hour spike, I'd add Kafka as a write buffer between the Location Service and Redis. Region sharding handles the scale-out as we grow to more cities."
Real example: Tinder matches, Uber driver dispatch, Yelp restaurant search — all can use Redis Geo for "find X near me" queries.
Quadtree
Recursively divides 2D space into four quadrants. Each node subdivides until a cell has ≤ N points. Used when you need dynamic spatial indexing with fast insert/delete.
NFR triggers: Proximity queries with frequently updating locations (moving drivers, users).
How it works: Start with the whole world as the root. If a region has more than N points, split it into 4 quadrants. Recurse. To find nearby points, traverse down to the relevant quadrant.
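The subdivision logic above can be sketched as a minimal point quadtree (an illustrative class, not a production spatial index):

```python
class Quadtree:
    """Minimal point quadtree sketch: split a cell into 4 quadrants
    once it holds more than `capacity` points."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None  # [NW, NE, SW, SE] after split

    def _contains(self, x, y):
        x0, y0, x1, y1 = self.bounds
        return x0 <= x < x1 and y0 <= y < y1

    def insert(self, x, y):
        if not self._contains(x, y):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            Quadtree(x0, my, mx, y1, self.capacity),  # NW
            Quadtree(mx, my, x1, y1, self.capacity),  # NE
            Quadtree(x0, y0, mx, my, self.capacity),  # SW
            Quadtree(mx, y0, x1, my, self.capacity),  # SE
        ]
        for p in self.points:  # push existing points down
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        """Collect all points inside the query rectangle."""
        if found is None:
            found = []
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return found  # query box doesn't overlap this cell: prune
        found += [(x, y) for x, y in self.points
                  if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, found)
        return found
```

The prune step in `query` is where the speed comes from: entire subtrees whose cells don't overlap the query box are skipped without examining any points.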
vs Geohash: Quadtree adapts to density (dense city = smaller cells, sparse desert = large cells). Geohash uses fixed-size grid cells.
Real example: Google Maps, game engines for collision detection, location services with millions of moving entities.
PostGIS — PostgreSQL Geospatial Extension
If you're already on PostgreSQL and your location write volume is manageable (< 50K writes/sec), PostGIS gives you proper geospatial queries without running a separate Redis cluster.
PostGIS adds geometry/geography data types and spatial functions to PostgreSQL. Under the hood it uses GiST indexes (not B-tree) which are designed for multi-dimensional data and support true radius queries efficiently.
-- Add a PostGIS geography column (uses sphere math — accurate for real distances)
ALTER TABLE drivers ADD COLUMN location geography(POINT, 4326);
-- Update driver position
UPDATE drivers SET location = ST_SetSRID(ST_MakePoint(-74.006, 40.7128), 4326)::geography
WHERE driver_id = 42;
-- GiST index makes proximity queries fast
CREATE INDEX idx_drivers_location ON drivers USING GIST(location);
-- Find all drivers within 5km of a rider — uses GiST, not full table scan
SELECT driver_id, ST_Distance(location, ST_MakePoint(-73.99, 40.72)::geography) AS dist_m
FROM drivers
WHERE ST_DWithin(location, ST_MakePoint(-73.99, 40.72)::geography, 5000)
ORDER BY dist_m
LIMIT 20;
ST_DWithin is the key function — it uses the GiST index to eliminate most rows before calculating exact distances.
When PostGIS is enough vs when you need Redis Geo:
| | PostGIS | Redis Geo |
|---|---|---|
| Write throughput | ~5–50K writes/sec | Millions/sec |
| Query latency | 10–50ms | < 1ms |
| Durability | Full ACID | Configurable (RDB/AOF) |
| Operational cost | Zero (already on Postgres) | Separate Redis cluster |
| Use when | Moderate scale, need durability, location is part of complex query with joins | High-frequency updates (ride-sharing, IoT), need sub-ms proximity queries |
For Uber-scale (10M drivers, 2M writes/sec): Redis Geo. For Yelp-scale (restaurant search, low write frequency): PostGIS is sufficient and simpler.
Database Techniques
Patterns for scaling read throughput, managing connections under load, splitting data across nodes, and recovering from failures.
Database Indexing
An index is a sorted lookup structure that lets the DB jump directly to matching rows instead of scanning the whole table. Without one, every query is a full table scan — O(n). With one, it's O(log n). At 10 million rows, that's the difference between checking 20 index entries vs reading 2GB of data.
Index types:
| Type | How It Works | Best For |
|---|---|---|
| B-tree (default) | Balanced tree — supports equality, range, sort, prefix | General purpose — most queries |
| Hash | Hash of the key → row pointer | Exact-match lookups only (WHERE email = '...'). No ranges. |
| Full-text | Inverted index of words | Text search (LIKE '%word%' won't use B-tree) |
| GiST / GIN (Postgres) | Generalized for complex types | Geo queries, JSONB, arrays |
Compound indexes — column order matters:
-- Index on (status, created_at)
-- Helps: WHERE status = 'active'
-- Helps: WHERE status = 'active' AND created_at > '2024-01-01'
-- Doesn't help: WHERE created_at > '2024-01-01' (leading column missing)
Put the most selective column first. Queries filtering only on trailing columns can't use the index.
EXPLAIN to verify index use:
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users → no index, full table scan
CREATE INDEX idx_users_email ON users(email);
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
-- Index Scan using idx_users_email → O(log n) lookup
In an interview: When sketching your schema, state which columns you'd index. "I'd index user_id on the orders table since most queries filter by user, and add a compound index on (event_id, status) for the seat availability query." Under-indexing kills far more production systems than over-indexing. Index maintenance overhead on writes is real but rarely the bottleneck.
Denormalization and Materialized Views
Normalization splits data across tables to eliminate redundancy. This is clean for writes but expensive for reads — every query needs joins across multiple tables. Denormalization reverses this: store redundant data in one place to make reads a single lookup.
When to denormalize: Read/write ratio above ~10:1. The extra write complexity (update multiple places when a user changes their name) is justified by eliminating expensive joins at read time.
-- Normalized: three joins on every order page load
SELECT u.name, o.order_date, p.product_name, p.price
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
WHERE o.id = 12345;
-- Denormalized: single-table read from order_summary
SELECT user_name, order_date, product_name, price
FROM order_summary
WHERE order_id = 12345;
Tradeoff: if a user changes their name, you must update every row in order_summary that references them. Rare update + frequent read = worth it. Frequent update + rare read = don't denormalize.
Materialized views take this further by precomputing expensive aggregations and storing the results:
-- Instead of recomputing on every page load:
CREATE MATERIALIZED VIEW product_ratings AS
SELECT product_id, AVG(rating) as avg_rating, COUNT(*) as review_count
FROM reviews GROUP BY product_id;
-- Refresh periodically (background job, or on write via trigger)
REFRESH MATERIALIZED VIEW product_ratings;
Product rating pages now read one precomputed row instead of aggregating millions of reviews. Refresh frequency is your staleness/freshness knob. PostgreSQL supports REFRESH MATERIALIZED VIEW CONCURRENTLY — reads still work during refresh.
In an interview: Mention denormalization when you have a read-heavy query that requires joining multiple large tables. Mention materialized views for aggregation-heavy reads (dashboards, leaderboards, analytics summaries).
Read Replicas
Copies of the primary database that handle read queries. All writes go to the primary (leader), which replicates changes to the replicas (followers); reads distribute across the replicas.
NFR triggers: Read-heavy scale, high availability.
When to add replicas: A well-indexed single DB handles ~50,000–100,000 reads/sec on modern hardware. Above that threshold — or when read load is affecting write performance — add read replicas. This is a rough estimate; actual numbers depend on query complexity, data size, and hardware.
Synchronous vs asynchronous replication:
| | Synchronous | Asynchronous |
|---|---|---|
| How it works | Primary waits for replica to confirm before acking the write | Primary acks immediately, replication happens in background |
| Replication lag | Zero — replica always current | 10ms–seconds depending on network and load |
| Write latency | +5–20ms | No impact |
| Data loss on primary crash | Zero (replica has everything) | Seconds of writes lost (RPO = seconds) |
| Use when | Financial systems, anything where stale reads are unacceptable | Social feeds, content platforms where slight staleness is fine |
Replication lag and read-your-writes: With async replicas, a user who just posted something might not see their own post if their next read goes to a lagging replica. Fix: route a user's own reads to the primary for a short window after their writes (sticky routing). Route all other reads to replicas.
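Sticky routing is a small amount of bookkeeping: remember each user's last write time and pin their reads to the primary for a short window. A sketch (illustrative names; the clock is injected so the behaviour is testable):

```python
import random
import time

class StickyRouter:
    """Sketch of read-your-writes routing: after a user writes, send that
    user's reads to the primary for `sticky_window` seconds; everyone
    else reads from a replica."""

    def __init__(self, primary, replicas, sticky_window=5.0, clock=time.monotonic):
        self.primary = primary
        self.replicas = replicas
        self.sticky_window = sticky_window
        self.clock = clock
        self.last_write = {}  # user_id -> timestamp of last write

    def route_write(self, user_id):
        self.last_write[user_id] = self.clock()
        return self.primary

    def route_read(self, user_id):
        t = self.last_write.get(user_id)
        if t is not None and self.clock() - t < self.sticky_window:
            return self.primary          # replica may still be lagging
        return random.choice(self.replicas)
```

Size the window to comfortably exceed your observed replication lag (e.g. a 5s window for sub-second lag); in a multi-server deployment the `last_write` map would live in a shared store such as Redis, or travel in a session cookie.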
Replicas also provide redundancy: If the primary fails, promote a replica to become the new primary — minimizing downtime. This is the same failover mechanism described in What-If Scenario 2.
Limits:
- PostgreSQL: supports hundreds of replicas technically — practical limit 5–10 before replication overhead stresses the primary
- Replication lag: < 100ms same AZ (async), seconds cross-region
- Each replica is a full copy of the entire dataset — scales reads linearly, does nothing for write throughput or dataset size. If the dataset is too large for fast queries, add sharding or caching instead.
Real example: Instagram routes feed reads to replicas, profile writes to primary. Replica lag is acceptable for feed freshness.
CQRS (Command Query Responsibility Segregation)
Separate the write model (commands) from the read model (queries) entirely. Instead of one model that handles both, writes go to a normalized, consistent store and reads are served from a separate, denormalized store optimized purely for fast retrieval.
NFR triggers: Write-heavy systems where reads and writes compete for resources, high read latency on complex joins, or when the read shape differs significantly from the write shape.
How it works:
Client → Write API → Command Handler → Primary DB (normalized, consistent)
↓ event / async replication
Client → Read API → Query Handler → Read Store (denormalized, precomputed)
The read store can be a Redis cache, Elasticsearch index, a materialized view, or a read replica with a different schema — whatever makes reads fast. It is updated asynchronously after each write.
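The command/query split can be sketched in a few lines: the write side appends events, and a projector consumes them to build the denormalized read model (illustrative class names; a real system would run the projector as an async consumer rather than calling `catch_up()` inline):

```python
class OrderCommands:
    """CQRS write side sketch: a normalized store plus an event log
    that the read side consumes asynchronously."""

    def __init__(self):
        self.orders = {}   # normalized write model
        self.events = []   # change log for the read side

    def place_order(self, order_id, user, items):
        self.orders[order_id] = {"user": user, "items": items}
        self.events.append(("order_placed", order_id))


class OrderSummaryProjector:
    """Read side: projects events into a denormalized summary store."""

    def __init__(self, commands):
        self.commands = commands
        self.summaries = {}  # order_id -> flat read model
        self.offset = 0      # position in the event log

    def catch_up(self):
        for kind, order_id in self.commands.events[self.offset:]:
            if kind == "order_placed":
                o = self.commands.orders[order_id]
                self.summaries[order_id] = {
                    "user": o["user"],
                    "item_count": len(o["items"]),
                }
        self.offset = len(self.commands.events)
```

The gap between a write landing and `catch_up()` running is exactly the eventual consistency window the "When NOT to use" section warns about.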
Relation to what you've already seen:
- Read replicas separate read and write traffic at the DB layer — same schema, different nodes
- Denormalization separates read and write data shape in one DB — one model optimized for reads
- CQRS separates both: different models, different stores, different APIs end-to-end
When to use:
- Write-heavy systems where read queries are slowing down writes (contention on the same tables)
- Read shape is very different from write shape — e.g., you write normalized orders but serve a precomputed order-summary view
- You need to scale reads and writes independently
When NOT to use:
- Simple CRUD apps — the added complexity rarely pays off below ~10M DAU
- Strong consistency required across read and write — CQRS read stores are eventually consistent
Real example: Twitter writes tweets to a primary DB, then fans out asynchronously to per-user feed caches (Redis). The write model is normalized; the read model is a precomputed list. Classic CQRS.
Database Proxy / Connection Pooler
Sits between your app and the database. Maintains a small pool of real DB connections and multiplexes many app connections through them. No application code changes — just point the connection string at the proxy instead of the DB directly.
NFR triggers: Scale (connection exhaustion), fault tolerance (queues requests during DB failover).
How it works:
500 app threads → Proxy → 20–100 real connections → Database
App thinks it's talking to the DB directly. The proxy manages the real connection pool. During DB primary failover, the proxy queues incoming requests for a few seconds then reconnects to the promoted replica — the app never knows the primary changed.
Pooling modes:
- Transaction pooling (most common) — connection returned to pool after each transaction. Most efficient. 500 app threads → 20 real DB connections.
- Session pooling — one real connection per client session. Less efficient but compatible with all DB features.
Tools by database:
| Database | Tool | Notes |
|---|---|---|
| PostgreSQL | PgBouncer | Lightweight, connection pooling only. Most common. |
| PostgreSQL | Pgpool-II | Heavier — adds load balancing and failover on top of pooling. |
| MySQL / MariaDB | ProxySQL | Connection pooling + query routing + read/write splitting. |
| MySQL / MariaDB | MySQL Router | Official MySQL proxy. Part of InnoDB Cluster. Handles failover. |
| MySQL at scale | Vitess | Built by YouTube. Adds horizontal sharding on top of pooling. CNCF project. |
| Any (AWS managed) | RDS Proxy | Supports PostgreSQL, MySQL, MariaDB, SQL Server. Fully managed, no ops overhead. |
Limits:
- Reduces DB connections from thousands to tens — prevents connection exhaustion at scale
- Adds ~0.1–0.5ms overhead per query
- During failover: queues requests for ~5–10 seconds then reconnects transparently
What about smart DB clients?
Modern clients can also detect primary failure and reconnect — but with caveats:
- Aurora cluster endpoint — Aurora updates the endpoint to the new writer automatically. Client reconnects to the same URL. Cleanest solution, no proxy needed for failover.
- MySQL Connector/J — provide multiple hosts, client tries each on failure. Also supports automatic read/write splitting.
- PostgreSQL clients (JDBC, psycopg2) — support multiple hosts, but failure detection relies on the connection timeout (30s by default unless tuned), which is slower than a proxy (~5s).
Client failover handles the routing problem but not the connection count problem — 10 app servers × 50 client-side connections each = 500 DB connections regardless. A proxy collapses that to 20. Both are often used together: client for failover awareness, proxy for connection pooling.
Real example: Almost every high-scale SQL deployment uses a connection pooler. Without it, 500 app servers × 20 threads = 10,000 DB connections — most SQL databases struggle above ~500–1,000 concurrent connections.
Database Sharding
Splits data across multiple DB nodes by a shard key. Each node owns a subset of the data.
NFR triggers: Write-heavy scale beyond what a single DB can handle (~5K–10K writes/sec for PostgreSQL).
Shard key choice matters:
- By user ID: Common. All data for a user lives on one shard. Good for user-scoped queries. Risk: hot shards if some users are far more active.
- By geography: US users on US shard, EU on EU shard. Good for data residency compliance.
- By hash: Distributes evenly but makes range queries hard.
Consistent hashing: Maps both data keys and nodes onto a hash ring. When a node is added/removed, only the adjacent keys are remapped — not the entire dataset. Minimizes resharding cost.
Cross-shard queries are expensive. Design your shard key so most queries hit one shard. Avoid joins across shards.
Limits & when to shard:
- Single PostgreSQL/MySQL: ~5K–10K writes/sec and ~1–2TB data before sharding becomes necessary
- Each shard is a full independent DB — write throughput scales linearly with shard count
- Resharding (changing shard count) is painful — consistent hashing minimizes but doesn't eliminate this
Real example: Discord shards message data by channel ID. All messages in a channel live on one shard.
Time-Series Database
Optimized for append-only writes of timestamped data. Efficient range queries and aggregations over time windows.
NFR triggers: Metrics, monitoring, IoT sensor data, financial tick data.
Why not regular DB: Time-series data is always appended (never updated), queried by time range, and often aggregated (avg, max, sum over 5-minute windows). Regular DBs aren't optimized for this pattern.
Real example: Prometheus stores metrics as time-series. Grafana queries it for dashboards. InfluxDB for IoT sensor readings.
Bloom Filter
Probabilistic data structure that answers "is this item in the set?" in O(1) with zero false negatives and a small false positive rate.
NFR triggers: Avoid expensive DB lookups for items that almost certainly don't exist.
How it works: Uses multiple hash functions and a bit array. add(x) sets several bits. contains(x) checks those bits — if any are 0, x is definitely not in the set. If all are 1, x is probably in the set (small chance of false positive).
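The mechanics above fit in a short sketch. This version derives its k positions by double hashing a single SHA-256 digest, a common implementation trick (the class is illustrative, not a specific library's API):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash positions over an m-bit array.
    contains() may return false positives, never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # double hashing: k positions derived from one digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        # any unset bit -> definitely absent; all set -> probably present
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note that `add` only ever sets bits, which is also why plain Bloom filters cannot delete: clearing a bit might break membership for other items that share it.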
Limits & sizing:
- 10 bits per element → ~1% false positive rate (rule of thumb)
- 1 billion elements at 1% FPR: ~1.2GB RAM
- 1 billion elements at 0.1% FPR: ~1.8GB RAM
- Cannot delete elements — use a Counting Bloom Filter if deletes are needed
- Zero false negatives — if it says "not in set", it's definitely not there
Real example: Google Chrome uses a Bloom filter to check if a URL is in the safe browsing list before hitting the network. Cassandra uses one to skip SSTable lookups for keys that don't exist.
Search
Strategies for full-text search — from staying in your existing database to running a dedicated search engine.
PostgreSQL Full-Text Search
Built into PostgreSQL with no extra infrastructure. Uses tsvector (a preprocessed document representation) and tsquery (a search query), indexed with GIN for fast lookup.
How it works:
-- Add a tsvector column (or use an expression index)
ALTER TABLE events ADD COLUMN search_vector tsvector;
-- Populate it (to_tsvector handles tokenization, stemming, stop words)
UPDATE events SET search_vector =
to_tsvector('english', coalesce(name, '') || ' ' || coalesce(description, ''));
-- GIN index makes queries O(log n) instead of full table scan
CREATE INDEX idx_events_search ON events USING GIN(search_vector);
-- Query: much faster than LIKE '%Taylor%'
SELECT * FROM events WHERE search_vector @@ to_tsquery('english', 'Taylor');
vs LIKE: LIKE '%Taylor%' forces a full table scan — O(n). tsvector + GIN is O(log n). At 10M rows, that's the difference between seconds and milliseconds.
Limitations:
- No fuzzy search (typo tolerance) — `Tayler` won't match `Taylor`
- Relevance ranking is basic — no machine learning, no popularity signals
- Slower to update than Elasticsearch — index must be rebuilt or kept in sync on writes
- Comfortable up to ~10M documents; degrades at higher volumes
When it's enough: Event search for moderately sized datasets where exact-word matching is acceptable and you don't want to run separate infrastructure.
Elasticsearch
A distributed search engine built on Apache Lucene. Excels at full-text search, fuzzy matching, faceted filtering, and aggregations at massive scale.
Core concept — inverted index:
Instead of storing documents and scanning them, Elasticsearch builds a map of every word → which documents contain it:
"Taylor" → [event:123, event:456, event:789]
"Swift" → [event:123, event:890]
"Concert" → [event:123, event:456, event:201]
A search for "Taylor Swift Concert" intersects these posting lists in microseconds — no scanning required. This is fundamentally different from a B-tree index, which works on column values, not free text.
Fuzzy search: Elasticsearch uses edit distance (Levenshtein) to match near-spellings. Tayler → matches Taylor because they differ by one character. Configurable: fuzziness: 1 allows one edit, fuzziness: 2 allows two. SQL databases cannot do this without custom extensions.
Built-in caching:
- Shard-level filter cache — caches results of filter queries (e.g. `status = 'active'`, date ranges). Reused across different queries that share the same filter.
- Request cache — caches full search responses at the shard level for aggregation-heavy queries. Particularly valuable for dashboards and analytics. Automatically invalidated when the underlying shard data changes.
These reduce query processing load on the cluster for repeated or similar queries — you get this without writing any caching code yourself.
Syncing PostgreSQL → Elasticsearch:
| Approach | How | Risk |
|---|---|---|
| Dual write | App writes to DB and ES in the same request | Partial failure — DB succeeds, ES fails → silent divergence. Not recommended. |
| Scheduled batch | Cron job queries DB for changes, bulk-indexes to ES | High latency (minutes), misses hard deletes cleanly |
| CDC (Debezium) | Reads PostgreSQL WAL, publishes change events, ES consumer indexes them | Best choice. Captures ALL changes — including from DB migrations, admin tools, and other services outside your app code. Near real-time (~seconds lag). |
| CDC + Kafka | WAL → Debezium → Kafka → ES consumer | Adds durability and replay capability. ES consumer can fall behind and catch up. Worth adding if Kafka is already in your stack. |
CDC is the best default because it's the only approach that captures changes originating outside your application code. If a DB migration bulk-updates event descriptions, CDC picks it up; dual-write and batch sync depend on your app being the only writer.
Search result caching on top of Elasticsearch:
For frequently repeated queries (e.g. search:keyword=Taylor+Swift&date=2024), cache results in Redis:
key: "search:keyword=Taylor Swift&start=2024-01-01&end=2024-12-31"
value: [event1, event2, event3]
TTL: 86400 (24 hours)
Build the cache key from all query parameters so different queries get different cache entries. Invalidation is the hard part — search result sets don't map neatly to a single entity, so tag-based invalidation or a moderate TTL (hours, not days) is the pragmatic choice.
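Building the key deterministically matters: the same logical query must always produce the same string regardless of parameter order. A minimal sketch (the function name is illustrative):

```python
def search_cache_key(params):
    """Build a deterministic cache key from query parameters: sorting
    the keys makes {'a': 1, 'b': 2} and {'b': 2, 'a': 1} map to the
    same cache entry."""
    parts = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return f"search:{parts}"
```

In practice you would also normalize values (lowercase the keyword, canonicalize date formats) before building the key, so trivially different spellings of the same query share one entry.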
CDN for non-personalized search: If search results are the same for all users (public event search, no personalization), cache API responses at the CDN edge. A Tokyo user gets results from a Tokyo edge node, not your Virginia origin. Only valid for non-personalized queries — personal recommendations or saved searches should never be CDN-cached.
Operational cost: Elasticsearch runs on JVM, requires a cluster (3+ nodes recommended), and has significant operational overhead. Worth it for complex search, faceted filtering, and log analytics at scale. For simpler use cases, PostgreSQL FTS or Typesense/Meilisearch are easier to run.
Reliability Patterns
Components and strategies that keep the system running — or recovering fast — when individual parts fail.
Circuit Breaker
Stops calling a failing downstream service and returns immediately, giving it time to recover instead of piling on more load.
NFR triggers: Fault tolerance, high availability.
States:
- Closed (normal) — requests pass through. Failures are counted.
- Open (tripped) — too many failures. Skip the call entirely, return fallback immediately (< 1ms). Wait N seconds before probing.
- Half-open (probe) — let one request through. Success → close. Failure → reopen.
Why it matters — cascade failure without circuit breaker:
Recommendation Service slow
→ Homepage threads all block waiting (30s timeout each)
→ Thread pool exhausts
→ Homepage Service crashes
→ Gateway times out waiting for Homepage
→ Total outage — caused by a non-critical service being slow
With circuit breaker: trips after N failures → fast-fail in < 1ms → fallback to generic recommendations → Homepage stays healthy.
What the user sees:
- Without circuit breaker: spinner for 30 seconds → 500 error. DB never recovers because load keeps hammering it.
- With circuit breaker + no fallback: fast 503 in milliseconds. DB recovers in ~30s.
- With circuit breaker + fallback (L1 cache, stale data): user sees slightly stale content. Nobody notices.
Implementation Option 1 — Application code (Resilience4j, pybreaker, gobreaker):
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open 30 seconds
    .slidingWindowSize(10)                             // measure last 10 calls
    .build();
CircuitBreaker breaker = CircuitBreaker.of("db", config);

User user = breaker.executeSupplier(() ->
    database.query("SELECT * FROM users WHERE id = ?", userId)
);
Advantage: business-logic aware. Can circuit break only writes but not reads. Can return a specific fallback value (L1 cache, default response) instead of an error.
Implementation Option 2 — Service mesh (Istio + Envoy), no application code:
outlierDetection:
consecutiveErrors: 5 # trip after 5 consecutive failures
interval: 30s # measured over 30 seconds
baseEjectionTime: 30s # eject the host for 30 seconds
Sidecar proxy (Envoy) is injected next to every service and intercepts all traffic. App code is completely unaware. Coarser — trips on connection failures and timeouts, but can't return a business-level fallback.
Implementation Option 3 — Infrastructure layer:
| Option | What It Does | Limitation |
|---|---|---|
| API Gateway (Kong, AWS API GW) | Circuit breaks external traffic at gateway | Only covers external calls, not internal service-to-service |
| Nginx / HAProxy upstream | Stops routing to dead backends | Detects dead hosts, not slow/struggling ones |
| RDS Proxy (AWS) | Pools DB connections, queues during failover | Solves connection exhaustion, not a full circuit breaker |
Best practice: Use both — service mesh for service-to-service protection, application-level for nuanced fallbacks ("return L1 cache when DB circuit is open").
Real example: Netflix Hystrix (now Resilience4j) wraps every downstream call. If the recommendations service goes down, the circuit opens and Netflix shows a generic list instead of erroring.
Rate Limiter
Controls how many requests a client can make in a time window. Protects the system from abuse and overload.
NFR triggers: Security, fault tolerance, scale.
Algorithms:
| Algorithm | How It Works | Characteristic |
|---|---|---|
| Token Bucket | Bucket refills at fixed rate. Each request consumes a token. | Allows bursts up to bucket size. Smooth average rate. |
| Leaky Bucket | Requests enter a queue (bucket). Processed at fixed rate. Overflow dropped. | Strict constant output rate. No bursts allowed. |
| Fixed Window | Count requests per time window (e.g. per minute). Reset at window boundary. | Simple. Edge case: 2× traffic at window boundaries. |
| Sliding Window | Rolling count over last N seconds. Smooths boundary problem. | Accurate. Slightly more memory. |
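Token bucket is the one most worth being able to sketch: capacity caps the burst, refill rate sets the sustained average. A minimal version with an injected clock so it can be tested deterministically (illustrative class, not a specific library):

```python
class TokenBucket:
    """Token bucket sketch: `capacity` allows bursts, `refill_rate`
    sets the sustained average rate (tokens per second)."""

    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = float(capacity)   # start full: bursts allowed immediately
        self.last = clock()

    def allow(self):
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill on each call (rather than a background timer) is the standard trick: the bucket only needs to know how much time passed since it was last asked.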
Where to implement: API Gateway (per user/IP before reaching app servers).
Redis implementation (Fixed Window): INCR is atomic — two clients can't interleave. This is the canonical Redis rate limit pattern:
INCR user:42:requests            // atomic increment — returns the new count
EXPIRE user:42:requests 60       // set the 60s TTL only when INCR returned 1 (window start)
// if count > limit → reject request
For sliding window, use a Redis Sorted Set: store each request as a member with timestamp as score, then ZREMRANGEBYSCORE to drop entries outside the window, ZCARD to count remaining.
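The sorted-set pattern translates directly to a per-user deque of timestamps: expire entries older than the window (ZREMRANGEBYSCORE), count what remains (ZCARD), reject if over the limit. An in-memory sketch of the same logic (not a Redis client; single-process only):

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory sketch of the Redis sorted-set sliding window."""

    def __init__(self, limit, window_s, clock):
        self.limit = limit
        self.window = window_s
        self.clock = clock
        self.hits = defaultdict(deque)  # user -> timestamps, oldest first

    def allow(self, user):
        now = self.clock()
        q = self.hits[user]
        while q and q[0] <= now - self.window:  # ZREMRANGEBYSCORE equivalent
            q.popleft()
        if len(q) >= self.limit:                # ZCARD equivalent
            return False
        q.append(now)                           # ZADD equivalent
        return True
```

Memory is the tradeoff the algorithms table mentions: one entry per request in the window, versus a single counter for fixed window.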
Real example: Twitter API limits 300 tweets per 3 hours per user. Stripe rate-limits API keys by requests/second.
Load Shedding
Intentionally rejects low-priority requests when the system is overloaded so high-priority work never fails.
Different from rate limiting, which caps per-client volume. Load shedding is a global capacity decision: the system is full, non-critical work goes first.
NFR triggers: Fault tolerance, availability, scale.
The priority ladder — define this before peak traffic, not during it:
| Priority | Examples | Policy |
|---|---|---|
| P0 — never shed | Checkout, payment, OTP, auth | Always fast-path — never touch |
| P1 — shed last | Inventory check, search, cart | Queue; shed only under extreme load |
| P2 — shed first | Recommendations, personalization, analytics | Drop immediately on overload |
Overload detection — use both signals together:
- Queue depth exceeds 80% of max capacity
- p99 latency exceeds SLA threshold (e.g. 2000ms)
At 80% queue depth you still have headroom. Waiting for 100% means you're already in trouble.
Bounded request queue — hold overflow before shedding: Rather than immediately 503ing when threads are busy, queue the request briefly with a TTL:
Request arrives → thread free? → process immediately
→ thread busy → queue (max 1000, TTL 30s, ordered by priority)
→ queue full → 503 + Retry-After
Priority queue (min-heap) ordered by P0/P1/P2, then arrival time. When a thread frees, it always picks the highest-priority waiting request. Requests that expire in queue are dropped — processing a 30-second-old request returns a response to nobody.
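The bounded priority queue above can be sketched with a min-heap keyed by (priority, arrival order), dropping TTL-expired entries at poll time (illustrative class; a clock is injected for testability):

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Sketch of the bounded request queue: min-heap ordered by
    (priority, arrival); entries past their TTL are dropped on poll."""

    def __init__(self, max_size, ttl_s, clock):
        self.max_size = max_size
        self.ttl = ttl_s
        self.clock = clock
        self.heap = []
        self.seq = itertools.count()  # FIFO tiebreak within one priority

    def offer(self, priority, request):
        if len(self.heap) >= self.max_size:
            return False  # caller responds 503 + Retry-After
        heapq.heappush(self.heap,
                       (priority, next(self.seq), self.clock(), request))
        return True

    def poll(self):
        while self.heap:
            priority, _, enqueued_at, request = heapq.heappop(self.heap)
            if self.clock() - enqueued_at <= self.ttl:
                return request
            # expired in queue: nobody is waiting for this response anymore
        return None
```

Dropping expired entries lazily at poll time keeps `offer` O(log n) and avoids a separate sweeper; the cost is that the queue can briefly hold dead entries.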
P2 fallback — don't queue, return a default instantly:
if (isOverloaded()) {
return Collections.emptyList(); // UI shows nothing, no crash
}
return recommendationService.fetch(userId);
This frees CPU immediately — the request is never processed at all.
Always include Retry-After on every 503:
HTTP 503 Service Unavailable
Retry-After: 10
Without it, every client retries immediately → thundering herd → traffic spike at exactly the wrong moment. With it, clients stagger retries across 10 seconds and load spreads out.
How load shedding, virtual queue, and circuit breaker compose:
Incoming request
→ Circuit breaker: is downstream healthy? NO → fast-fail (don't even try)
→ Load shedding: are we overloaded? P2 request? → 503 immediately
→ Virtual queue: thread free? YES → process. NO → queue → drain → process
Each layer handles a different failure mode. Circuit breaker: downstream is broken. Load shedding: we're at capacity, cut by priority. Virtual queue: burst is temporary, buffer and drain.
Real example: Black Friday at any major e-commerce company — recommendations and personalization are the first features disabled. Checkout and payment are the last to be touched, typically never.
Idempotency Keys
Ensures a request produces the same result no matter how many times it's sent. Critical for retries.
NFR triggers: Durability, consistency, payments.
How it works: Client generates a unique key (UUID) and sends it with the request. Server stores (idempotency_key → result). On duplicate request, returns stored result instead of re-executing.
Why it matters: Networks fail. Clients retry. Without idempotency, a payment retry = double charge.
Real example: Stripe requires an Idempotency-Key header on all payment requests. POST to /charges twice with the same key returns the same charge object, not two charges.
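The store-and-replay mechanic can be sketched in a few lines. Real systems keep the `(key → result)` mapping in a durable shared store (DB or Redis) with a TTL, not an in-process map as here:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch of server-side idempotency: execute once per key,
// return the stored result on any duplicate.
public class IdempotencyStore {
    private final ConcurrentHashMap<String, Object> results = new ConcurrentHashMap<>();

    /** Runs the operation at most once per idempotency key. */
    public Object execute(String idempotencyKey, Supplier<Object> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

A retried payment with the same key hits the cached result instead of charging twice.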
Consistent Hashing
Distributes data across nodes such that adding or removing a node only remaps a small fraction of keys.
NFR triggers: Scale (sharding, distributed cache), fault tolerance.
How it works: Nodes and keys are both hashed onto a circular ring (0 to 2³² − 1). A key is assigned to the first node clockwise from its position. When a node is added, it takes keys only from its clockwise neighbor. When removed, its keys move to the next node clockwise.
Virtual nodes: Each physical node has multiple positions on the ring. Distributes load more evenly and handles heterogeneous hardware.
Limits & behaviour:
- Adding a node: remaps only ~1/n of keys (n = total nodes) — compared to 100% remapping with naive modulo hashing
- Virtual nodes: 100–200 per physical node is typical for even distribution
- Overhead: O(log n) lookup on the ring — negligible
Real example: Cassandra and DynamoDB use consistent hashing to partition data across nodes. Memcached clients use it to distribute cache keys.
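A ring with virtual nodes fits naturally on a sorted map — `ceilingEntry` is the "first node clockwise" lookup. A sketch (the hash function here is a toy; real implementations use MD5, MurmurHash, or similar):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a consistent-hash ring with virtual nodes.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position -> physical node
    private final int vnodes;

    public HashRing(int vnodes) { this.vnodes = vnodes; }

    private long hash(String s) {
        long h = 1125899906842597L; // toy string hash, deterministic for the sketch
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }

    public void addNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.remove(hash(node + "#" + i));
    }

    /** First node clockwise from the key's position; wrap around to the first entry. */
    public String nodeFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

Removing a node only affects keys whose ceiling entry belonged to it — everything else keeps its assignment, which is the whole point.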
API Security
How clients prove who they are, how tokens are issued and stored, and how requests are protected from tampering and replay. Covers user-facing JWT auth and server-to-server HMAC signing.
JWT Auth — The Login Flow
The complete sequence from submitting credentials to making authenticated requests.
Step 1 — User submits credentials:
POST /auth/login
Body: { "email": "user@example.com", "password": "secret" }
Step 2 — Server validates and issues tokens:
- Look up user by email in the DB
- Hash the submitted password and compare to stored hash (bcrypt)
- If valid, create two tokens — an access token and a refresh token (see Two-Token Model below for what each is and why two)
- Return both tokens to the client
Response: 200 OK
{
"access_token": "eyJhbGci...", ← JWT, short-lived (15 min)
"refresh_token": "d8f3a9bc..." ← opaque string, long-lived (30–90 days)
}
Step 3 — Client stores the tokens:
| Platform | Access Token | Refresh Token |
|---|---|---|
| Web (browser) | Memory (JS variable) | HttpOnly cookie — browser sends it automatically, JS cannot read it |
| iOS | Memory | Keychain (hardware-backed encryption) |
| Android | Memory | Android Keystore |
Never store tokens in localStorage — any XSS script on the page can read it.
- localStorage is the browser's built-in key-value store (~5 MB per site, persists across tabs and browser restarts). Any JS running on the page has full read access — no restrictions.
- XSS (Cross-Site Scripting) is an attack where malicious JavaScript gets injected into your page — through a compromised third-party script, a comment field that wasn't sanitised, or a browser extension. That script can call localStorage.getItem("access_token") and send the token to an attacker's server. Storing tokens in memory (a JS variable) instead means they disappear when the tab closes and are not accessible outside your own code.
Step 4 — Client makes authenticated requests:
GET /api/orders
Authorization: Bearer eyJhbGci... ← access token in header
The server validates the JWT signature on every request — no DB call needed. If valid, the request goes through. If expired, the client gets a 401 and silently refreshes (see Two-Token Model below).
JWT Structure — What's Inside the Token
A JWT is three base64url-encoded parts separated by dots: header.payload.signature
Header — algorithm and token type:
{
"alg": "HS256",
"typ": "JWT"
}
Payload — claims about the user (not secret — anyone can decode this):
{
"sub": "user_123", ← subject: user ID
"email": "user@example.com",
"roles": ["user"],
"iat": 1705329751, ← issued at (Unix timestamp)
"exp": 1705330651 ← expires at (15 min later)
}
Signature — proves the token hasn't been tampered with:
HMAC_SHA256(base64(header) + "." + base64(payload), SERVER_SECRET_KEY)
The server re-runs the signature calculation on every request. If the payload was modified, the signature won't match — request rejected. The server never needs to look up the DB to validate; the signature is the proof.
What to put in the payload: User ID, email, roles. Nothing sensitive (passwords, card numbers) — the payload is just base64, not encrypted.
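The sign-and-verify cycle above is a short piece of code. A sketch of HS256 using the JDK's HMAC primitives — the JSON strings and secret are illustrative, and a production verifier would also check `exp` and use a constant-time comparison (`MessageDigest.isEqual`):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of HS256 JWT signing and verification — the same recomputation
// the server runs on every request.
public class JwtHs256 {
    static String b64url(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    static byte[] hmac(String input, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(input.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String sign(String headerJson, String payloadJson, byte[] secret) {
        String signingInput = b64url(headerJson.getBytes(StandardCharsets.UTF_8)) + "."
                            + b64url(payloadJson.getBytes(StandardCharsets.UTF_8));
        return signingInput + "." + b64url(hmac(signingInput, secret));
    }

    /** Recompute the signature over header.payload — a tampered payload fails the check. */
    static boolean verify(String jwt, byte[] secret) {
        int lastDot = jwt.lastIndexOf('.');
        String expected = b64url(hmac(jwt.substring(0, lastDot), secret));
        return expected.equals(jwt.substring(lastDot + 1));
    }
}
```

Note the payload is only encoded, never encrypted — anyone can decode it, which is why nothing sensitive belongs there.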
Two-Token Model
The two-token model is not an alternative to JWT — it uses JWT. The access token is a JWT (structured, signed, self-verifying). The refresh token is a plain opaque string stored in the DB. Together they solve the tension between stateless speed and the ability to log users out.
Why this model exists — the history:
The original approach was session-based auth (one token). Server issues a session ID after login, stores the full session in DB/Redis, client stores the session ID in a cookie. Every request the server looks up the session ID in the DB to find the user.
POST /auth/login → server creates session in DB → returns session_id cookie
GET /api/orders → server looks up session_id in DB → gets user → processes request
Simple and fully revocable — to log someone out, just delete the session row. But at scale (millions of users), every API call hits the DB. That becomes a bottleneck.
JWT solved the DB lookup — validation is stateless (just verify the signature). But that created a new problem: you can't revoke a JWT. If a user gets compromised and you set expiry to 7 days, that's a 7-day window of exposure with no way to cut it short.
The two-token model is the compromise:
| Model | Revocable | DB on every request | Exposure if stolen |
|---|---|---|---|
| Session-based (one token) | Yes — delete the row | Yes | Until revoked |
| Single long-lived JWT | No | No | Until expiry (could be days) |
| Two-token (JWT + refresh) | Yes (via refresh token) | No | 15 min max |
One token for speed, one for control.
- Access token — short-lived (15 min), sent with every API request, validated stateless by the server
- Refresh token — long-lived (30–90 days), sent only to /auth/refresh, stored in DB so it can be revoked
Why two tokens: A stolen access token expires in 15 minutes. The refresh token never goes to the resource API — it only touches the auth endpoint, which can check the DB and revoke it if needed.
Refresh is silent — the user never notices:
After 15 min the client automatically calls /auth/refresh in the background. The refresh token is in an HttpOnly cookie so the browser sends it automatically — no user action needed. The user keeps browsing without any interruption.
The user only has to log in again when:
- The refresh token itself expires (30–90 days) — the "remember me" window
- The user explicitly logs out (refresh token revoked in DB)
- An admin revokes the session (compromise response)
- The user clears their cookies manually
On a site you use daily, you might go months without seeing a login screen.
Refresh flow:
[Access token expires — user doesn't see this]
→ GET /api/orders with expired access token
← 401 Unauthorized
→ POST /auth/refresh
Cookie: refresh_token=d8f3a9bc... (sent automatically by browser)
← 200 OK: { "access_token": "eyJhbGci..." } (+ new refresh token)
→ Retry GET /api/orders with new access token
← 200 OK — user gets their data, never knew anything happened
Token rotation: Issue a new refresh token on every refresh and immediately invalidate the old one. If an attacker steals a refresh token and tries to reuse it after the legitimate user already refreshed, the DB detects the reuse and can revoke the entire session.
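Rotation with reuse detection can be sketched against an in-memory map standing in for the DB (class and method names are illustrative; a real store would also persist expiry timestamps):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Sketch of refresh-token rotation with reuse detection.
public class RefreshTokens {
    private final Map<String, String> active = new HashMap<>();  // token -> session id
    private final Map<String, String> retired = new HashMap<>(); // rotated-out tokens

    public String issue(String sessionId) {
        String token = UUID.randomUUID().toString();
        active.put(token, sessionId);
        return token;
    }

    /** Rotate on every refresh; seeing a retired token again means theft. */
    public Optional<String> refresh(String token) {
        String session = active.remove(token);
        if (session == null) {
            String stolenSession = retired.get(token);
            if (stolenSession != null) revokeSession(stolenSession); // reuse detected
            return Optional.empty();
        }
        retired.put(token, session);          // old token can never be used again
        return Optional.of(issue(session));   // hand back a fresh one
    }

    private void revokeSession(String sessionId) {
        active.values().removeIf(sessionId::equals); // kill every token in the session
    }
}
```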
Interview tip: Access token = stateless speed. Refresh token = revocable control. The split is the whole point.
HMAC Request Signing
Proves a request came from a legitimate server or merchant and hasn't been tampered with in transit. Used for server-to-server calls and webhooks (Stripe, Twilio, GitHub). Not for browser users — that's JWT.
NFR triggers: Security, B2B API authentication.
How it works:
At signup, generate two keys per merchant:
- API Key (pk_live_123) — public identifier, sent with every request
- Secret Key (sk_live_abc) — private, never sent over the wire, used only for signing
Per request, the caller builds a canonical string and signs it:
Canonical string = method + path + body + timestamp + nonce
Signature = HMAC_SHA256(secretKey, canonicalString)
Headers sent:
Authorization: Bearer pk_live_123
X-Request-Timestamp: 2024-01-15T14:22:31Z
X-Request-Nonce: a1b2c3
X-Signature: sha256=7f83b1657ff1fc53...
Server verifies: look up secret key via API key → recompute signature → compare → check timestamp freshness → validate nonce hasn't been used before.
API Key = who you are. Signature = proof you're legit. Timestamp + Nonce = proof it's fresh.
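The canonical-string-plus-HMAC step looks like this in code. A sketch: the newline delimiter between fields is an assumption (the source just concatenates), but whatever delimiter you choose must match byte-for-byte on caller and server:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch of building the canonical string and computing the signature header.
public class RequestSigner {
    static String sign(String method, String path, String body,
                       String timestamp, String nonce, String secretKey) {
        // Field order and delimiters must be identical on both sides.
        String canonical = String.join("\n", method, path, body, timestamp, nonce);
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : mac.doFinal(canonical.getBytes(StandardCharsets.UTF_8)))
                hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The server looks up the secret via the API key, reruns `sign` over the received request, and compares — any change to method, path, body, timestamp, or nonce produces a different signature.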
Nonce
A random value used exactly once per request to prevent replay attacks — an attacker who captures a valid signed request cannot replay it later.
How it works: Caller generates a UUID per request, includes it in the canonical string and headers. Server stores seen nonces short-term (5–10 minutes, matching the timestamp window). If a nonce appears twice, the second request is rejected — even within a valid timestamp window.
Nonce vs Idempotency Key:
| | Nonce | Idempotency Key |
|---|---|---|
| Purpose | Security — reject replayed requests | Correctness — safe retries, no double charge |
| On duplicate | Request rejected | Cached result returned |
| Stored for | 5–15 minutes | 12–24 hours |
| Used by | API security layer | Business logic layer |
Stripe relies on HTTPS to prevent replays and uses idempotency keys for correctness — no nonce. Systems with higher security requirements add signing + nonce on top of HTTPS for defence-in-depth.
HSM (Hardware Security Module)
Tamper-resistant hardware that generates, stores, and uses cryptographic keys. Private keys never leave the device in plaintext — even a full server compromise can't extract them.
NFR triggers: Security (PCI-DSS compliance), encryption key management for payment platforms.
How it works in a payment flow:
- Browser encrypts card data using your public key (inside a secure iframe served from your domain)
- Encrypted payload sent over HTTPS — plaintext card data never exists outside the browser
- Backend passes ciphertext to HSM
- HSM decrypts internally — plaintext returned to backend only
- Backend uses it briefly for processing, then discards
- Your backend never sees the private key
Why iframe isolation matters: Browser same-origin policy isolates the iframe (served from pay.platform.com) from the merchant's page. Merchant JS cannot read what the customer types. Card data never touches merchant servers.
What to store in HSM: Card encryption private key, Root Key / Key Encryption Key (KEK), tokenization keys. What NOT to store: Merchant API secrets, TLS certificates — lower trust tier, managed separately.
Cloud options: AWS CloudHSM, AWS KMS (managed), Google Cloud HSM, Azure Key Vault HSM.
Breach comparison:
- Without HSM: attacker extracts private key → decrypts all historical card data
- With HSM: key is unextractable → attacker may intercept active requests but cannot decrypt historical data
PCI-DSS requires or strongly recommends HSMs for payment platforms. Look for FIPS 140-2 / 140-3 certification.
App Attestation (Mobile Only)
Proves to your backend that the request is coming from a genuine, unmodified version of your app — not a script, emulator, or repackaged APK. JWT tells you who the user is; attestation tells you what the client is.
Apple App Attest (iOS 14+):
- App calls Apple's DeviceCheck API → receives a cryptographic assertion
- App sends assertion with each request
- Server sends assertion to Apple's validation service
- Apple confirms: genuine device + unmodified app binary
Google Play Integrity API (Android):
- Google issues a signed verdict: MEETS_DEVICE_INTEGRITY, MEETS_BASIC_INTEGRITY, or NO_INTEGRITY
- NO_INTEGRITY → rooted device, emulator, or repackaged APK → reject or challenge
Practical rules:
- Enforce on sensitive endpoints only (login, payment, signup) — attestation is slow (~200–500ms round trip)
- Cache the result server-side per device for 30 min — don't call Apple/Google on every request
- Fail open on attestation service errors — fall back to rate limiting
- Flag NO_INTEGRITY as high-risk, don't necessarily block — legitimate users on rooted devices exist
Common Attack Vectors
| Attack | What Happens | Defense |
|---|---|---|
| Credential stuffing | Attacker tries leaked username/password pairs at scale | Rate limit login endpoint, require MFA, flag anomalous login velocity |
| JWT theft | Access token extracted from insecure storage | Short TTL (15 min) + store in memory/Keychain/Keystore, never localStorage |
| Refresh token theft | Long-lived token stolen from cookie or storage | HttpOnly cookie (not readable by JS), token rotation with reuse detection |
| MITM | Attacker intercepts traffic on public WiFi | TLS everywhere + certificate pinning on mobile |
| Replay attack | Attacker captures and replays a valid signed request | Nonce + timestamp window on HMAC-signed APIs |
| IDOR | User A modifies request to access User B's resource | Server-side ownership check on every request — never trust client-provided user IDs |
| Repackaged APK | Attacker strips SSL pinning from app, redistributes | App attestation (Google Play Integrity rejects modified binaries) |
Certificate pinning tradeoff: Pinning prevents MITM but breaks on certificate rotation — requires a forced app update. Use Public Key Pinning (pin the key, not the cert) so cert rotation doesn't break the pin.
Auth Flow — Full Picture
How login, token storage, API requests, and refresh fit together end-to-end:
[Browser / Mobile App]
│
├── 1. POST /auth/login { email, password }
│ ↓
│ [Auth Service]
│ verify password hash (bcrypt)
│ issue access token (JWT, 15 min) ──────────────────────────────┐
│ issue refresh token (opaque, 90 days, stored in DB) ───────────┤
│ ↓ │
│ Response: access_token in body, refresh_token in HttpOnly cookie │
│ │
├── 2. GET /api/resource │
│ Authorization: Bearer <access_token> ◄────────────────────────────┘
│ ↓
│ [API Gateway]
│ validate JWT signature (stateless, no DB call)
│ extract user ID + roles from payload
│ ↓
│ [Backend Service]
│ check resource ownership (IDOR prevention)
│ return data
│
├── 3. Access token expires → 401 received
│
└── 4. POST /auth/refresh
Cookie: refresh_token=... (sent automatically)
↓
[Auth Service]
look up refresh token in DB
validate not revoked, not expired
issue new access token
rotate refresh token (invalidate old, issue new)
↓
Response: new access_token
→ retry original request
Distributed Coordination
How distributed nodes agree on who's in charge, who's alive, and what the current config is.
Coordination Services — etcd, Consul, ZooKeeper
All three solve the same problems: leader election, distributed locks, config management, service discovery. They differ in design philosophy and ecosystem.
| | etcd | Consul | ZooKeeper |
|---|---|---|---|
| Consensus | Raft | Raft | ZAB (similar to Paxos) |
| Primary use | Kubernetes cluster state, Postgres HA (Patroni) | Service discovery + health checks + service mesh | Older Kafka, HBase, Hadoop |
| Data model | Key-value | Key-value + service catalog | Hierarchical (filesystem-like znodes) |
| Relevance today | High — etcd is the modern default | High — strong in service mesh space | Declining — Kafka dropped it in v3.x |
| Consistency | Strong (CP) | Strong (CP) | Strong (CP) |
What they all provide:
- Leader election — only one node acts as primary at a time
- Distributed locks — only one worker processes a job at a time
- Watch / notify — clients watch a key and are notified instantly on change
- Health-based membership — nodes register themselves, dead nodes are removed
Consul in depth — built for microservices:
Service registration and discovery is Consul's primary job. Every service registers itself on startup with its name, address, port, and tags. Other services look up healthy instances by name — via DNS (my-service.service.consul) or HTTP API. No hardcoded IPs anywhere.
Service registration flow:
payment-service starts
→ registers with Consul: "I am payment-service at 10.0.1.5:8080"
→ Consul runs health check every 10s (HTTP /health endpoint)
order-service wants to call payment-service
→ queries Consul: "give me healthy instances of payment-service"
→ Consul returns [10.0.1.5:8080, 10.0.1.9:8080]
→ order-service picks one (or uses DNS round-robin)
payment-service instance crashes
→ health check fails → instance removed from pool automatically
→ order-service's next query gets only healthy instances
Service mesh (Consul Connect): Consul can inject a sidecar proxy next to each service that handles mTLS encryption and traffic policies — without any application code changes. Services talk to their local sidecar; the sidecar handles identity, encryption, and routing.
Multi-datacenter: Built-in. Consul federates across DCs natively. Services in US-East can discover services in EU-West through the same API.
etcd vs Consul for microservices:
| | etcd | Consul |
|---|---|---|
| Best for | Kubernetes cluster state, leader election | Microservice networking, service discovery |
| Health checking | No | Yes — active checks every N seconds |
| Service mesh | No | Yes (Connect sidecar) |
| Multi-DC | No | Yes — native federation |
| Discovery API | Manual (just a KV store) | First-class (DNS + HTTP) |
Kubernetes shops tend to use etcd (it's already there) + kube-dns for discovery. Non-Kubernetes microservice shops reach for Consul.
ZooKeeper — legacy context:
Built at Yahoo in 2007, open-sourced in 2008 — the original distributed coordination tool. Uses ZAB (ZooKeeper Atomic Broadcast), a custom consensus protocol similar to Paxos. Became the backbone for Kafka, HBase, and Hadoop.
Key reasons it's being replaced:
- Kafka dropped it in v3.x (KRaft mode): ZooKeeper was a scaling bottleneck — each partition change required a ZooKeeper round-trip. Kafka now has built-in Raft consensus.
- Heavier to operate: Requires a separate ZooKeeper cluster with its own JVM processes.
- More complex API: Hierarchical filesystem model (znodes) is harder to work with than etcd's flat key-value model.
You'd still see it in legacy Hadoop/HBase stacks or older Kafka deployments (pre-3.x). You wouldn't choose it for a new system today.
Eureka — Netflix OSS, eventual consistency:
Service discovery only — no locks, no KV store, no active health checking. Services self-report their health via periodic heartbeat. Eureka favors availability over consistency: if Eureka's replication network splits, each node keeps serving its own view rather than rejecting queries.
- If a service crashes without deregistering, it stays in the pool for ~90 seconds (heartbeat timeout) before eviction — you can get routed to a dead instance briefly
- Lives in the Spring Cloud / Netflix OSS ecosystem — common in Java microservice shops on AWS
- Simpler to run than Consul but weaker guarantees — acceptable when occasional routing to a dead instance is survivable and you have retry logic
In an interview: Say "I'd use etcd for leader election" or "Consul for service discovery in a microservices stack." ZooKeeper is worth knowing for context (Kafka history) but wouldn't appear in a new design today. Eureka is fine to mention if you're discussing Netflix-style eventual-consistency tradeoffs.
Gossip Protocol
How nodes in a cluster discover each other and spread state without a central coordinator. Used by Cassandra, Redis Cluster, DynamoDB, Consul.
How it works:
- Every node maintains a list of known peers with their state (alive/dead, metadata).
- Every N seconds, each node picks K random peers and exchanges its state list.
- New information spreads like a rumor — each exchange infects more nodes.
- Within O(log n) rounds, all nodes know about all changes.
Node A learns Node D joined
→ A tells B and C
→ B tells E and F, C tells G and H
→ All nodes know about D in log(n) rounds
What it's used for:
- Failure detection — if a node stops gossiping, peers mark it as suspect, then dead
- Membership — who's in the cluster right now?
- Metadata propagation — spread token ranges (Cassandra), shard assignments, config changes
Properties:
- Fully decentralised — no single point of failure
- Eventually consistent — all nodes converge on the same view, but not instantly
- Scales well — O(log n) convergence regardless of cluster size
- Not suitable for strong consistency — use etcd/Raft for that
Real examples: Cassandra uses gossip for all cluster membership. Redis Cluster uses it to propagate slot assignments. Consul uses it for node health state.
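The O(log n) convergence is easy to see in a toy simulation — not any library's protocol, just the "each round, every informed node tells K random peers" rule with arbitrary node and fanout counts:

```java
import java.util.Random;

// Toy gossip simulation: counts rounds until a rumor reaches every node.
public class GossipSim {
    /** Rounds until all nodes know, starting from node 0. Seeded for determinism. */
    static int roundsToConverge(int nodes, int fanout, long seed) {
        Random rnd = new Random(seed);
        boolean[] knows = new boolean[nodes];
        knows[0] = true;
        int rounds = 0;
        while (!allKnow(knows)) {
            boolean[] next = knows.clone();
            for (int n = 0; n < nodes; n++) {
                if (!knows[n]) continue;
                for (int k = 0; k < fanout; k++)
                    next[rnd.nextInt(nodes)] = true; // each informed node infects K random peers
            }
            knows = next;
            rounds++;
        }
        return rounds;
    }

    static boolean allKnow(boolean[] knows) {
        for (boolean b : knows) if (!b) return false;
        return true;
    }
}
```

With 64 nodes and fanout 3, convergence takes on the order of log(n) rounds — the informed set roughly multiplies each round until the last stragglers are picked up.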
Raft Consensus
The algorithm behind etcd, CockroachDB, Consul, and Kafka KRaft. Worth understanding conceptually — you'll reference it when explaining how leader election and replication work.
Core idea: A cluster of N nodes elects one leader. All writes go through the leader. Leader replicates to a majority (quorum) before committing. As long as a majority of nodes are alive, the cluster makes progress.
Three roles:
- Leader — handles all writes, sends heartbeats to followers
- Follower — accepts log entries from leader, votes in elections
- Candidate — node that has timed out waiting for heartbeats and is requesting votes
Leader election:
- Follower receives no heartbeat → becomes Candidate → increments term → requests votes
- First candidate to get majority votes becomes Leader
- Leader sends heartbeats — if followers stop hearing them, new election starts
Log replication:
- Client sends write to Leader
- Leader appends to its log, sends to all Followers
- Once majority confirm → Leader commits → responds to client
- Followers apply committed entries
Quorum = majority = ⌊n/2⌋ + 1. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures.
In an interview: "We'd run etcd with a 3-node cluster using Raft — tolerates one node failure, leader election completes in < 1 second."
Distributed Locks
Prevents two workers from processing the same job or two leaders from writing simultaneously.
Why not application-level locking?
The naive approach: each service instance tracks a set of "locked" resource IDs in its own memory. When it locks a resource, it starts a local timer. When the timer expires, it releases the lock.
This breaks down immediately in a multi-instance deployment:
- No coordination — two instances can both see a resource as unlocked simultaneously and both acquire it. Race condition is guaranteed.
- Crash orphans the lock — if an instance crashes while holding a lock, no other instance knows about it. The resource stays locked indefinitely.
- Scales inversely — the more instances you run, the higher the probability of a collision. Exactly the wrong direction.
Concrete example: Ride Matching Service with 10 instances. A rider requests a match. Instance A locks driver:42 and sends a request. Instance B (handling a different rider) also locks driver:42 at the same moment — neither instance knew the other had it. Driver gets two simultaneous ride requests.
The fix: move the lock out of instance memory into a shared, atomic, external store — which is exactly what Redis SETNX provides.
Why not just use a database transaction lock? DB transaction locks are designed for millisecond-duration operations — they hold for the life of a transaction. Distributed locks are for longer-duration holds, like reserving a concert ticket for 10 minutes while a user completes checkout. Using a DB transaction for that would block the table and kill throughput.
Real-world use cases:
- Ticketing / e-commerce checkout — lock a seat or limited-stock item for ~10 minutes while user completes payment; no one else can grab it
- Ride-share driver assignment — lock a nearby driver when a rider requests; prevents the driver being matched to two riders simultaneously
- Distributed cron jobs — ensure a scheduled task (e.g. daily report aggregation) runs on exactly one server even though many servers are eligible
- Auction bidding — briefly lock an item in the final seconds of bidding to process each bid sequentially and update the highest bid atomically
Locking granularity: You can lock a single resource (one ticket) or a group (all seats in a section). Finer granularity = more concurrency but more lock management overhead.
Option 1 — Redis SETNX (simple, fast):
SET lock:job_123 worker_A NX PX 30000
NX = only set if not exists. PX 30000 = expire in 30 seconds (prevents deadlock if worker crashes).
- Fast (~1ms)
- Single Redis node = single point of failure
- Clock drift can cause two nodes to hold the lock simultaneously
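The SETNX semantics and the ownership check on release can be sketched against an in-memory map standing in for Redis (the PX expiry is the one thing this mock omits — in Redis the TTL is what prevents a crashed holder from deadlocking everyone):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of SETNX-style lock semantics. The release mirrors the standard
// Lua-script pattern: only the holder that set the value may delete the key.
public class SimpleLock {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** SET key owner NX — true only if nobody else holds the lock. */
    public boolean acquire(String resource, String owner) {
        return store.putIfAbsent("lock:" + resource, owner) == null;
    }

    /** Delete only if we still own it — never free someone else's lock. */
    public boolean release(String resource, String owner) {
        return store.remove("lock:" + resource, owner); // atomic compare-and-delete
    }
}
```

The owner check on release matters: without it, a worker whose lock already expired could delete the lock a different worker now holds.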
Option 2 — Redlock (Redis, multiple nodes): Acquire lock on majority (3 of 5) Redis nodes within a time window. More robust than single-node but controversial — clock drift still a theoretical issue.
Option 3 — etcd lock (strongest guarantee): etcd uses Raft — lock acquisition is linearizable. No clock drift issues. Slower than Redis (~5–10ms) but correct.
The Redlock controversy: Martin Kleppmann (author of Designing Data-Intensive Applications) argued that Redlock is unsafe because a process can pause mid-execution (JVM GC pause, OS scheduling) after acquiring the lock. By the time it resumes, the TTL may have expired across all Redis nodes, and another process has acquired the same lock. Both now believe they hold it — two workers execute the critical section simultaneously. Redlock has no way to detect this.
Antirez (Redis creator) rebutted that the scenario requires extreme GC pauses unlikely in practice. The debate was never fully resolved. The practical takeaway: Redlock is fine for workloads where occasional double execution is survivable. For true safety, use etcd with fencing tokens.
Fencing tokens — the correct solution for strong safety:
Every etcd write returns a monotonically increasing revision number. You pass this number to the protected resource as a token. The resource rejects any request whose token is lower than the last one it saw — so even if a stale process wakes up after its lock expired, the resource rejects it.
Process A acquires lock → gets token 42
Process A pauses (GC)
Lock expires → Process B acquires lock → gets token 43 → writes with token 43
Process A wakes up → tries to write with token 42 → resource rejects (42 < 43) ✓
Redis has no equivalent — it cannot detect that a process holding a "valid" lock is actually stale. If you need this level of correctness, use etcd.
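The resource-side check is the whole mechanism — a few lines. A sketch (the class name and fields are illustrative; in practice the "resource" is whatever storage the lock protects):

```java
// Sketch of a resource enforcing fencing tokens: any write carrying a token
// lower than the highest seen is from a stale lock holder — reject it.
public class FencedResource {
    private long highestToken = -1;
    private String value;

    public synchronized boolean write(long token, String newValue) {
        if (token < highestToken) return false; // stale holder woke up late
        highestToken = token;
        value = newValue;
        return true;
    }

    public synchronized String read() { return value; }
}
```

This reproduces the scenario above: token 42 writes, token 43 writes, then the GC-paused holder of token 42 retries and is rejected.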
When to use each:
- Redis SETNX — job deduplication, idempotency, rate limiting. OK if occasional double processing is survivable.
- etcd lock — leader election, critical section where correctness is mandatory (financial, inventory).
Deadlocks: Occur when two processes each hold one lock and wait for the other. Process A holds lock-X and wants lock-Y; Process B holds lock-Y and wants lock-X — both wait forever. Prevention: always acquire multiple locks in a consistent global order, and keep lock acquisition centralized in one place in code so the ordering is easy to enforce and audit.
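The "consistent global order" rule reduces to sorting resource IDs before acquisition — if every process acquires in the same order, no wait cycle can form. A trivial sketch:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch: sort resource IDs so every process acquires locks in the same
// global order — lock-X always before lock-Y, for everyone.
public class OrderedLocking {
    static List<String> acquisitionOrder(Collection<String> resourceIds) {
        List<String> ordered = new ArrayList<>(resourceIds);
        Collections.sort(ordered);
        return ordered;
    }
}
```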
Read-path complexity with Redis locks: When reservations live in Redis rather than the DB, reads (e.g. showing a seat map) need to know which seats are locked. Two patterns:
- Redis Set per event: Maintain an event:{eventId}:reserved Set alongside each lock. Seat map query does one Redis round-trip to get all reserved IDs — fast in practice.
- Write-through to DB: On lock acquire, write a "reserved" status to the DB row. Redis TTL remains the source of truth for expiry; a periodic sweep cleans up stale DB reservations. Seat map reads from DB normally — no extra Redis call.
Concurrency Control at the Database Layer
Distributed locks (Redis, etcd) handle the coarse-grained, longer-duration reservation window. But the actual write — when the ticket is finally booked — needs DB-level concurrency control as the last line of defence. Two mechanisms:
OCC — Optimistic Concurrency Control
Assumes conflicts are rare. Don't lock upfront — detect collisions at write time using a version field.
How it works:
- Read the row — it has version = 5
- Do your work (fill in payment details, build the update)
- Write with a version check: UPDATE ticket SET status='BOOKED', version=6 WHERE id=123 AND version=5
- If another transaction already incremented to version 6, your WHERE matches 0 rows → conflict detected → throw error → caller retries or issues refund
Hibernate / JPA makes this automatic:
@Entity
public class Ticket {
    @Id
    private Long id;

    @Version
    private int version; // Hibernate reads, checks, and increments this automatically
}
Hibernate generates the WHERE version = X check on every UPDATE. If the check fails, it throws OptimisticLockException. Works the same in any JPA provider (EclipseLink, etc.) and Spring Data JPA inherits it directly.
Best for: High-read, low-conflict workloads. Ticket booking (most users don't race), inventory, order creation. Fails loudly on conflict — caller must handle the retry or compensate (e.g. automatic refund).
Pessimistic locking — SELECT FOR UPDATE
Acquires a row-level lock at read time. No other transaction can write that row until you commit or roll back.
BEGIN;
SELECT * FROM ticket WHERE id = 123 FOR UPDATE; -- row locked now
-- do work
UPDATE ticket SET status = 'BOOKED' WHERE id = 123;
COMMIT; -- lock released
DB support:
| DB | Support | Notes |
|---|---|---|
| PostgreSQL | Yes | Full row-level locking |
| MySQL / MariaDB | Yes | InnoDB engine only |
| Oracle | Yes | Also supports SKIP LOCKED |
| SQL Server | Different syntax | WITH (UPDLOCK, ROWLOCK) hints |
| SQLite | No | File-level locking only |
| MongoDB | No | Use findAndModify / conditional writes |
| DynamoDB | No | Use ConditionExpression (same OCC idea, different API) |
| Cassandra | No | Lightweight transactions (IF clause) — expensive, avoid |
SKIP LOCKED variant (PostgreSQL, MySQL 8+, Oracle): Skips rows already locked by another transaction instead of blocking. Perfect for job queues — multiple workers each grab a different job row without blocking each other.
SELECT * FROM jobs WHERE status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 1;
Best for: Short-duration critical sections where blocking is acceptable — final booking step, inventory decrement, job queue polling. Not for the 10-minute reservation window (blocks the row for too long, kills concurrency).
Coupling caveat: SELECT FOR UPDATE ties you to a relational DB that supports it. SQL Server uses different syntax (WITH (UPDLOCK, ROWLOCK) hints). SQLite, DynamoDB, MongoDB, and Cassandra don't support it at all — you'd have to rearchitect concurrency control entirely if you ever migrated. Even with an ORM, LockModeType.PESSIMISTIC_WRITE (Hibernate/JPA) abstracts the syntax but not the requirement for a relational DB with row locking.
OCC has no such coupling — the version check pattern (WHERE version = X) is implementable on any storage engine: relational DBs, DynamoDB (ConditionExpression), MongoDB (findAndModify with a version filter), even Redis with a Lua script. Prefer OCC by default; reach for SELECT FOR UPDATE only when you specifically need blocking semantics (e.g. SKIP LOCKED for job queues, where OCC would cause constant retry storms under high contention).
The layered model for ticketing (or any reservation system):
Layer 1 — Redis distributed lock (TTL ~10 min)
Holds the seat for one user during checkout
Fast, doesn't touch DB, releases on TTL or explicit unlock
Layer 2 — DB OCC (version check) or SELECT FOR UPDATE
Fires at the moment of final booking write
Safety net: if Redis lock fails or TTL races, DB ensures only one write wins
If both layers fail → idempotency key on the request as last resort
TTL-expires-mid-payment: If User A's lock TTL expires during payment, User B can grab it. When both try to write, OCC ensures only one succeeds. The loser gets a conflict error → issue automatic refund via the payment processor. Mitigation: set TTL generously (10 min for checkout) and extend the TTL when payment is initiated.
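A minimal in-memory sketch of the Layer 1 lock semantics: acquire with a TTL, token-checked release. In real Redis this is `SET key token NX EX ttl`, with a Lua script for the atomic token-checked release; the TTLLock class and seat key here are illustrative:

```python
# Toy reservation lock with TTL and per-holder token, simulated in memory.
# The token check prevents User A from releasing a lock that already
# expired and was re-acquired by User B.

import time

class TTLLock:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.locks = {}  # key -> (token, expires_at)

    def acquire(self, key, token, ttl_s):
        entry = self.locks.get(key)
        now = self.clock()
        if entry and entry[1] > now:
            return False  # held and not expired
        self.locks[key] = (token, now + ttl_s)
        return True

    def release(self, key, token):
        entry = self.locks.get(key)
        if entry and entry[0] == token:  # only the current holder may release
            del self.locks[key]
            return True
        return False

now = [0.0]
lock = TTLLock(clock=lambda: now[0])

assert lock.acquire("seat:12A", token="userA", ttl_s=600)
assert not lock.acquire("seat:12A", token="userB", ttl_s=600)  # seat held

now[0] += 601  # A's TTL expires mid-checkout
assert lock.acquire("seat:12A", token="userB", ttl_s=600)      # B grabs it
assert not lock.release("seat:12A", token="userA")             # A's stale release is a no-op
```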
Service Discovery
How services find each other's IP addresses in a dynamic environment where instances come and go.
The problem: Service A needs to call Service B. Service B runs on 10 dynamic instances whose IPs change on every deploy. Hardcoding IPs doesn't work.
Two patterns:
Client-side discovery:
Service A → query Consul/etcd → get list of Service B instances → pick one → call directly
Service A is responsible for load balancing. More control, more complexity in the client.
Server-side discovery:
Service A → call load balancer → LB queries registry → routes to Service B instance
Client is dumb — just hits one endpoint. LB handles discovery. Kubernetes Service works this way.
Health checks: Registry continuously health-checks registered instances. Unhealthy instances removed from the pool automatically. New instances register themselves on startup.
Real examples:
- Kubernetes: pods register with kube-dns, Services route via iptables rules
- AWS ECS + ALB: tasks register with target group, ALB discovers healthy tasks
- Consul: services self-register with health check URLs, clients query Consul DNS
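The client-side pattern can be sketched with a toy registry and a round-robin picker in the client (Registry and Client are illustrative stand-ins for Consul/etcd plus a client-side library):

```python
# Client-side discovery sketch: the registry tracks health, the client
# queries it and load-balances across healthy instances itself.

import itertools

class Registry:
    def __init__(self):
        self.instances = {}  # service -> {addr: healthy}

    def register(self, service, addr):
        self.instances.setdefault(service, {})[addr] = True

    def mark_unhealthy(self, service, addr):
        self.instances[service][addr] = False  # failed a health check

    def healthy(self, service):
        return [a for a, ok in self.instances.get(service, {}).items() if ok]

class Client:
    def __init__(self, registry, service):
        self.registry = registry
        self.service = service
        self.rr = itertools.count()

    def pick_instance(self):
        pool = self.registry.healthy(self.service)
        if not pool:
            raise RuntimeError("no healthy instances")
        return pool[next(self.rr) % len(pool)]  # round-robin in the client

reg = Registry()
for addr in ("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"):
    reg.register("service-b", addr)
reg.mark_unhealthy("service-b", "10.0.0.2:8080")  # removed from the pool

client = Client(reg, "service-b")
picked = {client.pick_instance() for _ in range(4)}
```

In server-side discovery, this picking logic moves out of the client and into the load balancer.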
Change Data Capture (CDC)
Streams every database insert, update, and delete to an immutable event log — without changing application code. The DB stays optimised for CRUD; the event stream provides audit durability, downstream sync, and analytics.
NFR triggers: Durability, auditability, compliance (PCI-DSS, SOX), event-driven architecture.
How it works:
[App] → [Primary DB] → [CDC Connector reads WAL/binlog] → [Kafka] → [Consumers]
(mutable, fast) (immutable, durable)
- App writes to DB normally — no code changes needed
- CDC connector (Debezium, AWS DMS) reads the database transaction log (WAL for Postgres, binlog for MySQL)
- Every committed change emitted as a before/after event to Kafka
- Consumers process events independently: audit service, analytics, reconciliation, webhook delivery
Example CDC events:
{ "op": "insert", "table": "payment_intents",
"after": { "id": "pi_123", "status": "created", "amount": 2500 } }
{ "op": "update", "table": "payment_intents",
"before": { "id": "pi_123", "status": "created" },
"after": { "id": "pi_123", "status": "authorized" } }
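A sketch of what a downstream consumer might do with these events, here building an append-only audit record per change (in production the events would arrive from Kafka; the field names follow the examples above):

```python
# Toy CDC consumer: turn before/after events into an append-only audit trail.

events = [
    {"op": "insert", "table": "payment_intents",
     "after": {"id": "pi_123", "status": "created", "amount": 2500}},
    {"op": "update", "table": "payment_intents",
     "before": {"id": "pi_123", "status": "created"},
     "after": {"id": "pi_123", "status": "authorized"}},
]

audit_log = []  # append-only: never updated, never deleted

for ev in events:
    after = ev.get("after") or {}
    before = ev.get("before") or {}
    # Record only the fields that actually changed in this event.
    changed = {k: v for k, v in after.items() if before.get(k) != v}
    audit_log.append({
        "table": ev["table"],
        "op": ev["op"],
        "row_id": after.get("id") or before.get("id"),
        "changed_fields": changed,
    })
```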
Key guarantee: If a change commits to the DB, it appears in the event stream. CDC operates at the database log level — no application code can accidentally skip it.
Common use cases:
- Audit trails for PCI-DSS, SOX, GDPR compliance
- Syncing operational DB to analytics / data warehouse
- Cache invalidation (update Redis when DB row changes)
- Cross-service replication without dual writes
- Reconciliation (correlate internal events with external payment network events)
Tools:
- Debezium — open source, supports Postgres/MySQL/MongoDB/Oracle/SQL Server. Two deployment modes (see below).
- AWS DMS — managed CDC to AWS targets. No Kafka required.
- Kafka Connect — framework that hosts Debezium connectors in the Kafka ecosystem.
- Fivetran / Airbyte — managed CDC for analytics pipelines.
Debezium deployment modes — Kafka is not required:
| Mode | Architecture | When to use |
|---|---|---|
| Debezium + Kafka Connect | WAL → Debezium → Kafka → consumers | Kafka already in your stack. Multiple downstream consumers (ES + cache + audit all read independently). Need replay if a consumer goes down. |
| Debezium Server | WAL → Debezium Server → sink directly | Don't want Kafka ops overhead. Single downstream target. Sinks: Kinesis, Pub/Sub, Redis Streams, HTTP webhook, Elasticsearch directly. |
What you lose without Kafka in the middle:
- Replay — if Elasticsearch goes down for 2 hours, with Kafka it catches up from its last offset on recovery. Without Kafka, those events are gone.
- Fan-out — with Kafka, multiple consumers read the same stream independently (ES + cache invalidation + audit). Without Kafka, you need separate Debezium instances per sink.
- Durability buffer — Kafka absorbs spikes if the sink is slow or temporarily down.
Interview guidance:
- If Kafka is already in your stack → Kafka Connect + Debezium. Fan-out and replay come free.
- If you only need one downstream target and want less infra → Debezium Server directly to that sink.
- On AWS → AWS DMS is a fully managed alternative: no Debezium to operate, integrates natively with RDS, Aurora, and DynamoDB as targets.
- Always mention the replication slot gotcha for PostgreSQL: Postgres holds WAL segments until Debezium confirms it has read them. If Debezium falls behind or stops, the WAL grows without bound and can fill your disk. Monitor replication slot lag as a first-class alert.
Retention: Hot retention on Kafka (7–30 days for fast replay). Archive to S3 with write-once policy for permanent compliance storage.
Tradeoffs:
- Adds infrastructure (Kafka, connectors)
- Eventual consistency between DB and event stream (~milliseconds)
- Schema changes require planning (adding columns can break consumers)
CDC as a SPOF concern: If the CDC connector stops, DB writes continue but events stop flowing — missed audit records downstream. Mitigations: run redundant CDC instances, aggressive lag monitoring (alert within seconds), maintain replay procedures from WAL offsets to backfill missed events.
Interview one-liner: "We use CDC to stream DB changes to an immutable Kafka log — audit durability without changing application code or impacting primary DB performance."
Object Storage & File Upload
Patterns for handling user-generated file uploads at scale — photos, videos, documents — without routing large payloads through your backend servers.
Terminology used in this section:
- File — the logical entity tracked in your database (has a fileId, metadata, status)
- Object — the physical storage unit in S3 (what S3 actually stores)
- Upload — the act of transferring bytes from client to S3
- Presigned URL — a short-lived, signed URL that grants temporary permission to read or write a specific S3 object
Simple Upload — Presigned URL Pattern
Never proxy file uploads through your backend. At scale, routing even 10MB photos through your API servers exhausts bandwidth, memory, and connection limits. Instead, generate a short-lived presigned URL and let the client upload directly to S3.
Flow:
1. App requests upload permission from backend
POST /uploads { trip_id, photo_type: "pickup", file_size, mime_type }
→ Backend validates: does this user own this trip?
2. Backend generates a presigned S3 URL
→ Scoped to exact key: trips/{trip_id}/pickup/{uuid}.jpg
→ Short TTL: 5–15 minutes
→ Returns { upload_url, key } to client
3. Client uploads directly to S3
PUT {upload_url} ← file bytes go here, not to your servers
→ No auth headers needed — permissions baked into the URL
4. S3 triggers async processing (event notification)
→ s3:ObjectCreated event → SQS/Lambda → processing pipeline
→ Do NOT rely on client notifying backend — client can crash, go offline, or lie
S3 key design matters: Scope the presigned URL to the exact key (trips/{trip_id}/pickup/{uuid}.jpg), not a prefix. Otherwise a malicious user could overwrite another user's files by constructing a different key. The key should include a UUID to prevent collisions.
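The backend's validate-then-scope step might look like this sketch. The size limit and allowed types are assumptions; the returned key is what a real implementation would pass to boto3's `generate_presigned_url("put_object", ...)`:

```python
# Sketch of request validation + exact-key construction before signing.
# Limits, type map, and function names are illustrative.

import uuid

ALLOWED_TYPES = {"image/jpeg": ".jpg", "image/png": ".png"}
MAX_SIZE = 10 * 1024 * 1024  # 10 MB, illustrative

def build_upload_key(trip_id, photo_type, file_size, mime_type, user_owns_trip):
    if not user_owns_trip:
        raise PermissionError("user does not own this trip")
    if mime_type not in ALLOWED_TYPES:
        raise ValueError("unsupported content type")
    if file_size > MAX_SIZE:
        raise ValueError("file too large")
    # UUID prevents collisions; scoping the signature to this exact key
    # prevents a malicious client from overwriting someone else's object.
    ext = ALLOWED_TYPES[mime_type]
    return f"trips/{trip_id}/{photo_type}/{uuid.uuid4()}{ext}"

key = build_upload_key("t42", "pickup", 2_000_000, "image/jpeg", user_owns_trip=True)
```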
For large files, a single PUT breaks down in four ways. That's where multipart upload comes in.
Multipart Upload — Large Files
For files larger than ~100MB, the same presigned URL pattern applies — but instead of a single PUT, the file is split into chunks and each chunk gets its own presigned URL.
Why a single upload fails for large files
A single-request upload of a large file fails in four ways:
- Timeouts — servers, load balancers, and clients have timeout limits (typically 30s–2 min). A 50GB file on a 100Mbps connection takes 1.1 hours (50GB × 8 / 100Mbps = 4,000s). The connection times out long before the upload finishes.
- Payload size limits — Amazon API Gateway hard-limits requests to 10MB with no override. Most infrastructure has similar caps. A 50GB file simply cannot be sent in a single request.
- Network interruptions — if the connection drops mid-upload, the entire transfer must restart from zero.
- No progress visibility — the user has no indication of how far along the upload is or whether it's working.
The fix: split the file into chunks and upload each independently.
Identifying files by content (fingerprinting)
Large file uploads introduce two new problems: how do you resume an interrupted upload, and how do you avoid uploading the same file twice? The answer is fingerprinting — computing a SHA-256 hash of the file's content before any upload begins.
You cannot identify a file by its name — two different users can upload files with the same name, and the same user can rename a file. Instead, compute a fingerprint: a SHA-256 hash of the file's content. Same content = same fingerprint, regardless of filename or who uploaded it.
Two fields in your metadata table serve different purposes:
FileMetadata table:
| fileId (UUID) | fingerprint (SHA-256) | ownerId | status |
|---------------|----------------------------|---------|-----------|
| f1 | a3f5c8... | user1 | uploaded |
| f2 | a3f5c8... | user2 | uploading |
- fileId — unique identifier for this file record (UUID, always unique)
- fingerprint — hash of the content, used for deduplication and resumability
The fingerprint answers two questions before any upload begins:
- "Have I uploaded this file before?" — check if a record with this fingerprint exists for this user
- "Is there an in-progress upload I can resume?" — check if a record with this fingerprint has status uploading
Chunk-level fingerprinting: Each chunk also gets its own SHA-256 hash. This lets you identify exactly which chunks have already been uploaded during a resume, without re-uploading or re-checking all of them.
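Fingerprinting is just hashing, whole file and per chunk. A sketch with hashlib (the chunk size here is tiny for illustration; production uses 5–10 MB):

```python
# Whole-file and per-chunk SHA-256 fingerprints, as used for dedup and resume.

import hashlib

def fingerprints(data, chunk_size):
    whole = hashlib.sha256(data).hexdigest()
    chunks = [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
              for i in range(0, len(data), chunk_size)]
    return whole, chunks

file_a = b"x" * 12 + b"y" * 12
file_b = b"x" * 12 + b"z" * 12  # same first half, different second half

whole_a, chunks_a = fingerprints(file_a, chunk_size=12)
whole_b, chunks_b = fingerprints(file_b, chunk_size=12)

# Whole-file hashes differ, but the shared first chunk hashes the same,
# so a resume or dedup check can skip re-uploading chunk 0.
```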
How S3 multipart upload works
S3 multipart upload has three API calls that tie everything together:
1. CreateMultipartUpload — your backend calls this before any chunk is uploaded. S3 returns an uploadId — a session identifier that ties all the parts together. Every presigned URL generated for a part must include this uploadId and the part's number.
2. Upload parts — the client uploads each chunk directly to S3 using its presigned URL (which embeds uploadId + partNumber). S3 stores each part separately and returns an ETag for each.
3. CompleteMultipartUpload — once all parts are uploaded, your backend calls this with the list of part numbers and their ETags. S3 assembles all parts into a single object. Only after this call does the object exist as a complete unit in S3.
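A toy in-memory version of the three calls, showing how uploadId, part numbers, and ETags tie together. FakeS3 is a stand-in, not the real API surface; its ETags are the MD5 of each part's bytes, which mimics S3's behaviour for plainly uploaded parts:

```python
# In-memory simulation of CreateMultipartUpload / UploadPart /
# CompleteMultipartUpload.

import hashlib
import uuid

class FakeS3:
    def __init__(self):
        self.sessions = {}  # uploadId -> {"key": ..., "parts": {n: (etag, bytes)}}
        self.objects = {}   # key -> assembled bytes

    def create_multipart_upload(self, key):
        upload_id = str(uuid.uuid4())
        self.sessions[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, body):
        etag = hashlib.md5(body).hexdigest()
        self.sessions[upload_id]["parts"][part_number] = (etag, body)
        return etag

    def complete_multipart_upload(self, upload_id, parts):
        # parts: list of (partNumber, etag) the backend believes were uploaded
        session = self.sessions.pop(upload_id)
        body = b""
        for number, etag in sorted(parts):
            stored_etag, data = session["parts"][number]
            assert stored_etag == etag, "ETag mismatch"
            body += data  # parts assembled in part-number order
        self.objects[session["key"]] = body

s3 = FakeS3()
uid = s3.create_multipart_upload("videos/v1.bin")
etags = [(n, s3.upload_part(uid, n, chunk))
         for n, chunk in [(1, b"AAAA"), (2, b"BBBB"), (3, b"CC")]]
s3.complete_multipart_upload(uid, etags)
```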
Full upload flow end to end
[Client]
1. Chunk file into 5–10 MB pieces
Compute SHA-256 fingerprint for the whole file
Compute SHA-256 fingerprint for each chunk
2. GET /files?fingerprint={hash}
→ Does a file with this fingerprint already exist for this user?
→ If status = "uploaded" → skip, file already exists, no upload needed
→ If status = "uploading" → fetch existing chunk statuses, resume from where it left off
→ If not found → proceed to initiate
3. POST /files/multipart-init
{ fingerprint, filename, totalChunks, chunkFingerprints[] }
→ Backend calls S3 CreateMultipartUpload → gets uploadId
→ Backend generates a presigned URL per chunk (each scoped to uploadId + partNumber)
→ Backend saves FileMetadata (status: "uploading") + FileChunks (all status: "pending")
← Returns { fileId, uploadId, presignedUrls[] }
4. For each chunk (in parallel):
PUT {presignedUrl} ← chunk bytes go directly to S3
← S3 returns ETag for that chunk
PATCH /files/{fileId}/chunks
{ chunkId, etag }
→ Backend calls S3 ListParts → verifies ETag actually exists in S3
→ Only if confirmed: update FileChunks status to "uploaded"
5. Once all chunks are "uploaded":
→ Backend calls S3 CompleteMultipartUpload with all partNumbers + ETags
→ S3 assembles all parts into one object
→ Backend updates FileMetadata status to "uploaded"
→ Processing pipeline triggered (virus scan, resize, etc.)
At this point:
- All chunks have been uploaded directly to S3
- Each ETag has been verified against S3's ListParts API
- The DB reflects confirmed chunk state
- Your backend has called CompleteMultipartUpload
- The object exists as a single unit in S3, ready for processing
Verifying chunk uploads with ETags
S3 and your backend are completely separate systems — when a client uploads a chunk to S3, your backend has no idea it happened. The client bridges the gap by sending a PATCH after each chunk upload.
PATCH (not POST or PUT) because you're partially updating an existing resource — just the status of one chunk, not creating or replacing the whole file.
Your DB tracks chunk state:
FileChunks table:
| fileId | chunkId | fingerprint | etag | status |
|--------|---------|-------------|---------------------|----------|
| f1 | chunk1 | 3d4e9a... | d41d8cd98f00b204... | uploaded |
| f1 | chunk2 | 7bc12f... | null | pending |
| f1 | chunk3 | 9fa33c... | null | pending |
The trust problem: Nothing stops a malicious user from sending a PATCH marking all chunks as uploaded without uploading anything. Your DB says complete; S3 has nothing. The file is broken and your system is in an inconsistent state.
The fix — ETag verification: ETags can't be faked — S3 generates them from the actual bytes received. Your backend verifies each ETag by calling S3's ListParts API, which returns all parts S3 actually holds for that uploadId. A fabricated ETag won't match — the fraud is caught before the DB is updated.
Why not use S3 event notifications? S3 only fires an ObjectCreated event when CompleteMultipartUpload is called — not for individual part uploads. You cannot use S3 events for per-chunk progress tracking. ListParts is the right API.
Trust but verify: The client drives progress updates for real-time UX. Your backend independently verifies against S3 before committing state to the DB.
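The verification step in miniature: only a client-reported ETag that matches what ListParts actually returned gets committed to the DB. The ListParts response shape is simulated here; the ETag values are illustrative:

```python
# Verify a client-reported chunk ETag against a (simulated) S3 ListParts
# response before marking the chunk "uploaded" in the DB.

def verify_chunk(client_etag, part_number, list_parts_response):
    # list_parts_response mimics ListParts: a list of {PartNumber, ETag}
    actual = {p["PartNumber"]: p["ETag"] for p in list_parts_response}
    return actual.get(part_number) == client_etag

s3_parts = [{"PartNumber": 1, "ETag": "d41d8cd98f00b204"},
            {"PartNumber": 2, "ETag": "9e107d9d372bb682"}]

honest = verify_chunk("9e107d9d372bb682", 2, s3_parts)      # matches S3
fabricated = verify_chunk("deadbeefdeadbeef", 3, s3_parts)  # part never uploaded
```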
Downloads don't need chunking
Once CompleteMultipartUpload is called, S3 assembles all parts into a single object. From that point on, downloads work like any normal file — the client gets one presigned URL (or CDN signed URL) and downloads the complete file. The original chunk boundaries are gone.
For very large files, HTTP natively supports Range requests: the client can request specific byte ranges in parallel or resume an interrupted download without starting over. The client doesn't need to know anything about the original chunk structure — S3 and HTTP handle it transparently.
At this point the object exists in S3 — but it hasn't been scanned, stripped of metadata, or validated. It is not yet safe to serve to users.
Processing Pipeline — Two-Bucket Strategy
Raw user uploads must never reach users directly. An uploaded object could contain malware, unstripped GPS metadata, illegal content, or simply be corrupt. The two-bucket pattern enforces a processing gate between upload and serving.
Architecture:
[Client]
↓ (presigned URL upload)
S3 Raw Bucket ← private, no public access, no CDN
↓ (s3:ObjectCreated event)
SQS / Lambda trigger
↓
Async Processing Workers:
├── Virus / malware scan ← reject and quarantine if infected
├── EXIF strip ← remove GPS coords, device ID, timestamps
├── Image validation + resize ← reject corrupt files, generate thumbnails
├── ML content moderation ← detect NSFW, violence, illegal content
└── Perceptual hash (dedup) ← detect re-uploaded duplicate content
↓ (only on pass)
S3 Processed Bucket ← CDN-accessible, public reads allowed
↓
CloudFront CDN → Users
At this point:
- The object has passed malware scanning, EXIF stripping, validation, and content moderation
- It has been copied to the processed bucket
- It is accessible via CloudFront CDN
- Users will only ever receive CDN URLs — never raw S3 URLs
Why each processing step matters:
- Virus scan — user uploads are untrusted. Serving an infected file to other users is a serious incident.
- EXIF strip — photos contain embedded metadata: GPS coordinates (exact home address), device model, serial number, timestamp. Strip before serving to protect user privacy.
- Validation + resize — reject zero-byte files, non-images claiming to be images, oversized files. Generate thumbnail variants (128px, 512px, 1080px) so clients don't download 12MP originals for a thumbnail slot.
- ML content moderation — detect NSFW or illegal content automatically. Flag for human review or auto-reject based on confidence threshold.
- Perceptual hashing — generate a hash of the image's visual content (not file bytes). Two identical photos re-encoded as different files have the same perceptual hash. Detect re-uploads, spam, and copyrighted content.
Users never get direct S3 URLs. They get CloudFront CDN URLs pointing to the processed bucket. Direct S3 URLs bypass CDN caching, expose your bucket structure, and can't be invalidated cleanly.
Trigger processing from S3, not from client: S3 event notifications (s3:ObjectCreated) trigger the processing pipeline directly. Don't depend on the client calling your backend to say "I finished uploading" — the client can crash, lose connectivity, or be malicious. If the S3 event triggers processing, you get guaranteed delivery regardless of client behaviour.
What-if: processing fails mid-pipeline?
If a worker crashes after malware scan but before EXIF strip, the file stays in the raw bucket — never promoted to the processed bucket, never visible to users. The SQS message re-queues and the worker retries from the start. Each processing step should be idempotent — re-running is safe. The file only moves to the processed bucket after all steps pass.
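The idempotency property can be sketched as a pipeline whose steps are safe to re-run: a simulated crash after the scan step leaves the file unpromoted, and the retry completes cleanly. Step names follow the diagram above; the crash flag is illustrative:

```python
# Idempotent processing pipeline sketch: retries always re-run from the
# start, and re-running completed steps is harmless.

def make_pipeline():
    state = {"scanned": False, "exif_stripped": False, "promoted": False, "runs": 0}

    def scan(s):
        s["scanned"] = True          # safe to re-run
    def strip_exif(s):
        s["exif_stripped"] = True    # safe to re-run
    def promote(s):
        s["promoted"] = True         # copy to processed bucket, only on full pass

    def run(crash_after_scan=False):
        state["runs"] += 1
        scan(state)
        if crash_after_scan:
            raise RuntimeError("worker died")  # SQS re-queues the message
        strip_exif(state)
        promote(state)

    return state, run

state, run = make_pipeline()
try:
    run(crash_after_scan=True)   # first attempt dies mid-pipeline
except RuntimeError:
    pass
assert not state["promoted"]     # file never reached the processed bucket
run()                            # retry from the start
```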
What-if: orphaned raw uploads?
A presigned URL expires after 15 minutes, but a file might be uploaded and then fail processing permanently (corrupt file that always fails validation). Apply an S3 lifecycle policy on the raw bucket: delete objects older than 24 hours that haven't been promoted. This keeps the raw bucket clean without manual intervention.
Interview answer: "Uploads go directly from the client to S3 via presigned URL — no bytes through our servers. On object creation, S3 fires an event to SQS, which triggers our async processing pipeline: malware scan, EXIF strip, resize, content moderation. Each step is idempotent — if a worker crashes it retries safely. Only on full pass does the object get copied to the processed bucket behind CloudFront. Users only ever receive CDN URLs, never raw S3 URLs. Orphaned objects in the raw bucket are cleaned up by a lifecycle policy after 24 hours."
Once a file passes all processing steps and lands in the processed bucket, it's ready to be served. The next question is how to control who can access it.
Download — Presigned URL Pattern
The upload pattern gets objects into S3. This pattern controls who can read them back out.
The problem: Your S3 bucket is private (no public access). You need to let an authorised user download their file — but you don't want to proxy the bytes through your backend, and you don't want to make the bucket public.
The solution: Generate a short-lived presigned GET URL server-side. The client uses it to fetch the object directly from S3. The URL expires after a set time — no valid URL, no access.
Flow:
1. Client requests a file
GET /files/{file_id}
Authorization: Bearer <access_token> ← your normal JWT auth
2. Backend checks permission
→ Does this user own this file? Can they access it?
→ If yes, generate a presigned S3 GET URL for the key
→ Short TTL: 5–15 minutes
→ Return { download_url } to client — do NOT return the raw S3 key
3. Client fetches the object directly from S3
GET {download_url} ← file bytes come from S3, not your servers
→ URL includes signature — S3 validates it, no auth header needed
→ After TTL expires, URL returns 403 — even if someone shares the link
Why not just make the bucket public?
- Anyone with any S3 URL can access any object — no per-user access control
- No way to revoke access to a specific file or user
- Exposes your entire bucket structure
TTL tradeoffs:
| TTL | Use case |
|---|---|
| 1–5 min | Sensitive documents, medical records, payment receipts |
| 15 min | Standard user files, photos |
| 1–24 hours | Downloads the user explicitly triggered (export, backup) |
| 7 days | Shared links the user sends to others |
Never return the S3 key to the client. Return only the presigned URL. If the client knows the key pattern (users/123/profile.jpg) they can guess other users' keys.
Presigned URLs vs CloudFront signed URLs:
| | S3 Presigned URL | CloudFront Signed URL |
|---|---|---|
| Served from | S3 directly | CDN edge node (fast, cached) |
| Use when | One-off downloads, low traffic | High-traffic files, media, anything needing CDN speed |
| Revocation | URL expires naturally | Can invalidate at CDN layer |
| Cost | S3 GET cost per request | CloudFront pricing, but fewer origin hits |
For a profile photo viewed thousands of times a day — use CloudFront signed URLs so the object is cached at the edge. For a tax document downloaded once a year — a plain S3 presigned URL is fine.
Generating a presigned URL requires knowing whether the requesting user actually has permission. That's handled by the access control layer.
File Access Control — Data Modelling
The download pattern checks "does this user have access?" before issuing a presigned URL. This section covers how you store and query those permissions efficiently.
Naive approach — sharelist embedded in file metadata:
Store a list of permitted user IDs directly on the file record:
Files table:
| fileId | ownerId | s3Key | sharedWith |
|--------|---------|-------------|---------------------|
| f1 | user1 | docs/f1.pdf | [user2, user3, ...] |
Works for small scale. Breaks quickly:
- To find all files shared with user2, you must scan every file row and inspect the list — no efficient query
- The list grows unbounded as more users are added
- Updating sharing permissions means rewriting the entire array
Normalised approach — SharedFiles mapping table:
Extract sharing into its own table. Each row is one (user, file) permission pair.
SharedFiles table:
| userId (Partition Key) | fileId (Sort Key) |
|------------------------|-------------------|
| user1 | fileId1 |
| user1 | fileId2 |
| user2 | fileId3 |
userId and fileId together form a composite primary key — the same user can have many files, the same file can be shared with many users. The file metadata no longer needs a sharelist.
Queries this enables:
| Query | How |
|---|---|
| All files shared with user2 | Query SharedFiles WHERE userId = user2 |
| All users who can access fileId1 | Query SharedFiles WHERE fileId = fileId1 |
| Does user2 have access to fileId1? | Point lookup: userId=user2, fileId=fileId1 |
| Revoke user2's access to fileId1 | Delete row: userId=user2, fileId=fileId1 |
All are efficient indexed lookups — no table scans, no array manipulation.
SQL vs DynamoDB:
| | SQL | DynamoDB |
|---|---|---|
| Primary key | Composite primary key on (userId, fileId) | userId as partition key, fileId as sort key |
| Query by user | SELECT * FROM SharedFiles WHERE userId = ? | Query on partition key — O(1) |
| Query by file | SELECT * FROM SharedFiles WHERE fileId = ? (needs index) | Requires a Global Secondary Index (GSI) on fileId |
In DynamoDB, querying by fileId (all users who can access a file) requires a GSI — partition key fileId, sort key userId. Add it if that query is needed (e.g. showing a file's collaborator list).
How it connects to the download flow:
GET /files/{fileId}
Authorization: Bearer <access_token> ← identifies userId
Backend:
1. Extract userId from JWT
2. Point lookup: SharedFiles WHERE userId=X AND fileId=Y
(or check ownerId on Files table — owner always has access)
3. If row exists → generate presigned S3 GET URL → return to client
4. If no row → 403 Forbidden
The permission check is a single indexed lookup before the presigned URL is issued. S3 itself stays fully private — it never makes access decisions, it just honours the signed URL your backend produced.
Owner access: The file owner doesn't need a row in SharedFiles. Check ownerId on the Files table separately, or insert an owner row at upload time — whichever keeps the query path simpler.
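The whole gate is a couple of lookups. A sketch with in-memory stand-ins for the Files and SharedFiles tables (status codes and the URL format are illustrative; a real implementation would sign the URL with boto3):

```python
# Permission check before issuing a presigned GET URL: owner check on
# Files, point lookup on SharedFiles, 403 otherwise.

files = {"f1": {"ownerId": "user1", "s3Key": "docs/f1.pdf"}}
shared = {("user2", "f1")}  # one row per (userId, fileId) permission pair

def get_download_url(user_id, file_id):
    file = files.get(file_id)
    if file is None:
        return 404, None
    is_owner = file["ownerId"] == user_id
    if not is_owner and (user_id, file_id) not in shared:
        return 403, None  # S3 never sees the request
    # Real code: boto3 generate_presigned_url("get_object", ..., ExpiresIn=900)
    return 200, f"https://s3.example/presigned/{file['s3Key']}"

owner_status, _ = get_download_url("user1", "f1")
shared_status, _ = get_download_url("user2", "f1")
denied_status, _ = get_download_url("user3", "f1")
```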
Interview answer: "Access control is a normalised SharedFiles table — one row per (userId, fileId) pair. Before issuing any presigned URL, we do a point lookup on that table. If no row exists, we return 403 — S3 never sees the request. The file owner is checked separately on the Files table. In DynamoDB, we use userId as the partition key and fileId as the sort key, with a GSI on fileId for listing a file's collaborators."
Access control determines who can request a URL. Security layers determine how the file itself is protected in storage and transit.
File Security
Four layers that work together to keep files secure.
Layer 1 — Encryption in transit (HTTPS)
All communication between client and server uses HTTPS (TLS). Data is encrypted in the network pipe — an attacker intercepting traffic sees only ciphertext. This applies to every request: login, presigned URL generation, and the direct S3 upload/download.
This is a baseline requirement — not a design decision. Every modern file storage system uses it.
Layer 2 — Encryption at rest (S3 server-side encryption)
Objects stored in S3 are encrypted at rest using AES-256. When an object is uploaded, S3 generates a unique encryption key for it, encrypts the object with that key, and stores the key separately. If an attacker gains physical access to the storage hardware, they get encrypted bytes with no key — the file is unreadable.
S3 offers three options:
| Option | Who manages the key | Use when |
|---|---|---|
| SSE-S3 | AWS manages everything | Default — no operational overhead |
| SSE-KMS | AWS KMS holds the key, you control access policy | Audit trail needed, fine-grained key access control |
| SSE-C | You provide the key on every request | Regulatory requirement to hold your own keys |
For most systems SSE-S3 is sufficient. SSE-KMS adds a CloudTrail audit log of every decryption — useful for compliance. SSE-C means you manage key rotation and storage yourself.
Layer 3 — Access control (who can request a URL)
The SharedFiles table (covered above) is your ACL. Before generating any presigned URL, check that the requesting user has a row in SharedFiles for that file (or is the owner). If not — 403, no URL issued.
This is the gate. Layers 1 and 2 protect the bytes in storage and transit. This layer controls who can even get a URL to start.
Layer 4 — Signed URLs: TTL, bearer token risk, and higher security options
A signed URL is a bearer token — anyone who holds a valid, unexpired URL can download the file. No auth header required. S3 doesn't know or care who is making the request; it only checks that the URL signature is valid and the TTL hasn't expired.
This means: if an authorised user intentionally or accidentally shares a presigned URL (posts it publicly, sends it to the wrong person), that URL works for anyone until it expires. A short TTL (5–15 min) limits the exposure window but doesn't fully prevent sharing.
How CloudFront signed URL validation works:
- Your backend generates a URL signed with your private key — incorporating the URL path, expiration timestamp, and optionally extra restrictions
- The signed URL is returned to the authorised user
- When the user (or anyone with the URL) hits CloudFront, CloudFront verifies the signature using your registered public key
- CloudFront checks: valid signature? not expired? restrictions match? If yes → serve content. If no → 403
The private key never leaves your backend. CloudFront only holds the public key. This means only your backend can generate valid signed URLs — S3 or CloudFront cannot be fooled with a forged URL.
Hardening options for higher-security scenarios:
| Option | How it works | Use when |
|---|---|---|
| Short TTL | URL expires in 5–15 min | Default — limits accidental sharing window |
| IP binding | Signed URL includes the requester's IP — only valid from that IP | Corporate internal tools where users have fixed IPs |
| Auth cookie alongside URL | Require a valid session cookie in addition to the signed URL | Streaming media — prevents URL sharing across users |
| One-time URLs | Backend marks URL as used after first access — subsequent requests rejected | Highly sensitive documents, payment receipts |
IP binding and auth cookies are CloudFront features. One-time URLs require your own tracking layer (Redis key marked used on first hit).
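One-time URLs need only a used-marker keyed by URL id. A sketch with an in-memory set standing in for Redis (in a real deployment the mark would be an atomic Redis SETNX with a TTL matching the URL's expiry):

```python
# One-time URL redemption: first hit marks the id used, later hits rejected.

used = set()

def redeem(url_id):
    if url_id in used:
        return False  # already consumed, reject with 403
    used.add(url_id)  # atomic SETNX in a real deployment
    return True

first = redeem("doc-abc-0001")
second = redeem("doc-abc-0001")
```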
Interview answer: "Security is four layers. HTTPS encrypts everything in transit — baseline, non-negotiable. S3 server-side encryption (SSE-S3 by default, SSE-KMS for audit trails) protects objects at rest. The SharedFiles table is our ACL — no presigned URL is issued without a permission check. Signed URLs are bearer tokens, so we keep TTLs short (5–15 min). For higher-security use cases we use CloudFront signed URLs with IP binding or require an auth cookie alongside the URL."
With uploads, processing, delivery, and security covered, the final challenge is keeping files consistent across multiple devices.
File Sync Across Devices
How files stay consistent across a user's devices — the core problem behind Dropbox, Google Drive, and iCloud. The remote server is the source of truth. Every device syncs to and from it.
There are two directions to handle:
Device A ──→ Remote Server ──→ Device B
(local change uploaded) (change pushed/pulled down)
Local → Remote (upload on change)
A client-side sync agent runs in the background on the device. It monitors the local sync folder for changes using OS-specific file system events:
- Windows: FileSystemWatcher
- macOS: FSEvents
- Linux: inotify
When a change is detected the agent:
- Queues the modified file locally (so changes survive a crash or network drop)
- Uploads the file to the server via the presigned URL upload pattern
- Sends updated metadata to the backend (filename, size, updatedAt, checksum)
Conflict resolution — last write wins: If two devices edit the same file and both upload, the server keeps whichever write arrived last (based on updatedAt timestamp). The earlier write is overwritten. Simple and predictable. More sophisticated systems (Google Docs) use operational transforms to merge concurrent edits — that is out of scope here.
The tradeoff is silent data loss — if two users edit the same file concurrently, the earlier write is permanently overwritten with no warning and no recovery path. This is acceptable for many use cases but should be an explicit product decision, not an accidental default.
Chunking for large files: Don't upload the whole file on every change. Split the file into chunks, compute a fingerprint per chunk, and only upload chunks that changed. This is delta sync — a 1GB file with a small edit uploads one or two chunks, not 1GB. See Content-Defined Chunking below for why fixed-size chunks don't work well for this.
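Delta sync in miniature: hash chunks on both sides and upload only the indices that differ. This sketch uses fixed-size chunks for simplicity, which is exactly the limitation the Content-Defined Chunking discussion addresses:

```python
# Delta sync sketch: upload only chunks whose fingerprints changed.

import hashlib

def chunk_hashes(data, size):
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def changed_chunks(local, remote_hashes, size):
    local_hashes = chunk_hashes(local, size)
    return [i for i, h in enumerate(local_hashes)
            if i >= len(remote_hashes) or remote_hashes[i] != h]

original = b"A" * 8 + b"B" * 8 + b"C" * 8
edited   = b"A" * 8 + b"X" * 8 + b"C" * 8   # small edit in the middle

server_has = chunk_hashes(original, 8)       # what the server already stores
to_upload = changed_chunks(edited, server_has, 8)  # only the edited chunk
```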
Remote → Local (receiving changes)
Each client needs to know when another device (or another user) changes a file so it can pull the update down. Two approaches:
Option 1 — Polling:
Client periodically calls GET /files/changes?since={lastSyncTimestamp}. Server queries the DB for any files this user has access to where updatedAt > lastSyncTimestamp and returns the diff.
- Simple to implement
- Guaranteed to catch every change (stateless, timestamp-based)
- Slow to detect changes (latency = polling interval)
- Wastes bandwidth when nothing has changed
Option 2 — WebSocket or SSE: Server maintains a persistent connection per device/session and pushes a notification the moment a file changes. Client receives the event and pulls the updated file immediately.
- Near real-time updates
- No wasted requests when nothing changes
- More complex — connections drop, messages can be missed, server must track open connections
Hybrid approach (best of both):
One WebSocket (or SSE) connection per device session for real-time push. Plus periodic polling as a safety net to catch anything missed during a dropped connection.
[Device]
│
├── WebSocket connection to server (always open)
│ Server pushes: { fileId, updatedAt, changeType }
│ Device pulls the changed file immediately on receipt
│
└── Polling every few minutes as fallback
GET /files/changes?since={lastSyncTimestamp}
Catches any events missed during WebSocket downtime
The polling interval is infrequent (every 2–5 minutes) — it's a safety net, not the primary channel. The WebSocket handles the real-time feel; polling guarantees eventual consistency.
updatedAt as the sync cursor: Every file record has an updatedAt timestamp. The client stores the timestamp of its last successful sync. On each poll it sends that timestamp and gets back only the files that changed since then — not the full file list.
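Both halves of that cursor exchange can be sketched with a hypothetical Change record standing in for the DB row:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical change-feed entry: one row per file the user can access.
class Change {
    String fileId;
    Instant updatedAt;
    Change(String fileId, Instant updatedAt) {
        this.fileId = fileId;
        this.updatedAt = updatedAt;
    }
}

class ChangeFeed {
    // Server side of GET /files/changes?since=... — return only entries
    // modified strictly after the client's last sync cursor.
    static List<Change> since(List<Change> all, Instant cursor) {
        List<Change> diff = new ArrayList<>();
        for (Change c : all) {
            if (c.updatedAt.isAfter(cursor)) diff.add(c);
        }
        return diff;
    }

    // Client side: after applying the diff, advance the cursor to the
    // newest updatedAt seen, so the next poll starts from there.
    static Instant advanceCursor(Instant cursor, List<Change> diff) {
        Instant next = cursor;
        for (Change c : diff) {
            if (c.updatedAt.isAfter(next)) next = c.updatedAt;
        }
        return next;
    }
}
```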
Full sync flow — end to end:
[Device A edits file.txt]
│
├── FSEvents / FileSystemWatcher detects change
├── Sync agent queues file locally
├── POST /uploads → gets presigned S3 URL
├── Uploads file directly to S3
├── PATCH /files/{fileId} → updates metadata (updatedAt, checksum, s3Key)
│
▼
[Server stores new metadata in DB]
│
├── Looks up all devices/sessions for this user (and shared users)
├── Pushes change notification over WebSocket: { fileId, updatedAt }
│
▼
[Device B receives WebSocket event]
│
├── Sees fileId it has locally
├── Requests presigned S3 GET URL from backend
├── Downloads updated file directly from S3
└── Replaces local copy, updates lastSyncTimestamp
Interview answer: "The sync agent monitors the local folder using OS file system events (FSEvents on macOS, FileSystemWatcher on Windows). On change, it queues the file locally and uploads directly to S3 via presigned URL — never through our backend. Each device holds one WebSocket connection; the server pushes change notifications in real time. Clients also poll every few minutes as a reliability fallback to catch anything missed during a dropped connection. Conflict resolution is last-write-wins on updatedAt — simple and predictable, but a deliberate product tradeoff since concurrent edits silently overwrite earlier writes. For large files we use CDC chunking and only upload changed chunks."
Upload, Download, and Sync Optimizations
How to make each of the three operations as fast as possible, beyond the basics.
| Operation | Already covered | Additional optimizations |
|---|---|---|
| Download | CDN caches file closer to user | — |
| Upload | Chunking + parallel parts maximize bandwidth | Compression, CDC |
| Sync | Only upload changed chunks (delta sync) | CDC makes delta sync actually work |
Content-Defined Chunking (CDC) — making delta sync effective
Fixed-size chunking (e.g. every 5MB) has a fatal flaw for delta sync: if you insert a single byte near the beginning of a file, every chunk boundary after that point shifts. All subsequent chunks produce different fingerprints — even though their content barely changed. Delta sync detects them all as "changed" and re-uploads them. For a 1GB file with a one-byte edit, you could end up re-uploading nearly the whole file.
Content-Defined Chunking (CDC) fixes this by deriving chunk boundaries from the file's content rather than its byte position. A rolling hash (classically Rabin fingerprinting) scans the file byte by byte and declares a boundary whenever the hash value matches a chosen pattern (e.g. its lowest N bits are all zero). Because boundaries are anchored to content, a small edit only affects the chunks immediately surrounding the change — the vast majority of chunks remain identical and are skipped.
Fixed-size chunking — insert 1 byte near start:
Before: [chunk1][chunk2][chunk3][chunk4][chunk5]
After: [chunk1'][chunk2'][chunk3'][chunk4'][chunk5'] ← all shifted, all "changed"
CDC — insert 1 byte near start:
Before: [chunk1][chunk2][chunk3][chunk4][chunk5]
After: [chunk1'][chunk2][chunk3][chunk4][chunk5] ← only chunk1 affected
This is how Dropbox achieves efficient delta sync in practice. Without CDC, chunking is still useful for parallel uploads and resumability — but it doesn't give you true delta sync.
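A toy chunker makes the re-sync behavior concrete. This sketch uses a simple polynomial rolling hash over a 16-byte window instead of true Rabin fingerprinting, and omits the min/max chunk-size limits a production chunker would add:

```java
import java.util.ArrayList;
import java.util.List;

// Toy content-defined chunker: a rolling hash over a 16-byte sliding window
// declares a boundary whenever the low 6 bits of the hash are zero
// (~64-byte average chunks). Illustrative only — real systems use Rabin
// fingerprinting plus minimum and maximum chunk sizes.
class Chunker {
    static final int WINDOW = 16;
    static final long BOUNDARY_MASK = 0x3F;

    static List<String> chunks(byte[] data) {
        List<String> out = new ArrayList<>();
        long pow = 1;
        for (int i = 0; i < WINDOW; i++) pow *= 31;   // 31^WINDOW
        long hash = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            hash = hash * 31 + (data[i] & 0xFF);
            if (i >= WINDOW) hash -= pow * (data[i - WINDOW] & 0xFF); // drop outgoing byte
            if ((hash & BOUNDARY_MASK) == 0 || i == data.length - 1) {
                out.add(fingerprint(data, start, i + 1));
                start = i + 1;
            }
        }
        return out;
    }

    // Cheap per-chunk fingerprint (a real system would use SHA-256).
    static String fingerprint(byte[] data, int from, int to) {
        long h = 1125899906842597L;
        for (int i = from; i < to; i++) h = 31 * h + data[i];
        return Long.toHexString(h);
    }
}
```

Inserting one byte near the start changes only the first chunk or two; every later boundary re-aligns, because each boundary decision depends only on the 16 bytes currently in the window.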
Client-side compression — fewer bytes, faster transfers
Since uploads go directly from the client to S3 (no bytes through your backend), compression must happen entirely on the client. The client compresses before uploading; S3 stores the compressed bytes as-is. On download, the client decompresses after retrieving. The backend stays completely out of the data path.
When compression is worth it:
Not all files compress well. Compression only helps if the transfer time saved outweighs the CPU time spent compressing and decompressing.
| File type | Compression ratio | Worth it? |
|---|---|---|
| Text, code, CSVs, logs | High — 5GB → ~1GB | Yes, usually |
| Office documents (docx, xlsx) | Moderate | Depends on size |
| Already-compressed media (JPEG, MP4, MP3) | Near zero — 1–3% | No |
| ZIP, RAR, already-compressed archives | Near zero | No |
A reasonable default: compress if the file is text-based (MIME type text/*, application/json, application/xml, text/csv) and larger than 1MB. Skip compression for JPEG, PNG, MP4, MP3, ZIP, and other already-compressed formats. On mobile, also skip compression if the measured upload bandwidth exceeds 10 Mbps — the CPU cost outweighs the transfer saving.
Always compress before encrypting — encryption introduces artificial randomness into the data, making it nearly incompressible. Compressing after encrypting achieves almost nothing. The correct order is always: compress → encrypt → upload.
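That default reduces to a single predicate. The thresholds and MIME list here mirror the heuristic above and are illustrative, not canonical:

```java
import java.util.Set;

// Heuristic: compress text-like files over ~1 MB; skip already-compressed
// formats; on mobile, skip when the link is fast enough that CPU cost
// outweighs the transfer saving. All thresholds are illustrative.
class CompressionPolicy {
    static final Set<String> TEXT_LIKE = Set.of(
        "application/json", "application/xml", "text/csv");
    static final long MIN_SIZE_BYTES = 1_000_000;   // ~1 MB
    static final double FAST_LINK_MBPS = 10.0;

    static boolean shouldCompress(String mimeType, long sizeBytes,
                                  boolean onMobile, double uploadMbps) {
        boolean textLike = mimeType.startsWith("text/") || TEXT_LIKE.contains(mimeType);
        if (!textLike) return false;                    // JPEG, MP4, ZIP, ...
        if (sizeBytes < MIN_SIZE_BYTES) return false;   // too small to bother
        if (onMobile && uploadMbps > FAST_LINK_MBPS) return false;
        return true;
    }
}
```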
Compression algorithms:
| Algorithm | Compression ratio | Speed | Best for |
|---|---|---|---|
| Gzip | Good | Moderate | Universal support everywhere |
| Brotli | Better than Gzip for text | Moderate | Web, modern browsers (Chrome, Firefox, Edge, Safari) |
| Zstandard (zstd) | Comparable to Gzip, tunable | Very fast | Client-side compression — fast enough to not block UX |
For a Dropbox-style system where compression runs on the client device, Zstandard is a strong default — it compresses and decompresses significantly faster than Gzip at comparable ratios, and its speed/ratio tradeoff is tunable via compression level.
Video Streaming Pipeline
How to store, process, and deliver video at scale — the core building blocks behind YouTube, Netflix, and TikTok.
The Core Problem
Video has three properties that make it fundamentally different from other content:
- Size — a 10-minute 1080p video is 1–2 GB as recorded (far larger uncompressed). You cannot serve the original file to users.
- Heterogeneous clients — a 4K TV and a 3G phone need different quality levels of the same video.
- Seek — users skip to arbitrary timestamps. You cannot stream one giant file; you need seekable chunks.
The answer to all three: transcode into multiple resolutions, split into small segments, serve from CDN.
Upload and Transcoding Pipeline
The full flow from a creator uploading a video to a viewer playing it.
[Creator Device]
│ direct upload via presigned URL (no bytes through backend)
▼
[S3 Raw Bucket] ← private, original file, no CDN
│ S3 event notification on object creation
▼
[SQS / Kafka] ← job queue, one message per upload
│ fan-out: one job per output resolution
▼
[Transcoding Workers] ← stateless fleet, auto-scales (EC2 Spot or AWS Elemental)
│ FFmpeg converts raw → multiple resolutions + formats
│ Splits each output into 6-second segments
│ Writes HLS manifest (.m3u8) per resolution
▼
[S3 Processed Bucket] ← CDN-accessible
│
▼
[CloudFront CDN] ← viewers never touch S3 directly
Why fan-out per resolution: Transcoding is CPU-bound and slow (a 1-hour video at 4K can take 30–60 minutes). Parallelizing across workers — one per resolution — cuts wall-clock time proportionally. All workers read from the same raw file in S3.
Output resolutions produced:
| Resolution | Bitrate | Target |
|---|---|---|
| 360p | ~0.5 Mbps | 3G mobile |
| 480p | ~1 Mbps | 4G mobile |
| 720p | ~2.5 Mbps | Laptop / WiFi |
| 1080p | ~5 Mbps | Desktop / Smart TV |
| 1440p / 4K | ~15 Mbps | Premium / large screen |
Interview note on storage: You store one raw file and N processed versions. At YouTube scale (500 hours uploaded per minute), this is significant — tiered storage (S3 Standard for hot, S3 Glacier for old videos) is necessary. Long-tail content (old videos rarely watched) can be archived and transcoded on-demand.
Adaptive Bitrate Streaming (ABR)
The mechanism that lets the player switch quality mid-stream without buffering or user action.
The two dominant protocols:
| | HLS (HTTP Live Streaming) | DASH (Dynamic Adaptive Streaming over HTTP) |
|---|---|---|
| Origin | Apple (2009) | MPEG standard (2012) |
| Supported by | Required on iOS; near-universal support | Chrome, Android, Smart TVs, all non-Apple |
| Manifest format | .m3u8 (M3U playlist) | .mpd (XML) |
| Segment format | .ts (MPEG-TS) or fMP4 | fMP4 |
| Segment length | 6–10 seconds typical | 2–10 seconds |
| Use when | Must support iOS natively | Non-Apple / cross-platform |
In practice: Most platforms produce both HLS and DASH from the same segments. The player picks the appropriate manifest.
How ABR works — the manifest structure:
master.m3u8 (master manifest — lists all quality levels)
│
├── 360p.m3u8 (playlist for 360p — lists all segments)
│ ├── seg_000.ts
│ ├── seg_001.ts
│ └── ...
│
├── 720p.m3u8
│ ├── seg_000.ts ← same timestamps, different quality
│ └── ...
│
└── 1080p.m3u8
└── ...
Playback sequence:
- Player fetches master manifest — gets list of all quality levels
- Player picks starting quality (based on current bandwidth estimate)
- Player fetches that quality's segment playlist, downloads segment 0, starts playing
- After each segment, player re-estimates bandwidth: if download was fast → step up quality; if slow → step down
- Quality switches happen at segment boundaries — seamless, no seek interruption
Why segments matter for seek: User jumps to 4:30 → player calculates which segment contains that timestamp → fetches only that segment from CDN. No need to download the whole video or re-establish a stream.
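The per-segment decision in steps 2–4 reduces to one lookup: pick the highest rendition whose bitrate fits under the bandwidth estimate. Bitrates mirror the resolution table above; the 0.8 safety factor is an assumption, not a standard:

```java
// Per-segment ABR quality selection: choose the highest rendition whose
// bitrate fits under the measured bandwidth, with a safety factor so one
// slow segment doesn't cause a stall.
class AbrSelector {
    static final double[] BITRATES_MBPS = {0.5, 1.0, 2.5, 5.0, 15.0}; // 360p..4K
    static final double SAFETY = 0.8;   // illustrative headroom factor

    // Returns the index into BITRATES_MBPS to use for the next segment.
    static int pickQuality(double measuredMbps) {
        int pick = 0;   // the lowest rendition is always playable
        for (int i = 0; i < BITRATES_MBPS.length; i++) {
            if (BITRATES_MBPS[i] <= measuredMbps * SAFETY) pick = i;
        }
        return pick;
    }

    // Bandwidth estimate after each segment: bits transferred / time taken.
    static double measureMbps(long segmentBytes, double downloadSeconds) {
        return (segmentBytes * 8 / 1_000_000.0) / downloadSeconds;
    }
}
```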
CDN Strategy for Video
Video is the highest-egress workload a CDN will serve. How you cache matters.
Pull CDN (default for most platforms):
- CDN edge node caches segments on first request
- Cold start: first viewer in a region causes a cache miss back to S3 (1 round trip penalty)
- Works well for long-tail content — no point pre-loading videos nobody will watch
Push CDN (for viral/live content):
- Proactively push segments to all edge nodes before viewers arrive
- Used when you know content will spike immediately (live events, major releases)
- Netflix does this for popular titles — pre-positions content at edge nodes in each region the night before a big release
Segment caching vs full-file:
- Always cache at the segment level, never the full video file
- Segments are small (1–5 MB), fit cleanly in edge cache
- Seek-driven requests are cache-efficient — only the requested segment is fetched
Popular vs long-tail — the 90/10 rule:
- Top ~10% of videos drive ~90% of traffic
- Keep popular content at edge nodes with high TTL (1–7 days)
- Long-tail content: pull CDN with aggressive cache eviction, or serve from a closer regional origin
Reducing CDN costs:
- CloudFront signed URLs for premium/paid content — prevents sharing the URL
- Origin Shield: single regional cache tier in front of S3, absorbs cache misses from all edge nodes, reduces S3 GET costs and request volume
Metadata and State
The video file pipeline handles bytes. A separate metadata service handles everything else.
[Metadata DB — PostgreSQL]
- video_id, title, description, creator_id
- status: UPLOADING → PROCESSING → READY → FAILED
- duration, resolution list, thumbnail URL
- view count, like count (denormalized counters)
[Search Index — Elasticsearch]
- title, description, tags, transcript (if captions generated)
- Synced from metadata DB via CDC
[Recommendation / Feed Service]
- Reads from separate engagement DB (views, watch time, likes)
- Not part of the video pipeline
Status transitions: The transcoding worker updates status in the metadata DB when it finishes each stage. Frontend polls or receives a webhook when status hits READY. The CDN URL is only surfaced to viewers once status = READY — never expose a partial or raw URL.
What-If Scenarios for Video
Transcoding worker crashes mid-job:
At-least-once delivery from SQS — job is re-queued after visibility timeout expires. Worker must be idempotent (re-processing same video produces same segments). Use a job-level status in DB: PENDING → IN_PROGRESS → DONE. If a worker picks up an IN_PROGRESS job older than 2× expected duration, treat it as crashed and restart.
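The stale-job check can be sketched as follows; field names and the 2× threshold mirror the description above and are illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// A worker picking up an IN_PROGRESS job older than 2× its expected
// duration treats the previous worker as crashed and restarts the job.
// Safe only because transcoding is idempotent: re-processing the same
// video produces the same segments.
class TranscodeJob {
    String videoId;
    String status;              // PENDING, IN_PROGRESS, DONE
    Instant startedAt;
    Duration expectedDuration;

    TranscodeJob(String videoId, String status, Instant startedAt, Duration expected) {
        this.videoId = videoId;
        this.status = status;
        this.startedAt = startedAt;
        this.expectedDuration = expected;
    }

    boolean isStale(Instant now) {
        if (!status.equals("IN_PROGRESS")) return false;
        Duration age = Duration.between(startedAt, now);
        return age.compareTo(expectedDuration.multipliedBy(2)) > 0;
    }
}
```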
Upload is corrupt or invalid: First stage of pipeline is a validation worker: check container format, codec, minimum duration. On failure, update status to FAILED, emit event, notify creator. Do not proceed to transcoding.
CDN cache stale after re-transcode: If a video is re-transcoded (quality fix, policy violation edit), old segments may be cached at edge nodes. Trigger a CDN invalidation on the affected path prefix. For HLS, the manifest file is the key — once the manifest is invalidated and re-fetched, players will pull new segments.
Storage costs blow up: Apply tiered storage: move videos with 0 views in 90 days to S3 Glacier. On-demand transcode from raw if a Glaciered video suddenly gets requested (rare, acceptable latency for cold content). Delete raw files after transcoding is confirmed — you can always re-transcode from the raw if needed, but raw files are 10–50× larger than processed output.
Interview one-liner: "Creator uploads directly to S3 via presigned URL. S3 triggers a transcoding job queue. Workers transcode in parallel to multiple resolutions, split into 6-second HLS segments, write to processed S3 behind CloudFront. The player fetches a manifest listing all quality levels and switches resolution per segment based on bandwidth — that's adaptive bitrate. Viewers never touch origin storage."
Financial & Payments Patterns
Domain-specific patterns for guaranteeing payment integrity, consistent state machines, and reliable event delivery to merchants.
PaymentIntent State Machine
A PaymentIntent is a high-level payment request that tracks the full lifecycle of a charge. It owns the state machine and enforces idempotency for retries. Transactions are the individual money movements it generates — one PaymentIntent can have many Transactions (retries, partial payments, refunds).
State machine:
created → authorized → captured
↘ cancelled
↘ failed
captured → refunded (partial or full)
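One way to read that diagram is as an explicit transition table — any transition not listed is rejected. The exact branch set here is an interpretation of the diagram, not a spec:

```java
import java.util.Map;
import java.util.Set;

// PaymentIntent state machine as a transition table. Rejecting unlisted
// transitions is what makes replayed or out-of-order updates safe.
class PaymentIntentStates {
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "created",    Set.of("authorized", "cancelled", "failed"),
        "authorized", Set.of("captured", "cancelled", "failed"),
        "captured",   Set.of("refunded"),
        "cancelled",  Set.of(),      // terminal
        "failed",     Set.of(),      // terminal
        "refunded",   Set.of()       // terminal
    );

    static boolean canTransition(String from, String to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```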
Why separate PaymentIntent from Transaction:
- PaymentIntent = merchant-facing, stable reference for the business operation
- Transaction = polymorphic money-movement record (charge, capture, refund, reversal, transfer)
- Separation simplifies reasoning about state, audits, and retries without diving into ledger details
Idempotency enforcement: PaymentIntent ID is the idempotency key for the charge. Retrying a payment attempt against the same PaymentIntent returns the existing result — no double charge.
Two-Phase Event Model
Production pattern (used by Stripe) for guaranteeing consistency between transaction state and downstream event consumers.
NFR triggers: Durability, consistency, financial integrity.
The two events per transaction:
Transaction Created Event — emitted when the transaction service begins processing, before the DB write completes. If emission fails → transaction enters locked/failed state, can be retried safely.
Transaction Completed Event — emitted after the DB write successfully completes. If emission fails → transaction enters locked state, further updates blocked until completion event is successfully emitted.
Why two phases enable smart retries:
- On retry, compare the "created" event data with actual DB state
- If DB write already occurred → only re-emit the completion event (no full transaction retry needed)
- Prevents double charges when the network drops between the DB write and the response
Production reality: Most missing "completed" events are due to external payment network timeouts before the DB write — not emission failures after successful writes. The two-phase model handles both cases cleanly.
Webhook Delivery
Server-to-server HTTP POST notifications sent to a merchant's configured endpoint when payment state changes. Different from WebSockets/SSE — webhooks are server-to-server, not server-to-browser.
NFR triggers: Availability (reliable delivery), fault tolerance (retry on failure).
How it works:
- CDC streams DB changes to Kafka
- Webhook service consumes events, checks if merchant has subscribed to that event type
- Signs payload with shared secret (HMAC) so merchant can verify authenticity
- POSTs to merchant's configured HTTPS endpoint
- Records delivery attempt with status
Retry strategy — exponential backoff:
Attempt 1: immediate
Attempt 2: 5 seconds
Attempt 3: 25 seconds
Attempt 4: 125 seconds
...up to ~1 hour max
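That schedule is a plain geometric progression — the delay before attempt n is 5^(n−1) seconds, capped at an hour. A minimal sketch:

```java
// Exponential backoff matching the schedule above: attempt 1 is immediate,
// then 5s, 25s, 125s, ... capped at one hour.
class WebhookBackoff {
    static final long CAP_SECONDS = 3600;

    static long delaySeconds(int attempt) {
        if (attempt <= 1) return 0;            // first attempt is immediate
        long delay = 1;
        for (int i = 1; i < attempt; i++) {
            delay *= 5;
            if (delay >= CAP_SECONDS) return CAP_SECONDS;
        }
        return delay;
    }
}
```

Production schedulers usually add random jitter on top so that a burst of failures doesn't retry in lockstep.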
Merchant side: Must return 2xx to acknowledge receipt. Any non-2xx → webhook service treats as failure and retries. Merchant must handle idempotency (same event may be delivered more than once on retry).
Payload signing: Webhook payload signed with HMAC using shared secret. Merchant verifies signature before processing — prevents spoofed webhook events.
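A minimal sketch of both sides of the signature check using the JDK's HmacSHA256. Header names and encoding details (hex vs base64, timestamp binding) vary by provider and are omitted:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Provider signs the raw payload bytes with the shared secret; the
// merchant recomputes the signature and compares before trusting the event.
class WebhookSigner {
    static byte[] sign(String payload, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);  // HmacSHA256 ships with every JDK
        }
    }

    // Merchant side: MessageDigest.isEqual is constant-time, which avoids
    // leaking how many leading bytes of a forged signature were correct.
    static boolean verify(String payload, String secret, byte[] signature) {
        return MessageDigest.isEqual(sign(payload, secret), signature);
    }
}
```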
Production considerations: Idempotency keys on delivery, webhook logs for debugging, merchant dashboard to replay failed webhooks, payload versioning, adaptive rate limiting to avoid overwhelming slow merchant endpoints.
Reconciliation
Periodically comparing your internal transaction records against the payment processor's records to detect discrepancies.
NFR triggers: Durability, consistency (payments, banking).
Why it's needed: Networks fail, bugs happen, systems crash mid-transaction. Your DB might say a payment succeeded while the processor says it failed, or vice versa.
How it works:
- At end of day (or hourly), fetch all transactions from payment processor API.
- Compare against your internal records row by row.
- Flag mismatches for investigation or auto-correction.
- Retry failed-but-charged transactions. Refund charged-but-not-delivered.
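The row-by-row comparison in step 2 is essentially a full outer join on transaction id, sketched here over in-memory maps of id → status (the record shape is illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Flag every transaction where the two sides disagree: different status,
// missing internally, or missing at the processor.
class Reconciler {
    static List<String> findMismatches(Map<String, String> internal,
                                       Map<String, String> processor) {
        List<String> flagged = new ArrayList<>();
        Map<String, String> all = new HashMap<>(internal);
        processor.forEach(all::putIfAbsent);        // union of both id sets
        for (String txId : all.keySet()) {
            String ours = internal.get(txId);       // null = missing on our side
            String theirs = processor.get(txId);    // null = missing at processor
            if (ours == null || theirs == null || !ours.equals(theirs)) {
                flagged.add(txId);
            }
        }
        return flagged;
    }
}
```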
Real example: Stripe, PayPal, and every bank runs nightly reconciliation jobs. Airline booking systems reconcile seat inventory against payment records after every batch.
Saga Pattern (Distributed Transactions)
Manages a multi-step transaction across services by breaking it into a sequence of local transactions, each with a compensating rollback action if a later step fails.
NFR triggers: Consistency across microservices, no distributed locks, durability in distributed systems.
Why not two-phase commit (2PC): 2PC locks resources across all participants until all agree — too slow and brittle at scale. One slow service blocks the entire transaction. Saga is async and each step is independent.
The Two Styles: Choreography vs Orchestration
These are the two ways to implement Saga. The choice determines whether you write the coordination logic yourself or hand it to a framework.
Choreography — services react to events, no central coordinator:
Each service listens to Kafka, does its job, publishes the next event.
No one is "in charge" — the workflow emerges from the event chain.
Order Service → publishes OrderCreated
Inventory Service → listens, reserves stock → publishes StockReserved
Payment Service → listens, charges card → publishes PaymentCharged
Shipping Svc → listens, creates shipment → publishes ShipmentCreated
If Payment fails:
Payment Service → publishes PaymentFailed
Inventory Svc → listens, releases stock → publishes StockReleased
Order Service → listens, cancels order → publishes OrderCancelled
You write this yourself — each service just has Kafka consumers and producers. No framework needed.
Orchestration — a central coordinator tells each service what to do:
Saga Orchestrator (your coordinator service or Temporal workflow):
Step 1: call Inventory Service → reserve stock
Step 2: call Payment Service → charge card
Step 3: call Shipping Service → create shipment
If Step 2 fails:
compensate Step 1: call Inventory Service → release stock
mark order as failed
The orchestrator owns the workflow state. Services are dumb — they just do what they're told and report back.
Do You Write the Code Yourself?
Choreography — yes, you write it yourself. Each service has a Kafka consumer for incoming events and a producer for outgoing events. The "saga" is just the sum of all these listeners.
// Inventory Service — choreography style
@KafkaListener(topics = "order-created")
public void onOrderCreated(OrderCreatedEvent event) {
boolean reserved = inventory.reserve(event.getSkuId(), event.getQty());
if (reserved) {
publisher.publish("stock-reserved",
new StockReservedEvent(event.getOrderId(), event.getSkuId()));
} else {
publisher.publish("stock-reservation-failed",
new StockFailedEvent(event.getOrderId(), "OUT_OF_STOCK"));
}
}
@KafkaListener(topics = "payment-failed")
public void onPaymentFailed(PaymentFailedEvent event) {
// compensating transaction — undo the reservation
inventory.release(event.getOrderId());
publisher.publish("stock-released",
new StockReleasedEvent(event.getOrderId()));
}
No framework. Just Kafka listeners and a clear rule: every service that does work must also listen for failure events and undo its work.
Orchestration — you either write a coordinator or use a framework.
Hand-rolled orchestrator:
// Your own coordinator service — tracks state in a DB table
public void runOrderSaga(Order order) {
SagaState state = sagaRepo.create(order.getId(), STARTED);
try {
inventoryClient.reserve(order);
state.update(INVENTORY_RESERVED);
paymentClient.charge(order);
state.update(PAYMENT_CHARGED);
shippingClient.createShipment(order);
state.update(COMPLETED);
} catch (PaymentException e) {
inventoryClient.release(order); // compensate
state.update(FAILED);
} catch (ShippingException e) {
paymentClient.refund(order); // compensate
inventoryClient.release(order); // compensate
state.update(FAILED);
}
}
This works but has a fatal flaw: if the coordinator crashes between steps, you have no idea where it left off. You need to rebuild state from the DB and figure out what to retry. This is exactly the problem Temporal solves — it persists every step automatically.
Compensating Transactions — What They Actually Are
A compensating transaction is not a rollback. It's a new forward action that undoes the business effect of a previous step.
Step done: Compensation:
───────────────────────────────────────────────────────
Reserve inventory Release inventory
Charge payment Issue refund
Create shipment Cancel shipment + notify carrier
Send confirmation email Send cancellation email
Notify driver (Uber) Un-notify driver, reassign ride
Key constraint: Compensations must be idempotent. The coordinator may call them more than once (crash + retry). release(orderId) called twice must be safe — second call is a no-op if stock is already released.
Failure Modes — What Can Go Wrong
1. Compensating transaction itself fails:
Payment charged → Shipping fails → try to refund → refund call times out
Fix: retry with backoff. Refunds are idempotent (refund twice = still one refund).
If retries exhaust, write to a dead-letter topic for manual review.
Never silently swallow a failed compensation — money is involved.
2. Service crashes mid-saga (choreography):
Inventory reserved → service crashes → PaymentCharged event never heard
Fix: outbox pattern on every step. Event is in the outbox before the crash,
worker publishes it on restart. Saga continues where it left off.
3. Out-of-order events:
StockReleased arrives before StockReserved (network reordering)
Fix: idempotency + state machine. Each service tracks its current state per orderId.
If StockReleased arrives and state is not RESERVED, ignore it.
4. Partial visibility to users:
Saga is mid-flight — inventory reserved but payment not yet charged.
User queries order status — what do you show?
Fix: show intermediate states explicitly. "Order processing — payment pending."
Never show CONFIRMED until the saga fully completes.
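The fix for failure mode 3 can be sketched as a per-order state guard. Event names mirror the choreography example above; the class itself is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Each service tracks its current state per orderId and ignores events
// that don't apply to that state — making reordered and duplicated
// events harmless.
class InventoryStateGuard {
    enum State { NONE, RESERVED, RELEASED }

    final Map<String, State> byOrder = new HashMap<>();

    // Returns true only when the event was actually applied.
    boolean onStockReserved(String orderId) {
        if (byOrder.getOrDefault(orderId, State.NONE) != State.NONE) return false;
        byOrder.put(orderId, State.RESERVED);
        return true;
    }

    boolean onStockReleased(String orderId) {
        // StockReleased before StockReserved (network reordering) → ignore.
        if (byOrder.getOrDefault(orderId, State.NONE) != State.RESERVED) return false;
        byOrder.put(orderId, State.RELEASED);
        return true;
    }
}
```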
Choreography vs Orchestration — When to Use Each
| | Choreography | Orchestration |
|---|---|---|
| Who coordinates | Nobody — services react to events | Central coordinator or framework |
| Code ownership | Each team owns their service's listeners | One team owns the orchestrator |
| Visibility | Hard — workflow is implicit in event chains | Easy — workflow is explicit in one place |
| Coupling | Services coupled to event schemas | Services coupled to orchestrator API |
| Failure tracing | Hard — must correlate events across services | Easy — orchestrator has full state |
| Best for | Simple 2–3 step workflows, small teams | Complex workflows, many services, long-running |
Rule of thumb: Start with choreography. Switch to orchestration (or Temporal) when you find yourself building a "where is this saga at?" dashboard.
Real Systems — Choreography vs Orchestration
Choreography fits when the flow is a linear chain and each service only needs to know what happened, not what to do next.
| System | Flow | Why choreography fits |
|---|---|---|
| E-commerce order pipeline (simple) | Order → Inventory → Payment → Shipping → Email | Linear chain, no branching, each service reacts to the previous event |
| LinkedIn / Twitter feed fanout | User posts → Fanout → Notification → Analytics | Each service listens independently, no one needs to know the others exist |
| IoT sensor pipeline | Sensor event → Validate → Aggregate → Store → Alert | High throughput, simple chain, no compensation needed |
| Audit / event logging | Action → audit log, analytics, search index each react | Classic pub/sub — multiple independent consumers, no coordination |
| Simple notification system | User action → Email service, Push service, SMS service each listen | No coordination needed between notification channels |
Orchestration fits when the flow has branching, timeouts, human delays, or you need exact visibility into where the workflow is.
| System | Flow | Why orchestration fits |
|---|---|---|
| Uber ride matching | Match driver → 10s timeout → try next → try next → give up | Timeout + fallback branching — Temporal was built for exactly this |
| Stripe payment processing | Charge → fraud check → settle → webhook | Exact state required at all times; dispute resolution has human delays (bank response) |
| Amazon order fulfillment | Reserve → pick → pack → ship → notify | Out-of-stock branches, address invalid branches, carrier failure — too complex for choreography |
| Airbnb booking | Reserve dates → charge → notify host → wait for host approval → confirm guest | Human delay up to 24h — choreography cannot model waiting for a human |
| DoorDash / food delivery | Place order → restaurant accepts (timeout) → assign driver (timeout) → pickup → delivery | Multiple human-in-the-loop timeouts at every step |
| Bank loan approval | Credit check → risk scoring → manual underwriter review → approval → disbursement | Days-long workflow, human steps, full audit trail required |
| Travel booking (flight + hotel + car) | All three must succeed or all roll back | Orchestrator coordinates all-or-nothing across three independent services |
| Healthcare claim processing | Submit → eligibility → prior auth → adjudication → payment | Complex branching at every step, weeks-long timeline |
The signal in an interview:
| Interviewer says... | Reach for... |
|---|---|
| "notify multiple services when X happens" | Choreography |
| "user has N seconds to do Y, then try Z, then give up" | Orchestration + Temporal |
| "all steps must succeed or all roll back" | Orchestration |
| "we need to know exactly what state this order is in" | Orchestration |
Frameworks Available
You do NOT have to hand-roll Saga in production. Mature frameworks exist.
Temporal (orchestration — most powerful):
// Workflow definition — Temporal handles crash recovery, retries, state
@WorkflowInterface
public interface OrderSaga {
@WorkflowMethod
void processOrder(Order order);
}
public class OrderSagaImpl implements OrderSaga {
private final InventoryActivities inventory = Workflow.newActivityStub(...);
private final PaymentActivities payment = Workflow.newActivityStub(...);
private final ShippingActivities shipping = Workflow.newActivityStub(...);
@Override
public void processOrder(Order order) {
try {
inventory.reserve(order);
payment.charge(order);
shipping.createShipment(order);
} catch (ActivityFailure e) {
inventory.release(order); // compensate
payment.refund(order); // compensate
throw e;
}
}
}
// Temporal persists every step. Crash between reserve and charge → resumes at charge.
// No lost saga state. No hand-rolled state table.
Used by: Uber, Netflix, Stripe, Airbnb.
AWS Step Functions (orchestration — managed, AWS-native):
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:ReserveInventory",
"Next": "ChargePayment",
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CompensateInventory" }]
},
"ChargePayment": { ... },
"CompensateInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:ReleaseInventory",
"Next": "FailOrder"
}
}
}
No code for the coordinator — it's JSON/YAML. AWS manages state persistence. Used by: Amazon order fulfillment, DoorDash.
Axon Framework (choreography + orchestration — Java-native):
- Built for Java/Spring. Handles both Saga styles via annotations: @SagaEventHandler for choreography, @CommandHandler for orchestration.
- Built-in saga state persistence to any DB.
- Used in enterprise Java shops (banking, insurance).
@Saga
public class OrderSaga {
@SagaEventHandler(associationProperty = "orderId")
public void on(OrderCreatedEvent event) {
commandGateway.send(new ReserveInventoryCommand(event.getOrderId()));
}
@SagaEventHandler(associationProperty = "orderId")
public void on(InventoryReservationFailedEvent event) {
commandGateway.send(new CancelOrderCommand(event.getOrderId()));
SagaLifecycle.end();
}
}
Eventuate Tram (choreography — lightweight):
- Open source, Spring-native.
- Handles Saga state machine, compensating transactions, and the outbox pattern automatically.
- Lower learning curve than Temporal for simple flows.
- Used in teams that want Saga without the Temporal operational overhead.
Conductor (orchestration — Netflix OSS):
- Netflix's open-source workflow engine.
- Define workflows as JSON, workers are microservices that poll for tasks.
- Good fit if you're already Netflix-stack (Java, Spring).
Framework decision:
Simple 2–3 step saga, small team → hand-roll choreography with Kafka
Complex workflow, crash recovery → Temporal
AWS-native, Lambda-based → Step Functions
Java/Spring enterprise shop → Axon Framework
Want outbox handled automatically → Eventuate Tram
Real examples:
- Uber — built Cadence (Temporal's predecessor) specifically because hand-rolled saga coordination broke under ride-matching complexity.
- Amazon — Step Functions orchestrate order fulfillment across Warehouse, Payment, Shipping, and Notification services.
- Stripe — Temporal for payment retry and dispute workflows where human delays (bank responses, manual review) make duration unpredictable.
Outbox Pattern
Guarantees that a database write and a Kafka publish always stay in sync — even if the service crashes between them.
NFR triggers: Consistency between DB state and event stream, at-least-once delivery, audit trail durability.
The problem — two systems, no shared transaction:
Without outbox:
Step 1: Write payment to ledger DB ← succeeds
Step 2: Publish to Kafka ← service crashes here
Payment recorded. Order Service never got the event. Order stays in limbo — user paid but order not confirmed.
Swap the order?
Step 1: Publish to Kafka ← succeeds
Step 2: Write payment to ledger DB ← service crashes here
Opposite problem — Order Service confirmed the order, no payment record exists. Worse: confirmed order, no audit trail.
Root cause: you cannot make a DB write and a Kafka publish atomic. They are two completely different systems.
The fix — convert the Kafka publish into a DB write:
Single DB transaction:
1. Write payment to ledger ← your real data
2. Write event to outbox table ← a "to-do" note for the worker
Both in the same transaction → both succeed or both fail → always consistent.
Then the worker handles Kafka separately:
Worker (runs every 5s):
Read unpublished outbox rows → publish to Kafka → mark published
If the worker crashes mid-publish, the outbox row stays at published=false and the next run retries. This is at-least-once delivery — you may publish twice, but you will always publish. Consumers must handle duplicates (idempotency key).
The full guarantee:
DB write succeeded → Kafka will eventually get the message (even after retries)
DB write failed → Kafka will never get the message (outbox row never created)
Payment processing sequence — correct order:
1. Receive payment request
2. Check idempotency key ← already processed? return cached result
3. Call payment gateway ← Visa / Mastercard / PayPal
4. Gateway returns APPROVED
5. Write to ledger DB ← record the outcome
6. Write to outbox table ← same transaction as step 5
7. Commit
8. Return success to caller
The ledger write happens AFTER the gateway responds — not before.
The gateway is the source of truth for whether money actually moved.
What if the service crashes between gateway response and ledger write (steps 4–5)?
The idempotency key rescues you. Next retry:
User retries with same idempotency key
→ Idempotency check: key not found (crashed before writing)
→ Call gateway again with same reference ID
→ Gateway returns "already processed, here's the result" ← gateways support this
→ Write to ledger + outbox, return success
All three gateway outcomes and why each matters:
- APPROVED → write AUTHORIZED to ledger + outbox atomically → cache idempotency key → return success
- DECLINED → write DECLINED to ledger (audit trail) → cache key (retry returns same result, no gateway hit) → return declined
- TIMEOUT → write PENDING to ledger → return error "do not retry with new key". A background reconciliation job queries the gateway by reference ID to resolve it later.
Never tell a user to retry on timeout with a new key — they may be charged twice.
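The three outcomes above can be sketched as a small decision function. This is an illustrative model, not a real gateway SDK — `GatewayResult`, `LedgerStatus`, and `Decision` are hypothetical names:

```java
// Sketch: map gateway outcomes to ledger status + idempotency/outbox behavior.
// All type names here are illustrative, not part of any real payment SDK.
public class GatewayOutcomeHandler {

    public enum GatewayResult { APPROVED, DECLINED, TIMEOUT }
    public enum LedgerStatus  { AUTHORIZED, DECLINED, PENDING }

    public static class Decision {
        public final LedgerStatus ledgerStatus;
        public final boolean cacheIdempotencyKey; // safe to short-circuit retries?
        public final boolean writeOutboxEvent;    // notify downstream consumers?
        Decision(LedgerStatus status, boolean cacheKey, boolean outbox) {
            this.ledgerStatus = status;
            this.cacheIdempotencyKey = cacheKey;
            this.writeOutboxEvent = outbox;
        }
    }

    public static Decision handle(GatewayResult result) {
        switch (result) {
            case APPROVED: // money moved — record it, publish it, cache the key
                return new Decision(LedgerStatus.AUTHORIZED, true, true);
            case DECLINED: // no money moved; keep the audit trail, cache the key
                return new Decision(LedgerStatus.DECLINED, true, false);
            default:       // TIMEOUT — outcome unknown; never cache the key,
                           // leave PENDING for the reconciliation job
                return new Decision(LedgerStatus.PENDING, false, false);
        }
    }
}
```

The key asymmetry: TIMEOUT is the only outcome where the idempotency key is not cached, because caching it would freeze an unknown result as if it were final.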
Implementation — polling variant (no CDC required):
```java
// Outbox table schema:
// id, topic, partition_key, payload, published, created_at

// Write side — same transaction as your domain write:
@Transactional
public void processPayment(PaymentRequest req) {
    ledger.save(new LedgerEntry(req.getPaymentId(), AUTHORIZED, req.getAmount()));
    outbox.save(new OutboxEvent("payment-events", req.getPaymentId(),
        toJson(new PaymentAuthorizedEvent(req.getPaymentId()))));
}

// Worker — polls every 5 seconds:
@Scheduled(fixedDelay = 5000)
public void publishOutbox() {
    List<OutboxEvent> pending = outbox.findByPublishedFalseOrderByCreatedAt();
    for (OutboxEvent event : pending) {
        try {
            producer.send(new ProducerRecord<>(event.getTopic(),
                event.getPartitionKey(), event.getPayload())).get(); // block until Kafka acks
            event.setPublished(true);
            outbox.save(event);
        } catch (Exception e) {
            log.error("Outbox publish failed, will retry: {}", event.getId(), e);
        }
    }
}
```
Implementation — CDC variant (Debezium, no polling worker):
DB write → Debezium reads binlog → publishes outbox row changes to Kafka
Pros: no polling worker to operate, lower latency (event-driven not scheduled)
Cons: Debezium is another piece of infrastructure to run and monitor
Use polling when: you want simplicity, Debezium isn't already in your stack
Use CDC when: you need sub-second latency, Debezium is already running
Why consumers must be idempotent:
The worker publishes → Kafka acks → worker crashes before marking published=true. Next run publishes again. The consumer receives the same event twice. Without idempotency, the order gets confirmed twice or the inventory decremented twice. With an idempotency key on the consumer side, the second delivery is a no-op.
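A minimal sketch of that consumer-side guard, assuming an in-memory set where production code would use a DB table or Redis set checked in the same transaction as the side effect:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: deduplicating consumer for at-least-once delivery.
// A HashSet stands in for the durable processed-event store.
public class IdempotentConsumer {
    private final Set<String> processedEventIds = new HashSet<>();
    private int ordersConfirmed = 0;

    // Returns true if the event was applied, false if it was a duplicate no-op.
    public boolean onPaymentAuthorized(String eventId) {
        if (!processedEventIds.add(eventId)) {
            return false; // already processed — redelivery is a no-op
        }
        ordersConfirmed++; // the real side effect (confirm order, decrement stock, ...)
        return true;
    }

    public int ordersConfirmed() { return ordersConfirmed; }
}
```

Delivering the same event ID twice confirms the order exactly once, which is what turns the outbox's at-least-once delivery into effectively-once processing.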
Outbox vs direct Kafka publish:
| | Direct Publish | Outbox Pattern |
|---|---|---|
| Atomicity | None — DB and Kafka are independent | Guaranteed — outbox write is part of DB transaction |
| Crash safety | Crash between DB write and publish = lost event | Crash after DB commit = event retried by worker |
| Delivery | At-most-once (lost if crash) | At-least-once (duplicates possible, handled by idempotency) |
| Complexity | Simple | Outbox table + worker (or Debezium) |
| When to use | Acceptable to lose events (metrics, logs) | Event loss is unacceptable (payments, orders, inventory) |
Real examples:
- Stripe — outbox pattern for every payment event; worker retries ensure webhook delivery and reconciliation services always receive events.
- Walmart — POS sale writes to inventory DB + outbox in one transaction; CDC variant with Debezium publishes inventory changes to Elasticsearch and cache invalidation topics.
Durable Execution — Temporal and AWS Step Functions
Manages long-running, multi-step workflows that must survive service crashes, human delays, and arbitrary timeouts — resuming exactly where they left off.
How it differs from Saga:
- Saga — handles rollback when a step fails. State lives in your DB + event bus. You write the compensation logic.
- Durable Execution — handles everything: timeouts, retries, state persistence, resume after crash. The framework owns the workflow state. You write business logic; the framework handles fault tolerance.
They're not mutually exclusive — a Temporal workflow can implement a Saga internally. But if you're reaching for Temporal, you likely don't need to hand-roll Saga logic.
The signal — human-in-the-loop processes:
Any time a human can introduce an unpredictable delay, you have a durable execution problem:
- Driver has 10 seconds to accept a ride request
- User has 30 minutes to confirm an email
- Manager has 48 hours to approve an expense
- Payment processor takes up to 5 minutes to respond
A standard request/response or even a Kafka-based flow can't cleanly handle "wait up to 10 seconds, then try the next driver, then try the next, then give up" without significant custom state management. Durable execution handles this natively.
How Temporal works:
Temporal persists every step of a workflow to its own database. If the worker running your workflow crashes mid-execution, another worker picks it up and replays from the last committed step — no state lost.
Temporal Workflow: MatchRide(rideRequest)
1. candidates = findNearbyDrivers(rideRequest.location) // Activity
2. for each driver in candidates:
result = sendRequestWithTimeout(driver, timeout=10s) // Activity + built-in timeout
if result == ACCEPTED:
return assignRide(rideRequest, driver) // Activity
// timeout or DECLINED → loop continues automatically
3. if no driver found: cancelRide(rideRequest) // Activity
If the worker crashes between steps 2 and 3, Temporal replays from step 2 — no dropped ride request, no double-assignment. The 10-second timeout is declared in code, not implemented with Redis TTL + a background job.
Temporal vs AWS Step Functions:
| | Temporal | AWS Step Functions |
|---|---|---|
| Hosting | Self-hosted or Temporal Cloud (managed) | Fully managed (AWS) |
| Workflow definition | Code (Go, Java, Python, TypeScript) | JSON/YAML state machine |
| Flexibility | High — full programming language | Limited — state machine model |
| Latency | Low (~ms per activity) | Higher (~100ms+ per state transition) |
| Pricing | Open source free / Temporal Cloud per action | AWS pay-per-state-transition |
| Best for | Complex business logic, human delays, long-running | Simple AWS-native orchestration, Lambda chains |
When Temporal is the right answer:
- Multi-step workflows with human delays (ride acceptance, checkout abandonment, approval flows)
- Workflows that run for minutes, hours, or days
- Complex branching and retry logic you'd otherwise build with cron jobs + state tables
- Mission-critical flows where dropped requests directly cost revenue (ride-sharing, payments, order fulfillment)
When it's overkill:
- Simple timeout → retry loop with no branching — Redis TTL + a background job is sufficient and far simpler
- Short-lived operations (< 1 second) — just use async queues
- Teams unfamiliar with the framework — Temporal has a learning curve; don't introduce it for trivial orchestration
The nuance for interviews: Redis TTL handles the lock expiry problem (driver didn't respond, lock auto-releases). But who retries with the next driver? If it's just "try the next driver in a list," a simple background job works. If the workflow has branching, multiple fallback levels, SLA tracking, and audit requirements — that's when you reach for Temporal. Name Redis first as the simpler solution, then offer Temporal as the escalation path if the interviewer pushes on fault tolerance.
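The "simple background job" path can be sketched as a plain sequential-offer loop. This is a toy model under stated assumptions: `SequentialOfferMatcher` is a hypothetical name, and a callback stands in for the Redis-TTL wait on each driver's response:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Sketch of the non-Temporal alternative: offer the ride to each driver in
// turn, give each a fixed window to accept, move on otherwise. In production
// the window would be a Redis key with a TTL and this loop a background job;
// here `offers` answers "did this driver accept within the timeout?".
public class SequentialOfferMatcher {

    public static Optional<String> match(List<String> drivers,
                                         long timeoutMillis,
                                         BiPredicate<String, Long> offers) {
        for (String driver : drivers) {
            if (offers.test(driver, timeoutMillis)) {
                return Optional.of(driver); // first acceptance wins
            }
            // timeout or decline → fall through to the next candidate
        }
        return Optional.empty(); // nobody accepted → cancel the ride
    }
}
```

The moment this loop needs crash recovery, branching fallbacks, or SLA tracking, the state you would bolt on is exactly what Temporal persists for you.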
Real examples:
- Uber — built Cadence (open-sourced 2019); Temporal forked it and is now the leading implementation. Used for ride matching, driver onboarding, and surge pricing workflows.
- Netflix — uses Temporal for content encoding workflows (multi-step, hours-long)
- Stripe — uses workflow orchestration for payment retry and dispute resolution flows
- AWS — Step Functions orchestrate Lambda-based order fulfillment, image processing pipelines
Two-Phase Commit (2PC)
Coordinates a transaction across multiple independent databases or services so they all commit or all abort — atomically, in one shot.
NFR triggers: Strong consistency across multiple databases, true atomicity across services.
The Two Phases
Phase 1 — Prepare (voting):
Coordinator → "Can you commit transaction T?" → Participant A
Coordinator → "Can you commit transaction T?" → Participant B
Coordinator → "Can you commit transaction T?" → Participant C
Each participant:
- Writes the transaction to its WAL (durable on disk)
- Acquires all necessary locks
- Replies YES or NO
At this point resources are LOCKED. No other transaction can touch them.
Phase 2 — Commit or Abort:
All said YES → Coordinator sends COMMIT to all
Any said NO → Coordinator sends ABORT to all
Each participant executes and releases locks.
Coordinator logs the final decision.
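The two phases above can be modeled as a toy in-process coordinator. This sketch shows only the voting and decision logic; real 2PC adds WAL writes, lock management, and network calls, and `Participant` is an illustrative interface, not a standard API:

```java
import java.util.List;

// Toy model of 2PC's decision logic with in-process participants.
public class TwoPhaseCommit {

    public interface Participant {
        boolean prepare(String txId); // phase 1: write WAL, take locks, vote YES/NO
        void commit(String txId);     // phase 2: apply and release locks
        void abort(String txId);      // phase 2: roll back and release locks
    }

    // Returns true if the transaction committed on all participants.
    public static boolean run(String txId, List<Participant> participants) {
        // Phase 1 — collect votes; a single NO dooms the transaction
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) {
                allYes = false;
            }
        }
        // Phase 2 — broadcast one decision to every participant
        for (Participant p : participants) {
            if (allYes) p.commit(txId); else p.abort(txId);
        }
        return allYes;
    }
}
```

Note where the blocking problem lives: if the coordinator dies between the two loops, every participant that voted YES is stuck holding locks with no decision — which is exactly the failure mode described next.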
Why It's Slow
Every message is a network round-trip. With 3 participants:
Phase 1: Coordinator → A, B, C (3 messages out)
A, B, C → Coordinator (3 votes back)
Phase 2: Coordinator → A, B, C (3 messages out)
A, B, C → Coordinator (3 acknowledgements back)
= 4 network round-trips minimum
= 10–100ms added latency per transaction
= locks held for that entire duration
While locks are held, every other transaction that touches the same rows is blocked. Under high load this causes a queue of waiting transactions — throughput collapses.
The Coordinator Crash Problem — Why 2PC Is "Blocking"
This is the fatal flaw:
Phase 1 completes — all participants voted YES and locked their resources.
Coordinator crashes before sending Phase 2.
Participants are now stuck:
- They voted YES so they cannot unilaterally abort
- They have not received COMMIT so they cannot commit
- Their locks are held — no other transaction can proceed
They wait. Until the coordinator recovers.
This is why 2PC is called a blocking protocol — it cannot make progress without the coordinator. A crashed or slow coordinator stalls the entire distributed transaction, and every row it touched is locked for the duration.
Where 2PC Is Actually Used
Despite the problems, it has a place:
| Use case | Why 2PC is acceptable |
|---|---|
| PostgreSQL cross-DB transactions (`PREPARE TRANSACTION`) | Same datacenter, low latency, coordinator crash is rare, strong consistency required |
| Banking — money moves between two DB instances | Atomicity is non-negotiable, transaction volume is manageable, 50ms overhead is acceptable |
| XA transactions (Java EE / JTA) | Legacy enterprise apps coordinating DB + message broker atomically in one transaction |
| Single datacenter, 2–3 participants, low volume | Network is fast and reliable, coordinator failure is rare, throughput is not a concern |
When NOT to Use 2PC
- Microservices across the internet — network is unreliable, latency spikes are common, coordinator crash is a real risk
- High-throughput systems — locks block concurrent transactions, throughput collapses under load
- Services you don't own — can't implement the 2PC protocol on a third-party API (Stripe, Twilio, etc.)
- Anything that must survive coordinator failure gracefully — use Saga instead
2PC vs Saga
| | 2PC | Saga |
|---|---|---|
| Atomicity | True atomic — all or nothing in one shot | Eventual — compensate after the fact |
| Locks | Holds locks across all participants for the full duration | No cross-service locks |
| Coordinator crash | Blocking — participants stuck waiting | No blocking — each step is independent |
| Intermediate state | Never visible — all or nothing | Visible mid-saga (partial completion) |
| Throughput | Low — locks + round trips under load | High — async, no blocking |
| Network requirement | Reliable, low-latency (same datacenter) | Tolerates unreliable network |
| Best for | Same-datacenter, 2–3 DBs, low volume, must be truly atomic | Microservices, high scale, eventual consistency acceptable |
Real examples:
- PostgreSQL — `PREPARE TRANSACTION` / `COMMIT PREPARED` implements 2PC natively for cross-database operations.
- MySQL — XA transactions for coordinating with a message broker (e.g. commit a DB write and a message enqueue atomically).
- Banking — intra-bank transfers between two internal DB instances where atomicity is legally required and volume is controlled.
- Avoid at scale — Uber, Amazon, Netflix explicitly avoid 2PC across services. They use Saga (with Temporal or Axon) because 2PC's blocking behavior under failure is incompatible with their availability requirements.
Limits Quick Reference
Key numbers to memorize — throughput, capacity, and scaling thresholds for every major component in this doc.
| Component | Key Numbers |
|---|---|
| Redis Cache | 100GB RAM/instance, 100K–1M ops/sec, 512MB max value size |
| Redis Sorted Sets | 4.3B members max, ~50–100 bytes/member overhead, O(log n) all ops |
| Redis Pub/Sub | Millions of msgs/sec, no persistence, no history for late subscribers |
| Redis Geo | Precision: 4 chars=20km, 6 chars=610m, 8 chars=19m. 4.3B member limit. |
| Kafka | Millions msgs/sec/broker, 1MB default msg size, ~4K partitions/broker, 5–30ms latency |
| WebSockets | 10K–100K connections/server, ~2–8KB overhead per connection |
| Read Replicas | 5–10 practical limit, < 100ms replication lag same AZ, seconds cross-region |
| DB Sharding | Consider sharding at ~5K–10K writes/sec or ~1–2TB data |
| Bloom Filter | 1B items at 1% FPR ≈ 1.2GB RAM. No false negatives. No deletes. |
| Consistent Hashing | 100–200 virtual nodes/physical node. Adding a node remaps only 1/n keys. |
| Geohashing | 4 chars=20km, 5 chars=2.4km, 6 chars=610m, 7 chars=76m, 8 chars=19m |
Quick Reference — NFR → Building Block
Given a problem or NFR, this maps you directly to the component that solves it.
| NFR / Problem | Reach For |
|---|---|
| Read-heavy, slow DB | Redis Cache (cache-aside), Read Replicas, CDN |
| Write-heavy, DB bottleneck | Kafka, DB Sharding, WAL-based async replication |
| Real-time push to clients | WebSockets (bidirectional), SSE (server→client), Redis Pub/Sub (fan-out) |
| Latency < 100ms | Redis Cache, CDN, Redis Geo (for location), precomputed results |
| Proximity / location search | Redis Geo, Geohashing, Quadtree |
| Fan-out (1 event → many consumers) | Kafka (consumer groups), Redis Pub/Sub |
| Prevent double payment / double action | Idempotency Keys |
| Multi-step transaction, microservices | Saga Pattern |
| Multi-step transaction, single team, strong consistency | 2PC |
| Payment discrepancy detection | Reconciliation |
| Slow/failing downstream service | Circuit Breaker |
| API abuse, DDoS protection | Rate Limiter (Token Bucket at API Gateway) |
| Scale cache/DB nodes without resharding everything | Consistent Hashing |
| Avoid DB lookup for non-existent keys | Bloom Filter |
| Metrics, sensor data, time-based queries | Time-Series DB |
| Crash recovery, replication | Write-Ahead Log (WAL) |
| Files, images, video at scale | Object Storage (S3) |
| User file upload without proxying through backend | Presigned S3 URL (client uploads direct) |
| Ensure uploaded files are safe before serving | Two-bucket strategy + async processing pipeline |
| Prove API request is authentic, prevent replay | HMAC request signing + Nonce |
| Store encryption private keys securely (PCI-DSS) | HSM (AWS CloudHSM / AWS KMS) |
| Audit trail without changing app code | CDC (Debezium → Kafka) |
| Guarantee each payment event processed exactly once | Kafka exactly-once semantics |
| Push payment status updates to merchant servers | Webhook delivery with exponential backoff |
| Track payment lifecycle, enforce idempotency | PaymentIntent state machine |
| Consistency between transaction state and consumers | Two-phase event model |
| Leader election, distributed config | etcd (modern), Consul, ZooKeeper (legacy) |
| Cluster membership, failure detection | Gossip Protocol |
| How etcd / CockroachDB reach consensus | Raft |
| Prevent two workers processing same job | Distributed Lock (Redis SETNX or etcd) |
| Services finding each other dynamically | Service Discovery (Consul, Kubernetes Service) |