Building Blocks
The components you reach for once NFRs tell you what the system needs. Each entry covers what it is, the problem it solves, which NFR triggers it, and a real system example.
Caching
Reduce latency and DB load by serving frequently read data from faster storage layers closer to the application.
Redis Cache
Stores hot data in memory for sub-millisecond reads. Eliminates repeated DB queries for the same data.
NFR triggers: Low latency, read-heavy scale.
Cache patterns — pick one per use case:
| Pattern | How It Works | Use When |
|---|---|---|
| Cache-aside (lazy) | App checks cache first. On miss, reads DB, writes to cache. | General reads. Cache only what's actually requested. |
| Write-through | On every write, app writes to cache AND DB together. | Read-heavy data that must stay fresh (user profiles). |
| Write-behind (write-back) | App writes to cache only. Cache flushes to DB asynchronously. | Write-heavy. Risk: data loss if cache crashes before flush. |
| Read-through | Cache sits in front of DB. Cache fetches from DB on miss automatically. | Simplifies app code. Cache layer handles DB reads. |
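The cache-aside and write-through patterns from the table can be sketched in a few lines. This is a minimal illustration, not a production client — `db` is a plain dict standing in for any slow backing store, and the class name is hypothetical:

```python
import time

class CacheAside:
    """Cache-aside sketch: check cache first; on miss, read DB and
    populate. write_through shows the write-through variant: DB and
    cache updated in the same code path."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db                      # dict stand-in for the database
        self.ttl = ttl_seconds
        self.cache = {}                   # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]               # cache hit
        value = self.db[key]              # cache miss: read from DB
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def write_through(self, key, value):
        # Write-through: update DB AND cache together
        self.db[key] = value
        self.cache[key] = (value, time.time() + self.ttl)

db = {"user:1": {"name": "Ada"}}
c = CacheAside(db)
assert c.get("user:1")["name"] == "Ada"   # miss -> DB read, now cached
db["user:1"] = {"name": "Grace"}          # DB changed behind the cache
assert c.get("user:1")["name"] == "Ada"   # stale until TTL or invalidation
c.write_through("user:1", {"name": "Grace"})
assert c.get("user:1")["name"] == "Grace"
```

The stale read in the middle is the core tradeoff of lazy caching: without active invalidation, the cache serves old data until the TTL expires.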
Cache eviction policies:
- LRU (Least Recently Used) — evict the item not accessed for the longest time. Default for most use cases.
- LFU (Least Frequently Used) — evict the item accessed least often. Better for skewed access patterns.
- TTL (Time To Live) — expire entries after a fixed time. Good for data that goes stale (search results, session tokens).
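LRU is simple enough to sketch directly — an ordered map tracks access recency, and the oldest entry is evicted when capacity is exceeded (a minimal illustration; Redis uses an approximated LRU internally):

```python
from collections import OrderedDict

class LRUCache:
    """LRU eviction sketch: OrderedDict keeps access order; evict the
    entry not accessed for the longest time when over capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")            # "a" is now most recently used
cache.put("c", 3)         # evicts "b", the least recently used
assert cache.get("b") is None
assert cache.get("a") == 1 and cache.get("c") == 3
```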
Cache invalidation strategies:
Most production systems combine approaches — short TTL as a safety net plus active invalidation for critical data.
| Strategy | How It Works | Best For |
|---|---|---|
| TTL expiry | Entry automatically expires after a fixed time. Simple, no coordination needed. | Data that can tolerate some staleness (search results, recommendation scores) |
| Write-through | App deletes or updates cache immediately when it writes to DB, in the same code path. | User profiles, inventory counts — data that must stay fresh |
| Event-driven | CDC (Debezium) reads DB WAL, publishes change events to a queue, cache invalidation consumer acts on them. Clean separation of concerns. Better than DB triggers (which are hidden, hard to debug, tightly coupled to DB). | Cross-service invalidation, complex dependency graphs |
| Tagged invalidation | Cache entries are tagged (e.g. user:123:posts). On update, invalidate all entries sharing that tag. | Complex dependency patterns — a post update that should invalidate feed caches, search caches, and profile caches |
| Versioned cache keys | Include a version number in the key (event:123:v42). On update, increment version in DB → new key (event:123:v43). Old entries become unreachable, not deleted. No race conditions — a late writer can't overwrite new data because the DB forces a new version. Old entries expire naturally via TTL. | Entity-level data (product pages, event details) where you need immediate consistency without invalidation broadcasts |
| Deleted items cache | Maintain a small fast cache of recently deleted/modified item IDs. When serving feeds or lists, filter against this set before returning results. Lets you serve mostly-correct cached data while the larger cache structures are invalidated in the background. | Content moderation, privacy changes, soft deletes |
TTL tiering — driven by staleness requirements: Let your NFRs dictate TTLs. If requirements say "search results can be 30 seconds stale," that's your TTL. Static data (venue info, product descriptions) → long TTL (hours). Volatile data (seat availability, inventory counts) → short TTL (seconds). User profiles → medium TTL (5 minutes). Don't pick arbitrary TTLs — derive them from consistency requirements.
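The versioned-cache-key strategy from the table can be sketched as follows. Assumptions worth stating: the DB row carries a version column, and in this toy version the lookup reads the version from the "DB" dict on every request — in practice the version itself would live in a small hot cache or be embedded in the reference that points at the entity:

```python
class VersionedCache:
    """Versioned-key sketch: a write bumps the version in the DB, so
    readers build keys pointing at fresh data (event:123:v43). Old
    entries are unreachable, not deleted — they age out via TTL.
    Class and key names are illustrative."""

    def __init__(self):
        self.db = {}      # id -> (version, value); DB forces the version
        self.cache = {}   # "id:vN" -> value

    def write(self, item_id, value):
        version = self.db.get(item_id, (0, None))[0] + 1
        self.db[item_id] = (version, value)

    def read(self, item_id):
        version, value = self.db[item_id]   # version lookup (hot path)
        key = f"{item_id}:v{version}"
        if key not in self.cache:
            self.cache[key] = value         # populate under the new key
        return self.cache[key]

vc = VersionedCache()
vc.write("event:123", "doors 7pm")
assert vc.read("event:123") == "doors 7pm"
vc.write("event:123", "doors 8pm")          # bump -> new key, no broadcast
assert vc.read("event:123") == "doors 8pm"  # old entry simply unreachable
```

Note there is no invalidation step anywhere — the write never touches the cache, which is exactly why a late writer can't clobber fresh data.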
When NOT to cache:
- Real-time collaborative systems — Google Docs, multiplayer games. Every change must be immediately visible to all users. Caching actively hurts here.
- Strongly consistent financial data — account balances, inventory when overselling is catastrophic. Use caching carefully with aggressive invalidation or avoid entirely.
- Write-heavy systems — if your read/write ratio is 1:1 or 2:1 (e.g. Uber driver location updates), cache hit rates will be too low to justify the complexity.
- User-specific private data — private messages, personal preferences. Only one user ever requests them, so there's no cache hit rate benefit. Don't put these in a CDN.
Read/write ratio context: Content-heavy apps typically see 10:1 to 100:1 reads-to-writes. Instagram: millions read a post for every one posted. YouTube: billions of views vs millions of uploads. The higher the ratio, the more aggressively you should cache. Write-heavy systems (location tracking, IoT telemetry) don't benefit from the same strategies.
Single-threaded design: Commands execute one at a time, in order. This makes Redis easy to reason about — no locks, no concurrent writes to the same key, and commands like INCR are atomic by default (increment happens in a single step, no two clients can interleave). Redis 6+ added I/O threads for network reads/writes, but command execution remains sequential. The bottleneck is almost never the single thread — it's network round trips, which pipelining solves.
Persistence modes — pick based on durability requirements:
- No persistence (pure cache): DB is the source of truth. Redis crash = cache gone, app rebuilds on demand from DB. Simplest, fastest.
- RDB snapshots: Redis dumps memory to disk every N seconds. Fast recovery on restart ("warm" data), but writes since last snapshot are lost.
- AOF (Append-Only File): Every write logged to disk. appendfsync everysec (default) limits data loss to ~1 second. appendfsync always (every command) = near-zero data loss but slower throughput.
- Replicas: Primary handles writes, replicas handle reads. If primary fails, promote a replica. Data loss = at most milliseconds of lag.
Scaling:
- Reads: Add replicas. Primary takes all writes; replicas serve read traffic.
- Writes — client-side sharding: App hashes the key and routes to one of several independent Redis nodes. Tools like Ketama use consistent hashing so adding/removing a node only remaps a fraction of keys (not all of them).
- Writes — Redis Cluster: Automatic sharding and failover across nodes. More operational complexity but no changes needed in the app.
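The client-side sharding bullet can be made concrete with a small consistent-hash ring in the Ketama style: each node is placed at many virtual points on a ring, and a key routes to the first node clockwise from its hash. A sketch, with illustrative node names:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hashing sketch: adding a node only remaps the keys
    that fall near its ring points, not all keys."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                        # sorted (point, node) pairs
        for node in nodes:
            for i in range(vnodes):           # vnodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring3 = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
before = {k: ring3.node_for(k) for k in (f"user:{i}" for i in range(1000))}

# Add a fourth node: only ~1/4 of keys should move, not all of them
ring4 = ConsistentHashRing(["redis-1", "redis-2", "redis-3", "redis-4"])
moved = sum(1 for k, n in before.items() if ring4.node_for(k) != n)
assert 0 < moved < 500
```

With naive modulo hashing (`hash(key) % node_count`), adding a node would remap nearly every key and wipe out the cache; the ring keeps the blast radius to roughly `keys / node_count`.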
Limits:
- Single instance: up to ~100GB RAM (practical sweet spot: 10–64GB)
- Throughput: ~100K–1M ops/sec
- Max key or value size: 512MB
- Max keys per instance: ~4 billion
Real example: Twitter precomputes home timelines into Redis. A tweet read is a Redis list lookup, not a DB join across followers.
Cache Stampede (Thundering Herd on Expiry)
When a popular cache entry expires, all concurrent requests see a cache miss simultaneously and race to rebuild it from the DB. At 100K reads/sec, that's 100K identical DB queries in the same instant — the DB collapses under a self-inflicted DDoS from your own application. This is distinct from a Redis crash (Scenario 1 in What-If) — the cache is healthy, but one entry's TTL expired.
Three mitigations:
1. Probabilistic early refresh — serve cached data while background-refreshing before TTL expires. As the entry ages, each request has a small increasing probability of triggering a background refresh. At 50 minutes into a 60-minute TTL: 1% chance of refresh. At 58 minutes: 10%. At 59 minutes: 20%. By expiry, the entry has already been refreshed. No stampede ever occurs. This is the cleanest solution for most cases.
2. Background refresh process — a dedicated job continuously refreshes critical cache entries before they expire. Homepage cache refreshes every 50 minutes on a 60-minute TTL. Guarantees the entry never goes cold. Cost: infrastructure complexity and wasted refreshes for entries nobody requested. Worth it for your highest-traffic keys.
3. Request coalescing — when multiple requests see the same cache miss, only one fetches from the DB; the rest wait for that result. The key insight: even with millions of users hitting the same key, your DB only receives N requests — one per app server doing the coalescing. Not a full solution (the rebuilding request still hits DB) but dramatically reduces the blast radius.
Cache miss for key "celebrity:123"
→ App server 1: acquires in-flight lock, fetches from DB, populates cache
→ App server 2: sees in-flight lock, waits for app server 1's result
→ Result shared to all waiters — DB hit exactly once per app server
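The in-flight-lock flow above can be sketched with a per-key event: the first caller (the leader) fetches; concurrent callers wait on the event and share the leader's result. A minimal single-process sketch — `slow_fetch` is a stand-in for the DB query:

```python
import threading
import time

class Coalescer:
    """Request-coalescing sketch: one fetch per key in flight;
    concurrent callers for the same key wait and share the result."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.lock = threading.Lock()
        self.inflight = {}                # key -> (Event, result holder)

    def get(self, key):
        with self.lock:
            entry = self.inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self.inflight[key] = entry
        event, holder = entry
        if leader:
            holder["value"] = self.fetch(key)   # the one DB hit
            with self.lock:
                del self.inflight[key]
            event.set()
        else:
            event.wait()                        # share the leader's result
        return holder["value"]

db_hits = []
def slow_fetch(key):
    db_hits.append(key)
    time.sleep(0.2)                       # simulate a slow DB query
    return f"value-for-{key}"

c = Coalescer(slow_fetch)
results = []
threads = [threading.Thread(target=lambda: results.append(c.get("celebrity:123")))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
assert all(r == "value-for-celebrity:123" for r in results)
assert len(db_hits) < 10                  # ~1 DB hit instead of 50
```

Run across a fleet, each app server coalesces its own requests — hence "one DB hit per app server" rather than one per user.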
Hot key fanout — for extreme loads where even request coalescing isn't enough: store identical copies under multiple keys (feed:taylor-swift:1, feed:taylor-swift:2, … feed:taylor-swift:10). Readers randomly pick one. Spreads 500K req/sec across 10 keys at 50K each. Tradeoff: memory × N copies, and invalidation must clear all replicas.
In an interview: Mention probabilistic early refresh or background refresh for write-through caches on critical data. Mention request coalescing when discussing how you'd handle a celebrity post going viral.
Redis Sorted Sets (Leaderboards, Rankings)
A set where every member has a numeric score. Range queries by score are O(log n).
NFR triggers: Low latency ranked queries.
Operations: ZADD, ZRANGE, ZRANK, ZREVRANGE
Limits:
- Max members per sorted set: ~4.3 billion
- Memory per member: ~50–100 bytes overhead + key + value size
- At 10M members: ~500MB–1GB RAM — still comfortable on a typical instance
- All operations O(log n) — fast even at millions of members
Real example: Gaming leaderboard — score is points, member is user ID. ZREVRANGE leaderboard 0 9 returns top 10 instantly.
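The sorted-set operations map cleanly onto a toy leaderboard. Redis keeps these O(log n) with a skip list; this sketch just sorts on read to show the API shape (names are illustrative):

```python
class Leaderboard:
    """Sorted-set sketch: score per member, ranked reads mirroring
    ZADD / ZREVRANGE / (descending) ZRANK."""

    def __init__(self):
        self.scores = {}                  # member -> score

    def zadd(self, member, score):
        self.scores[member] = score       # add or update the score

    def zrevrange(self, start, stop):
        # Highest score first, like ZREVRANGE leaderboard start stop
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        return [m for m, _ in ranked[start:stop + 1]]

    def zrank_desc(self, member):
        # Position in descending order ("you are ranked #N")
        return self.zrevrange(0, len(self.scores) - 1).index(member)

lb = Leaderboard()
for user, points in [("u1", 120), ("u2", 340), ("u3", 95), ("u4", 340)]:
    lb.zadd(user, points)
assert set(lb.zrevrange(0, 1)) == {"u2", "u4"}   # top 2 by score
assert lb.zrank_desc("u3") == 3                  # lowest score, last place
```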
Redis Pub/Sub
Publisher sends a message to a channel. All subscribers to that channel receive it instantly. Fire-and-forget — no persistence, no queue.
NFR triggers: Real-time notifications, fan-out to multiple consumers.
How it works:
Publisher → PUBLISH channel "message"
Subscriber → SUBSCRIBE channel → receives "message"
Limitation: Messages are lost if no subscriber is listening at that moment. Not durable. Use Kafka if you need delivery guarantees.
Limits:
- Throughput: millions of messages/sec — very fast, no disk I/O
- Channels: unlimited (memory bound — each active channel with subscribers uses RAM)
- No message history: a subscriber joining late gets nothing from before they subscribed
- No persistence: Redis crash = all in-flight messages lost
- Not suitable for > 10K active channels with high-volume subscribers without careful memory planning
Real example: Chat app — when a user sends a message, publish to a channel named after the chat room. All connected users subscribed to that channel receive it instantly.
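The fire-and-forget semantics are easy to demonstrate in-process: subscribers are callbacks keyed by channel, and a message published with no one listening is simply gone — exactly the durability gap that pushes you toward Kafka. A sketch:

```python
from collections import defaultdict

class PubSub:
    """Pub/Sub sketch: instant fan-out to current subscribers only.
    No history, no persistence — late subscribers get nothing."""

    def __init__(self):
        self.channels = defaultdict(list)  # channel -> [callback, ...]

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        subs = self.channels.get(channel, [])
        for cb in subs:
            cb(message)                    # delivered to listeners *now*
        return len(subs)                   # PUBLISH returns receiver count

bus = PubSub()
assert bus.publish("room:42", "early") == 0     # nobody listening: lost
inbox_a, inbox_b = [], []
bus.subscribe("room:42", inbox_a.append)
bus.subscribe("room:42", inbox_b.append)
assert bus.publish("room:42", "hello") == 2     # fan-out to both
assert inbox_a == ["hello"] and inbox_b == ["hello"]  # "early" never arrived
```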
CDN (Content Delivery Network)
Caches static and cacheable content at edge nodes geographically close to users.
NFR triggers: Low latency for global users, high scale reads of static content.
What it caches: Images, JS/CSS, video segments, API responses with cache headers.
Cache hit: Served from edge node ~5–20ms from user. No origin request. Cache miss: Fetches from origin, caches for next time.
Real example: Netflix stores video chunks in CDN edge nodes. 95%+ of video traffic never hits Netflix origin servers.
Messaging & Queues
Decouple producers from consumers and absorb write bursts so a slow downstream never blocks or overwhelms the rest of the system.
Kafka (Message Queue / Event Stream)
Durable, distributed log of messages. Producers append to a topic. Consumers read at their own pace. Messages are retained (not deleted on consume).
NFR triggers: Write-heavy scale, async processing, fault tolerance, event replay.
Key concepts:
- Topic — named stream of messages (e.g. user-events)
- Partition — topics are split into partitions for parallelism. Messages in a partition are ordered.
- Consumer group — multiple consumers share a topic, each partition assigned to one consumer. Scale consumers independently.
- Offset — each message has an offset. Consumers track their position. Can replay from any offset.
- Retention — messages kept for N days regardless of consumption. Enables replay and audit.
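Offsets and retention are the concepts that most distinguish Kafka from a traditional queue, and they fall out naturally if you model a partition as an append-only list. A sketch (a single partition; names are illustrative):

```python
class Partition:
    """Kafka-partition sketch: an append-only log where each consumer
    group tracks its own offset. Messages are retained, not deleted on
    consume, so any group can replay from any offset."""

    def __init__(self):
        self.log = []                     # offset = index in the list
        self.offsets = {}                 # group_id -> next offset to read

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1          # offset of the appended message

    def consume(self, group_id, max_messages=10):
        start = self.offsets.get(group_id, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group_id] = start + len(batch)  # commit new position
        return batch

    def seek(self, group_id, offset):
        self.offsets[group_id] = offset   # replay from any retained offset

p = Partition()
for evt in ["pay:1", "pay:2", "pay:3"]:
    p.produce(evt)
assert p.consume("webhooks") == ["pay:1", "pay:2", "pay:3"]
assert p.consume("analytics") == ["pay:1", "pay:2", "pay:3"]  # independent offset
p.seek("webhooks", 1)
assert p.consume("webhooks") == ["pay:2", "pay:3"]            # replay
```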
acks setting and durability:
- acks=0 — fire and forget. Fastest. No guarantee.
- acks=1 — leader confirms. Fast. Leader crash = message lost.
- acks=all — all in-sync replicas confirm. Slowest (+2–5ms). No data loss.
Limits:
- Throughput: millions of messages/sec per broker. LinkedIn processes 7 trillion messages/day across their cluster.
- Message size: default 1MB, configurable up to 10MB+ (large messages reduce throughput significantly)
- Partitions: practical limit ~4K per broker — more partitions = more memory and slower failover
- Consumer lag: consumers can fall hours or days behind and catch up without message loss (within retention window)
- Retention: configurable by time (days/weeks) or size (GB per partition)
- Latency: typically 5–15ms end-to-end at acks=1. Up to 10–30ms at acks=all.
Exactly-once semantics (critical for financial systems):
Without exactly-once, a consumer crash after processing but before committing its offset causes reprocessing on restart — double payments. Kafka solves this with:
- Idempotent producers: Producer assigns a sequence number to each message. Broker deduplicates retries. No duplicate writes even on retry.
- Transactional API: Atomic read-process-write. Offset commit and output write happen in one transaction. Either both succeed or neither does.
- Result: Each payment event processed exactly once — not duplicated, not lost.
Consumer groups per service (independent offsets):
Each service (reconciliation, webhook delivery, analytics) has its own consumer group with a unique group_id. This means every service receives all events independently — they don't compete. Each maintains its own offset (position in the log). Reconciliation service being slow doesn't affect webhook delivery.
Within a consumer group, each partition is assigned to exactly one consumer instance. Max parallelism = partition count. To scale: add consumer instances up to partition count. If a consumer fails, Kafka auto-rebalances partitions to remaining instances.
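The partition-to-consumer rule can be shown with a tiny assignment function. This is a simplified round-robin sketch, not Kafka's actual assignor protocol (which also handles sticky rebalancing):

```python
def assign_partitions(partitions, consumers):
    """Assignment sketch: each partition goes to exactly one consumer
    in the group, so max parallelism = partition count."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        owner = consumers[p % len(consumers)]   # spread round-robin
        assignment[owner].append(p)
    return assignment

# 6 partitions, 2 consumers -> 3 partitions each
assert assign_partitions(6, ["c1", "c2"]) == {"c1": [0, 2, 4], "c2": [1, 3, 5]}

# A consumer fails: rebalance over the remaining instance
assert assign_partitions(6, ["c1"]) == {"c1": [0, 1, 2, 3, 4, 5]}

# More consumers than partitions: the extras sit idle
assert assign_partitions(2, ["c1", "c2", "c3"])["c3"] == []
```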
Real example: Uber publishes every GPS location update to Kafka. Multiple consumers (routing, surge pricing, analytics) read independently at their own pace. Stripe uses Kafka with consumer groups per service — webhook delivery, reconciliation, and analytics each consume the same payment events independently.
Write-Ahead Log (WAL)
Before any data change is applied, the change is written to an append-only log on disk. On crash, replay the log to recover.
NFR triggers: Durability (RPO = seconds), crash recovery.
How it works: Every DB write first appends to WAL. If the server crashes mid-write, the WAL is replayed on restart. No data loss beyond what's in the log.
Also used for replication: PostgreSQL streams its WAL to replicas. Replicas replay the log to stay in sync. This is how read replicas work internally.
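The append-then-apply discipline and crash recovery can be sketched in a dozen lines. `wal` here is a list standing in for the on-disk log; a real system would fsync each append:

```python
class WALStore:
    """Write-ahead-log sketch: every change is appended to the log
    before being applied in memory; recovery replays the log."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []  # durable, append-only
        self.data = {}                             # in-memory state
        for key, value in self.wal:                # crash recovery: replay
            self.data[key] = value

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to WAL first (fsync here)
        self.data[key] = value          # 2. then apply the change

store = WALStore()
store.put("balance:alice", 100)
store.put("balance:alice", 70)

# Simulate a crash: memory is gone, only the WAL survives on disk
recovered = WALStore(wal=store.wal)
assert recovered.data["balance:alice"] == 70   # replay restores state
```

The same replay mechanism is what a PostgreSQL replica does continuously with the streamed WAL — recovery and replication are the same operation applied at different times.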
Real example: PostgreSQL, MySQL InnoDB, Kafka itself — all use WAL for durability.
Real-time Communication
When the client needs to receive data pushed from the server (chat, notifications, live scores), you have four options. Pick based on latency needs, direction, and infrastructure cost.
| Technique | Direction | Connection | Latency | Use When |
|---|---|---|---|---|
| Short Polling | Client → Server (repeated) | New HTTP request every N seconds | High (N seconds) | Simple. OK for non-urgent updates (email inbox refresh). |
| Long Polling | Client → Server (held open) | Client sends request, server holds it open until data is ready, then responds | Medium (~seconds) | Better than short polling. No persistent connection. |
| Server-Sent Events (SSE) | Server → Client (one-way) | Single persistent HTTP connection, server pushes events | Low | Notifications, live feeds. Simple to implement. No bidirectional needed. |
| WebSockets | Bidirectional | Persistent TCP connection, both sides can send anytime | Very low (~ms) | Chat, multiplayer gaming, collaborative editing, live trading. |
Short Polling: Client polls every 5 seconds. Simple but wasteful — most requests return nothing.
Long Polling: Client sends request. Server holds it open (up to 30s). When data arrives, server responds. Client immediately sends the next request. Simulates push without a persistent connection.
SSE: HTTP connection stays open. Server pushes data: ... events whenever it wants. Client uses EventSource API. One-way only (server → client). Simpler than WebSockets for notifications.
SSE wire format — each event is plain text:
data: {"seat": "A12", "status": "reserved"}\n\n ← basic event
id: 42\n\n ← lets client send Last-Event-ID on reconnect
event: seat-update\n\n ← named event type
retry: 3000\n\n ← tell client to wait 3s before reconnecting
Two newlines (\n\n) end each event. EventSource auto-reconnects and sends Last-Event-ID so the server can resume from where the client left off.
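A small formatter makes the wire format concrete — note that id/event/retry and data belong to a single event block, sharing one terminating blank line (the function name is illustrative):

```python
def sse_event(data, event_id=None, event_type=None, retry_ms=None):
    """Formats one Server-Sent Event: optional id/event/retry fields,
    then the data line, terminated by a blank line (\n\n)."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event_type is not None:
        lines.append(f"event: {event_type}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    lines.append(f"data: {data}")
    return "\n".join(lines) + "\n\n"

msg = sse_event('{"seat": "A12", "status": "reserved"}',
                event_id=42, event_type="seat-update", retry_ms=3000)
assert msg == ('id: 42\n'
               'event: seat-update\n'
               'retry: 3000\n'
               'data: {"seat": "A12", "status": "reserved"}\n\n')
```

On reconnect, the browser sends the last seen `id` as the `Last-Event-ID` header, which is what lets the server resume the stream.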
SSE production considerations (when you control the stack):
- Idle timeouts: Cloud load balancers (GCP ~30s, AWS ALB ~60s default) will kill quiet SSE connections. Fix: send a periodic heartbeat comment from the server every ~20s: ": keepalive\n\n". These are ignored by EventSource but keep the connection alive.
- Proxy buffering: Proxies like Nginx buffer responses by default — they wait for the "full" response before forwarding, which breaks infinite SSE streams. Fix: proxy_buffering off in the Nginx config. A one-liner, not an ongoing concern once set. Only a real problem if you don't control the proxy (corporate proxies, CDN edge nodes on lower-tier plans).
- HTTP/2: Removes the HTTP/1.1 browser limit of 6 connections per domain. Multiple SSE streams now multiplex over one connection — the "can't have multiple live streams" concern is gone on HTTP/2.
- Mobile clients: The OS may suspend the network connection when the app goes to background. Network switches (WiFi → LTE) also drop connections. EventSource auto-reconnects, and Last-Event-ID handles resumability. Design your server to replay missed events by ID.
- Pod restarts / scaling: Long-lived connections mean all clients on a restarting pod reconnect at once. Stagger restarts and implement reconnect jitter on the client to avoid a thundering herd.
WebSockets: Full-duplex persistent TCP connection. Both sides send messages anytime. Higher infrastructure cost (stateful connections — load balancer must route same client to same server, or use Redis Pub/Sub to broadcast across servers).
SSE vs WebSocket — when SSE wins: SSE is plain HTTP — no protocol upgrade, works through most firewalls and proxies, EventSource handles reconnect automatically, simpler to implement and monitor. If you only need server → client push (notifications, live feeds, progress updates), SSE is the better default. Only reach for WebSockets when you need bidirectional messaging.
Limits:
- Connections per server: ~10K–100K out of the box, but with OS tuning easily 1M+ on a single server
- Memory per connection: ~2–8KB kernel overhead → 100K connections ≈ 200MB–800MB RAM — this is the real bottleneck, not port numbers
- Scale-out: requires sticky sessions (same client always hits same server) or a fan-out layer (Redis Pub/Sub) so any server can push to any client
The 65,535 connection limit is a myth. That number is the range of TCP port numbers — not the number of connections a server port can accept. Each TCP connection is identified by a 4-tuple: source IP + source port + destination IP + destination port. Your server listens on one port (e.g. 443), but every client brings its own source IP and source port — so the combination is unique per client. A single listening port can theoretically accept connections from billions of clients. The real limits are:
- File descriptors — Linux gives each process 1,024 open files by default. Every connection is one fd. Tune with ulimit -n 1000000 and fs.file-max in /etc/sysctl.conf.
- RAM — ~2–8KB kernel overhead per connection. 1M connections ≈ 2–8GB just for kernel state, before your app's per-connection memory.
- CPU — context switching and event loop pressure as connection count grows.
With tuning (ulimit, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog), servers routinely handle 1M+ concurrent connections. This is what the C10M problem (10 million connections) research is about.
Real examples:
- Slack: WebSockets for real-time messages.
- Twitter feed: SSE for new tweet notifications.
- GitHub CI status page: SSE or long polling.
- Stock ticker: WebSockets for live prices.
- Ticketmaster seat map: SSE to push seat reservations in real-time — when any user reserves a seat, push an event to all viewers of that event's seat map so it goes grey instantly without a page refresh.
Mobile Push Notifications — APNs and FCM
The mandatory platform gateways for delivering notifications to iOS and Android devices. You cannot push directly to a phone — Apple and Google each run a persistent connection infrastructure to every device on their platform. Your server talks to their gateway; the gateway handles delivery.
NFR triggers: Real-time notifications, user engagement, background alerts.
The two gateways:
| | APNs | FCM |
|---|---|---|
| Run by | Apple | Google (Firebase) |
| Covers | iOS, macOS, watchOS | Android, web push (Chrome, Firefox) |
| Auth | Device token + JWT or certificate | Registration token + service account key |
| Offline behaviour | Stores last notification, delivers on reconnect (TTL configurable) | Same — queues and delivers when device is reachable |
| Fan-out | One message per device | Topics: one message fans out to all subscribers |
Flow:
Mobile app launches for first time
→ registers with APNs / FCM
→ receives device token / registration token
→ app sends token to your backend → stored against user
Your Notification Service receives trigger (ticket confirmed, seat available):
→ look up user's device token + platform
→ iOS: POST to api.push.apple.com with token + payload
→ Android: POST to fcm.googleapis.com with token + payload
→ gateway delivers to device (immediately if online, queued if offline)
You never talk to the device directly. The token is the address.
Payload:
{
"title": "Your tickets are confirmed",
"body": "Taylor Swift · Madison Square Garden · June 14",
"data": { "eventId": "123", "screen": "booking_confirmation" }
}
data is a silent background payload — app acts on it (update badge count, prefetch content) without showing a visible notification.
Scaling — decouple from request path:
At high volume (millions of notifications per event), never call APNs/FCM synchronously inline. Publish notification jobs to a queue and have worker pools call the gateways asynchronously:
Ticket confirmed → Notification Service publishes to Kafka "notifications" topic
Worker pool reads from topic → calls APNs / FCM in parallel
→ log delivery status per notification
This decouples your app from gateway latency and handles burst fan-outs gracefully. APNs supports HTTP/2 connection pooling — one persistent connection multiplexes thousands of notifications rather than opening a new connection per message.
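The queue-plus-worker-pool decoupling can be sketched with a standard in-process queue — `fake_gateway_send` is a hypothetical stand-in for the HTTP/2 POST to APNs or FCM:

```python
import queue
import threading

delivered = []

def fake_gateway_send(token, payload):
    """Stand-in for a POST to APNs/FCM — just records the delivery."""
    delivered.append((token, payload["title"]))

def worker(jobs):
    while True:
        job = jobs.get()
        if job is None:                 # poison pill: shut the worker down
            return
        fake_gateway_send(*job)
        jobs.task_done()

# The Kafka "notifications" topic stand-in: a queue off the request path
jobs = queue.Queue()
pool = [threading.Thread(target=worker, args=(jobs,)) for _ in range(4)]
for t in pool:
    t.start()

# The request path only enqueues — it never blocks on gateway latency
for i in range(100):
    jobs.put((f"device-token-{i}", {"title": "Your tickets are confirmed"}))

jobs.join()                             # wait for the worker pool to drain
for _ in pool:
    jobs.put(None)
for t in pool:
    t.join()
assert len(delivered) == 100
```

The request handler's cost is one `put()`; all gateway latency, retries, and burst absorption live in the worker pool.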
Abstraction options — you rarely call the raw APIs directly:
- firebase-admin SDK — official Google SDK, handles FCM for any backend language
- AWS SNS — single API that routes to APNs, FCM, or SMS based on platform. Removes the need to manage two gateways separately.
- OneSignal / Twilio Notify — managed services that abstract both gateways, add analytics, segmentation, and retry logic
In an interview: "Device tokens are registered on first app launch and stored in our User Service. Our Notification Service publishes jobs to Kafka; worker pools consume and call APNs for iOS and FCM for Android. We never block the booking request path on notification delivery — it's fully async."
Virtual Waiting Queue
A controlled admission system placed in front of a high-demand service. Users queue before they can access the booking page. The queue meters user flow to prevent system overload and, crucially, to improve UX — during a Taylor Swift on-sale, the seat map fills in milliseconds. Without a queue, users see a chaotic, constantly-shifting map and repeatedly click seats that are already gone. With a queue, they enter when there are seats available for them, or they get a clear wait time.
The senior vs staff distinction: The technically harder solution (real-time seat map via SSE) handles moderately popular events well but breaks down at extreme scale. The virtual queue is architecturally simpler and solves the UX problem more completely. Knowing when to trade technical complexity for a better user experience is a staff-level signal in interviews.
Architecture:
User requests booking page
→ placed in virtual queue (not shown the seat map yet)
→ SSE connection established for position updates
→ Redis sorted set: ZADD queue:{eventId} <timestamp> <userId>
Queue manager (periodic job or event-driven):
→ ZPOPMIN queue:{eventId} N ← dequeue N users
→ SADD admitted:{eventId} <userId> (with TTL, e.g. 10 min)
→ push "you're admitted" via SSE connection
User's browser receives SSE event → redirects to booking page
Booking Service:
→ on every reservation request, check SISMEMBER admitted:{eventId} <userId>
→ reject if not admitted — even if request is forged or replayed
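The architecture above can be sketched end to end in-process: a heap stands in for the Redis sorted set (score = arrival time), and a dict with expiry timestamps stands in for the admitted set with TTL. Names are illustrative:

```python
import heapq
import itertools
import time

class VirtualQueue:
    """Virtual-waiting-queue sketch: FIFO queue + TTL-gated admitted set."""

    def __init__(self, admit_ttl=600):
        self.heap = []                    # (arrival_order, user_id) — FIFO
        self.counter = itertools.count()  # monotonic stand-in for timestamp
        self.admitted = {}                # user_id -> admission expiry time
        self.admit_ttl = admit_ttl

    def join(self, user_id):              # ZADD queue:{eventId} <ts> <userId>
        heapq.heappush(self.heap, (next(self.counter), user_id))

    def admit(self, n):                   # ZPOPMIN N + mark admitted with TTL
        admitted = []
        for _ in range(min(n, len(self.heap))):
            _, user = heapq.heappop(self.heap)
            self.admitted[user] = time.time() + self.admit_ttl
            admitted.append(user)         # here: push "you're admitted" via SSE
        return admitted

    def is_admitted(self, user_id):       # the Booking Service's gate check
        return self.admitted.get(user_id, 0) > time.time()

q = VirtualQueue()
for u in ["alice", "bob", "carol"]:
    q.join(u)
assert q.admit(2) == ["alice", "bob"]     # FIFO: oldest entries first
assert q.is_admitted("alice")
assert not q.is_admitted("carol")         # still waiting — booking rejected
```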
Key design decisions:
- Redis Sorted Set (ZADD queue:{eventId} <timestamp> <userId>) — timestamp as score gives FIFO ordering. ZPOPMIN dequeues the oldest entries atomically. ZRANK returns a user's position for the "you're #4,231 in line" display.
- SSE over WebSocket — the queue only needs server → client communication (position updates, admission notification). SSE is simpler: no protocol upgrade, works through HTTP proxies, auto-reconnects natively. Only use WebSocket if you need bidirectional messaging (e.g. user can send a "cancel my place" message).
- Admitted set with TTL — mark the user admitted with a 600-second expiry (10 minute window). Note SADD has no per-member TTL, so in practice use a per-user key: SET admitted:{eventId}:{userId} 1 EX 600. The Booking Service checks this before allowing any reservation. If the user's admission expires (they walked away), the slot opens for the next person in queue. TTL should match your checkout window.
- Dequeue rate — tune based on how fast seats are being sold, not a fixed rate. If 100 seats remain and each user takes ~5 minutes to book, admit 20 users every minute. Admitting too many causes the seat map to fill before they book; too few means slow throughput.
What to tell the interviewer: "For a Taylor Swift on-sale, I'd enable the virtual queue as an admin-controlled feature. Normal events don't need it — SSE seat map updates are enough. The queue sits in front of the Booking Service, not the Event Service, so browsing and discovery still work normally. I'd use Redis Sorted Set for ordering, SSE for updates, and an admitted set with TTL as the gate the Booking Service checks."
Geospatial
Techniques for storing and querying location data at scale — proximity search, radius queries, and indexing millions of moving entities.
Why you can't use a regular database for high-frequency location updates:
Two problems compound each other at scale:
Write volume: 10M drivers sending location every 5 seconds = 2M writes/sec. A well-tuned PostgreSQL handles ~5–10K writes/sec before sharding. DynamoDB can absorb the throughput but at on-demand pricing ($1.25/million WRUs), 2M writes/sec of 100-byte payloads costs ~$200K/day. Neither is viable as a naive solution.
B-tree indexes don't work for 2D geo queries: A standard B-tree index works on one dimension — it can find all rows where lat BETWEEN 40.0 AND 41.0 efficiently, but combining that with lng BETWEEN -74.0 AND -73.0 requires two separate index scans merged by the query planner. For a true radius query (find all points within 5km of this coordinate), there's no efficient B-tree representation. The DB falls back to a full table scan calculating distance for every row. At millions of rows this is a non-starter.
Three approaches in order of quality:
| Approach | How | Problem |
|---|---|---|
| Direct DB writes | Write every update to DB, query with lat/lng columns | Write overload + B-tree can't do radius queries efficiently |
| Batch processing | Aggregate updates over N seconds, write to DB in batches | Reduces write load but location data is always N seconds stale — suboptimal matches |
| Redis Geo (best) | Write to Redis in-memory geo store, query with GEOSEARCH | Real-time, handles 2M writes/sec, sub-ms proximity queries |
Geohashing
Encodes a (latitude, longitude) pair into a short string. Nearby locations share a common prefix.
NFR triggers: Proximity search, location-based queries.
How it works: The world is recursively divided into a grid. Each cell gets a string like u4pruydqqv. The longer the string, the smaller the cell. Nearby points share prefixes: u4pru and u4prv are neighbors.
Use in a system: Store geohash as a DB column. To find nearby items, query for all rows whose geohash starts with the same prefix. Fast with a string index.
Limitation: Edge cases at grid boundaries — two points physically close can have different prefixes if they're on opposite sides of a cell boundary. Fix by also querying the 8 neighboring cells.
Precision levels:
| Length | Cell size |
|---|---|
| 4 | ~20 km |
| 5 | ~2.4 km |
| 6 | ~610 m |
| 7 | ~76 m |
| 8 | ~19 m |
Real example: Uber stores driver locations as geohashes. Finding nearby drivers = prefix match on geohash string.
Redis Geo
Redis sorted set where members are stored by a 52-bit geohash score. Enables radius and distance queries in O(n + log m). The best solution for high-frequency location updates at scale.
Commands:
- GEOADD drivers 13.361 38.115 "driver:42" — add or update location (overwrites the previous position for the same member — you always have the latest)
- GEODIST drivers driver:42 driver:99 km — distance between two members
- GEOSEARCH drivers FROMMEMBER driver:42 BYRADIUS 5 km ASC — find members within a radius, sorted by distance (Redis 6.2+ — preferred over the deprecated GEORADIUS)
- GEOSEARCH drivers FROMLONLAT 15.0 37.0 BYRADIUS 5 km ASC COUNT 10 — find the nearest 10 members from a coordinate
- GEOSEARCH drivers FROMLONLAT 15.0 37.0 BYBOX 10 10 km ASC — find members within a bounding box
GEOSEARCH (Redis 6.2+) replaces the older GEORADIUS and GEORADIUSBYMEMBER commands. It supports both radius and bounding box queries, and can search from either a fixed coordinate or an existing member.
Precision levels (geohash string length):
| Precision | Cell size | Use for |
|---|---|---|
| 4 chars | ~20 km | City-level search |
| 5 chars | ~2.4 km | Neighborhood |
| 6 chars | ~610 m | Street-level |
| 8 chars | ~19 m | Building-level |
Redis as the live location store — the full pattern:
For a ride-sharing system with 10M drivers updating every 5 seconds, Redis is the primary store for live positions. The DB stores historical trip data, not current locations.
Driver app sends GPS update every 5 seconds
→ Location Service: GEOADD drivers <lng> <lat> "driver:42"
→ Also: ZADD driver_timestamps <unix_timestamp> "driver:42"
Rider requests a match:
→ GEOSEARCH drivers FROMLONLAT <rider_lng> <rider_lat> BYRADIUS 5 km ASC COUNT 20
→ Returns nearest 20 active drivers instantly
Stale driver cleanup (runs every 30 seconds):
→ ZRANGEBYSCORE driver_timestamps 0 <now - 30s> ← drivers not updated in 30s
→ ZREM driver_timestamps <stale_drivers>
→ ZREM drivers <stale_drivers>
The companion driver_timestamps sorted set (score = last update timestamp) lets you efficiently find and remove offline drivers. Without it, offline drivers stay in the geo set indefinitely and appear available to riders.
Durability tradeoff — location data is ephemeral by nature:
Redis is in-memory. If it crashes, live location data is lost. For most data this is unacceptable — but driver locations are special: every driver re-reports their position within 5 seconds. A Redis crash means 5 seconds of stale data before the geo set is fully rebuilt. This is an acceptable tradeoff explicitly worth stating in an interview.
Mitigation options if you need stronger guarantees:
- RDB snapshots — Redis dumps memory to disk every N seconds. Fast recovery, small data loss window.
- AOF (Append-Only File) — every write logged to disk. Near-zero data loss. Higher I/O overhead.
- Redis Sentinel — automatic failover for a primary + replica setup. If primary dies, Sentinel promotes a replica to primary in seconds. Location data on the replica is at most milliseconds stale.
- Redis Cluster — shards data across nodes. Each shard has its own primary + replicas. For 10M drivers you'd shard by city or geohash prefix to distribute load.
Redis Sentinel vs Redis Cluster:
| | Redis Sentinel | Redis Cluster |
|---|---|---|
| Purpose | High availability — automatic failover | Horizontal scale — sharding across nodes |
| Data distribution | All data on one primary (replicated) | Data split across multiple primaries |
| Use when | Dataset fits on one node, need HA | Dataset too large for one node OR write throughput exceeds single node |
| Failover | Sentinel promotes replica automatically (~seconds) | Built-in, per shard |
Limits:
- Built on sorted sets — same ~4.3 billion member limit per key
- Coordinate precision: ~0.6mm at equator (52-bit geohash internally)
- Radius query: O(n + log m) — fast up to millions of members
- For 10M drivers: split into multiple geo keys by city/region to keep each key manageable
Write optimization patterns for high-frequency location updates:
At 2M writes/sec you need to be deliberate about how updates reach Redis. Several techniques reduce pressure:
1. Multi-member GEOADD (batch in one command): Redis supports adding multiple members in a single GEOADD call. Instead of one network round trip per driver, batch several together:
GEOADD drivers
-74.006 40.713 "driver:42"
-73.991 40.722 "driver:99"
-74.012 40.708 "driver:17"
One round trip for N updates. Simple and always worth doing.
2. Redis pipelining: Send multiple commands to Redis in one network round trip without waiting for each response. The Location Service buffers updates for 50–100ms and flushes as a pipeline. Throughput improvement is proportional to the number of commands batched — 100 commands in one pipeline vs 100 individual round trips.
3. Write coalescing: If a driver sends 3 updates in 200ms, only the last position matters. Coalesce in the Location Service before writing — deduplicate by driver ID, keep the latest. Reduces Redis write volume without any staleness tradeoff (you always write the most recent position).
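Write coalescing falls out of a dict keyed by driver ID: later writes overwrite earlier ones, and a periodic flush emits one batch. A minimal sketch (illustrative class name; in production a background task would call `flush()` every 50–100ms and send the batch as a single pipelined multi-member GEOADD):

```python
class CoalescingWriter:
    """Sketch of write coalescing: buffer updates, keep only the latest
    position per driver, flush everything as one batch."""

    def __init__(self):
        self.buffer = {}  # driver_id -> (lon, lat); later writes overwrite

    def write(self, driver_id, lon, lat):
        self.buffer[driver_id] = (lon, lat)

    def flush(self):
        # one multi-member GEOADD / pipeline instead of N round trips
        batch = [(d, lon, lat) for d, (lon, lat) in self.buffer.items()]
        self.buffer.clear()
        return batch
```

Three updates for the same driver inside one flush window become one write, and the batch always carries the most recent position, so there is no staleness tradeoff.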
4. Kafka as write buffer:
Driver app → Location Service → Kafka topic "location-updates"
→ Location Consumer → Redis GEOADD (batched)
Kafka absorbs traffic spikes. The consumer reads in micro-batches (every 100ms), coalesces by driver ID, and flushes to Redis. Decouples ingestion rate from Redis write rate. Especially useful during rush hour spikes when update volume surges 5–10× baseline.
5. Region-based sharding: Split the geo set by city or geohash prefix:
GEOADD drivers:nyc ...
GEOADD drivers:la ...
GEOADD drivers:chi ...
Each shard lives on a different Redis node. Distributes write load and keeps each geo set small enough for fast queries. Proximity searches only need to hit the relevant city shard — a rider in NYC never queries drivers:la.
6. Client-side batching (last resort): Buffer updates on the driver's device and send every 2–3 seconds instead of every 1 second. Reduces network calls and server load but introduces positional staleness. Acceptable for ride-sharing (position accuracy of a few seconds is fine for matching) but not for turn-by-turn navigation.
In an interview: "I'd start with multi-member GEOADD and write coalescing on the Location Service — both are zero-latency tradeoffs. For the rush hour spike, I'd add Kafka as a write buffer between the Location Service and Redis. Region sharding handles the scale-out as we grow to more cities."
Real example: Tinder matches, Uber driver dispatch, Yelp restaurant search — all can use Redis Geo for "find X near me" queries.
Quadtree
Recursively divides 2D space into four quadrants. Each node subdivides until a cell has ≤ N points. Used when you need dynamic spatial indexing with fast insert/delete.
NFR triggers: Proximity queries with frequently updating locations (moving drivers, users).
How it works: Start with the whole world as the root. If a region has more than N points, split it into 4 quadrants. Recurse. To find nearby points, traverse down to the relevant quadrant.
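The subdivision logic above can be sketched as a minimal point quadtree (an illustrative class, not a production spatial index):

```python
class Quadtree:
    """Minimal point quadtree sketch: split a cell into 4 quadrants
    once it holds more than `capacity` points."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None  # [NW, NE, SW, SE] after split

    def _contains(self, x, y):
        x0, y0, x1, y1 = self.bounds
        return x0 <= x < x1 and y0 <= y < y1

    def insert(self, x, y):
        if not self._contains(x, y):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            Quadtree(x0, my, mx, y1, self.capacity),  # NW
            Quadtree(mx, my, x1, y1, self.capacity),  # NE
            Quadtree(x0, y0, mx, my, self.capacity),  # SW
            Quadtree(mx, y0, x1, my, self.capacity),  # SE
        ]
        for p in self.points:  # push existing points down
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        """Collect all points inside the query rectangle."""
        if found is None:
            found = []
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return found  # query box doesn't overlap this cell: prune
        found += [(x, y) for x, y in self.points
                  if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, found)
        return found
```

The prune step in `query` is where the speed comes from: entire subtrees whose cells don't overlap the query box are skipped without examining any points.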
vs Geohash: Quadtree adapts to density (dense city = smaller cells, sparse desert = large cells). Geohash uses fixed-size grid cells.
Real example: Google Maps, game engines for collision detection, location services with millions of moving entities.
PostGIS — PostgreSQL Geospatial Extension
If you're already on PostgreSQL and your location write volume is manageable (< 50K writes/sec), PostGIS gives you proper geospatial queries without running a separate Redis cluster.
PostGIS adds geometry/geography data types and spatial functions to PostgreSQL. Under the hood it uses GiST indexes (not B-tree) which are designed for multi-dimensional data and support true radius queries efficiently.
-- Add a PostGIS geography column (uses sphere math — accurate for real distances)
ALTER TABLE drivers ADD COLUMN location geography(POINT, 4326);
-- Update driver position
UPDATE drivers SET location = ST_SetSRID(ST_MakePoint(-74.006, 40.7128), 4326)::geography
WHERE driver_id = 42;
-- GiST index makes proximity queries fast
CREATE INDEX idx_drivers_location ON drivers USING GIST(location);
-- Find all drivers within 5km of a rider — uses GiST, not full table scan
SELECT driver_id, ST_Distance(location, ST_MakePoint(-73.99, 40.72)::geography) AS dist_m
FROM drivers
WHERE ST_DWithin(location, ST_MakePoint(-73.99, 40.72)::geography, 5000)
ORDER BY dist_m
LIMIT 20;
ST_DWithin is the key function — it uses the GiST index to eliminate most rows before calculating exact distances.
When PostGIS is enough vs when you need Redis Geo:
| | PostGIS | Redis Geo |
|---|---|---|
| Write throughput | ~5–50K writes/sec | Millions/sec |
| Query latency | 10–50ms | < 1ms |
| Durability | Full ACID | Configurable (RDB/AOF) |
| Operational cost | Zero (already on Postgres) | Separate Redis cluster |
| Use when | Moderate scale, need durability, location is part of complex query with joins | High-frequency updates (ride-sharing, IoT), need sub-ms proximity queries |
For Uber-scale (10M drivers, 2M writes/sec): Redis Geo. For Yelp-scale (restaurant search, low write frequency): PostGIS is sufficient and simpler.
Database Techniques
Patterns for scaling read throughput, managing connections under load, splitting data across nodes, and recovering from failures.
Database Indexing
An index is a sorted lookup structure that lets the DB jump directly to matching rows instead of scanning the whole table. Without one, every query is a full table scan — O(n). With one, it's O(log n). At 10 million rows, that's the difference between checking 20 index entries vs reading 2GB of data.
Index types:
| Type | How It Works | Best For |
|---|---|---|
| B-tree (default) | Balanced tree — supports equality, range, sort, prefix | General purpose — most queries |
| Hash | Hash of the key → row pointer | Exact-match lookups only (WHERE email = '...'). No ranges. |
| Full-text | Inverted index of words | Text search (LIKE '%word%' won't use B-tree) |
| GiST / GIN (Postgres) | Generalized for complex types | Geo queries, JSONB, arrays |
Compound indexes — column order matters:
-- Index on (status, created_at)
-- Helps: WHERE status = 'active'
-- Helps: WHERE status = 'active' AND created_at > '2024-01-01'
-- Doesn't help: WHERE created_at > '2024-01-01' (leading column missing)
Put the most selective column first. Queries filtering only on trailing columns can't use the index.
EXPLAIN to verify index use:
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users → no index, full table scan
CREATE INDEX idx_users_email ON users(email);
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
-- Index Scan using idx_users_email → O(log n) lookup
In an interview: When sketching your schema, state which columns you'd index. "I'd index user_id on the orders table since most queries filter by user, and add a compound index on (event_id, status) for the seat availability query." Under-indexing kills far more production systems than over-indexing. Index maintenance overhead on writes is real but rarely the bottleneck.
Denormalization and Materialized Views
Normalization splits data across tables to eliminate redundancy. This is clean for writes but expensive for reads — every query needs joins across multiple tables. Denormalization reverses this: store redundant data in one place to make reads a single lookup.
When to denormalize: Read/write ratio above ~10:1. The extra write complexity (update multiple places when a user changes their name) is justified by eliminating expensive joins at read time.
-- Normalized: three joins on every order page load
SELECT u.name, o.order_date, p.product_name, p.price
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
WHERE o.id = 12345;
-- Denormalized: single-table read from order_summary
SELECT user_name, order_date, product_name, price
FROM order_summary
WHERE order_id = 12345;
Tradeoff: if a user changes their name, you must update every row in order_summary that references them. Rare update + frequent read = worth it. Frequent update + rare read = don't denormalize.
Materialized views take this further by precomputing expensive aggregations and storing the results:
-- Instead of recomputing on every page load:
CREATE MATERIALIZED VIEW product_ratings AS
SELECT product_id, AVG(rating) as avg_rating, COUNT(*) as review_count
FROM reviews GROUP BY product_id;
-- Refresh periodically (background job, or on write via trigger)
REFRESH MATERIALIZED VIEW product_ratings;
Product rating pages now read one precomputed row instead of aggregating millions of reviews. Refresh frequency is your staleness/freshness knob. PostgreSQL supports REFRESH MATERIALIZED VIEW CONCURRENTLY — reads still work during refresh.
In an interview: Mention denormalization when you have a read-heavy query that requires joining multiple large tables. Mention materialized views for aggregation-heavy reads (dashboards, leaderboards, analytics summaries).
Read Replicas
Copies of the primary database that handle read queries. All writes go to the primary (leader), which replicates changes to the replicas (followers); reads distribute across the replicas.
NFR triggers: Read-heavy scale, high availability.
When to add replicas: A well-indexed single DB handles ~50,000–100,000 reads/sec on modern hardware. Above that threshold — or when read load is affecting write performance — add read replicas. This is a rough estimate; actual numbers depend on query complexity, data size, and hardware.
Synchronous vs asynchronous replication:
| | Synchronous | Asynchronous |
|---|---|---|
| How it works | Primary waits for replica to confirm before acking the write | Primary acks immediately, replication happens in background |
| Replication lag | Zero — replica always current | 10ms–seconds depending on network and load |
| Write latency | +5–20ms | No impact |
| Data loss on primary crash | Zero (replica has everything) | Seconds of writes lost (RPO = seconds) |
| Use when | Financial systems, anything where stale reads are unacceptable | Social feeds, content platforms where slight staleness is fine |
Replication lag and read-your-writes: With async replicas, a user who just posted something might not see their own post if their next read goes to a lagging replica. Fix: route a user's own reads to the primary for a short window after their writes (sticky routing). Route all other reads to replicas.
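Sticky routing is a small amount of bookkeeping: remember each user's last write time and pin their reads to the primary for a short window. A sketch (illustrative names; the clock is injected so the behaviour is testable):

```python
import random
import time

class StickyRouter:
    """Sketch of read-your-writes routing: after a user writes, send that
    user's reads to the primary for `sticky_window` seconds; everyone
    else reads from a replica."""

    def __init__(self, primary, replicas, sticky_window=5.0, clock=time.monotonic):
        self.primary = primary
        self.replicas = replicas
        self.sticky_window = sticky_window
        self.clock = clock
        self.last_write = {}  # user_id -> timestamp of last write

    def route_write(self, user_id):
        self.last_write[user_id] = self.clock()
        return self.primary

    def route_read(self, user_id):
        t = self.last_write.get(user_id)
        if t is not None and self.clock() - t < self.sticky_window:
            return self.primary          # replica may still be lagging
        return random.choice(self.replicas)
```

Size the window to comfortably exceed your observed replication lag (e.g. a 5s window for sub-second lag); in a multi-server deployment the `last_write` map would live in a shared store such as Redis, or travel in a session cookie.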
Replicas also provide redundancy: If the primary fails, promote a replica to become the new primary — minimizing downtime. This is the same failover mechanism described in What-If Scenario 2.
Limits:
- PostgreSQL: supports hundreds of replicas technically — practical limit 5–10 before replication overhead stresses the primary
- Replication lag: < 100ms same AZ (async), seconds cross-region
- Each replica is a full copy of the entire dataset — scales reads linearly, does nothing for write throughput or dataset size. If the dataset is too large for fast queries, add sharding or caching instead.
Real example: Instagram routes feed reads to replicas, profile writes to primary. Replica lag is acceptable for feed freshness.
CQRS (Command Query Responsibility Segregation)
Separate the write model (commands) from the read model (queries) entirely. Instead of one model that handles both, writes go to a normalized, consistent store and reads are served from a separate, denormalized store optimized purely for fast retrieval.
NFR triggers: Write-heavy systems where reads and writes compete for resources, high read latency on complex joins, or when the read shape differs significantly from the write shape.
How it works:
Client → Write API → Command Handler → Primary DB (normalized, consistent)
↓ event / async replication
Client → Read API → Query Handler → Read Store (denormalized, precomputed)
The read store can be a Redis cache, Elasticsearch index, a materialized view, or a read replica with a different schema — whatever makes reads fast. It is updated asynchronously after each write.
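The command/query split can be sketched in a few lines: the write side appends events, and a projector consumes them to build the denormalized read model (illustrative class names; a real system would run the projector as an async consumer rather than calling `catch_up()` inline):

```python
class OrderCommands:
    """CQRS write side sketch: a normalized store plus an event log
    that the read side consumes asynchronously."""

    def __init__(self):
        self.orders = {}   # normalized write model
        self.events = []   # change log for the read side

    def place_order(self, order_id, user, items):
        self.orders[order_id] = {"user": user, "items": items}
        self.events.append(("order_placed", order_id))


class OrderSummaryProjector:
    """Read side: projects events into a denormalized summary store."""

    def __init__(self, commands):
        self.commands = commands
        self.summaries = {}  # order_id -> flat read model
        self.offset = 0      # position in the event log

    def catch_up(self):
        for kind, order_id in self.commands.events[self.offset:]:
            if kind == "order_placed":
                o = self.commands.orders[order_id]
                self.summaries[order_id] = {
                    "user": o["user"],
                    "item_count": len(o["items"]),
                }
        self.offset = len(self.commands.events)
```

The gap between a write landing and `catch_up()` running is exactly the eventual consistency window the "When NOT to use" section warns about.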
Relation to what you've already seen:
- Read replicas separate read and write traffic at the DB layer — same schema, different nodes
- Denormalization separates read and write data shape in one DB — one model optimized for reads
- CQRS separates both: different models, different stores, different APIs end-to-end
When to use:
- Write-heavy systems where read queries are slowing down writes (contention on the same tables)
- Read shape is very different from write shape — e.g., you write normalized orders but serve a precomputed order-summary view
- You need to scale reads and writes independently
When NOT to use:
- Simple CRUD apps — the added complexity rarely pays off below ~10M DAU
- Strong consistency required across read and write — CQRS read stores are eventually consistent
Real example: Twitter writes tweets to a primary DB, then fans out asynchronously to per-user feed caches (Redis). The write model is normalized; the read model is a precomputed list. Classic CQRS.
Database Proxy / Connection Pooler
Sits between your app and the database. Maintains a small pool of real DB connections and multiplexes many app connections through them. No application code changes — just point the connection string at the proxy instead of the DB directly.
NFR triggers: Scale (connection exhaustion), fault tolerance (queues requests during DB failover).
How it works:
500 app threads → Proxy → 20–100 real connections → Database
App thinks it's talking to the DB directly. The proxy manages the real connection pool. During DB primary failover, the proxy queues incoming requests for a few seconds then reconnects to the promoted replica — the app never knows the primary changed.
Pooling modes:
- Transaction pooling (most common) — connection returned to pool after each transaction. Most efficient. 500 app threads → 20 real DB connections.
- Session pooling — one real connection per client session. Less efficient but compatible with all DB features.
Tools by database:
| Database | Tool | Notes |
|---|---|---|
| PostgreSQL | PgBouncer | Lightweight, connection pooling only. Most common. |
| PostgreSQL | Pgpool-II | Heavier — adds load balancing and failover on top of pooling. |
| MySQL / MariaDB | ProxySQL | Connection pooling + query routing + read/write splitting. |
| MySQL / MariaDB | MySQL Router | Official MySQL proxy. Part of InnoDB Cluster. Handles failover. |
| MySQL at scale | Vitess | Built by YouTube. Adds horizontal sharding on top of pooling. CNCF project. |
| Any (AWS managed) | RDS Proxy | Supports PostgreSQL, MySQL, MariaDB, SQL Server. Fully managed, no ops overhead. |
Limits:
- Reduces DB connections from thousands to tens — prevents connection exhaustion at scale
- Adds ~0.1–0.5ms overhead per query
- During failover: queues requests for ~5–10 seconds then reconnects transparently
What about smart DB clients?
Modern clients can also detect primary failure and reconnect — but with caveats:
- Aurora cluster endpoint — Aurora updates the endpoint to the new writer automatically. Client reconnects to the same URL. Cleanest solution, no proxy needed for failover.
- MySQL Connector/J — provide multiple hosts, client tries each on failure. Also supports automatic read/write splitting.
- PostgreSQL clients (JDBC, psycopg2) — support multiple hosts, but failure detection relies on the connection timeout (30s by default unless tuned), which is slower than a proxy (~5s).
Client failover handles the routing problem but not the connection count problem — 10 app servers × 50 client-side connections each = 500 DB connections regardless. A proxy collapses that to 20. Both are often used together: client for failover awareness, proxy for connection pooling.
Real example: Almost every high-scale SQL deployment uses a connection pooler. Without it, 500 app servers × 20 threads = 10,000 DB connections — most SQL databases struggle above ~500–1,000 concurrent connections.
Database Sharding
Splits data across multiple DB nodes by a shard key. Each node owns a subset of the data.
NFR triggers: Write-heavy scale beyond what a single DB can handle (~5K–10K writes/sec for PostgreSQL).
Shard key choice matters:
- By user ID: Common. All data for a user lives on one shard. Good for user-scoped queries. Risk: hot shards if some users are far more active.
- By geography: US users on US shard, EU on EU shard. Good for data residency compliance.
- By hash: Distributes evenly but makes range queries hard.
Consistent hashing: Maps both data keys and nodes onto a hash ring. When a node is added/removed, only the adjacent keys are remapped — not the entire dataset. Minimizes resharding cost.
Cross-shard queries are expensive. Design your shard key so most queries hit one shard. Avoid joins across shards.
Limits & when to shard:
- Single PostgreSQL/MySQL: ~5K–10K writes/sec and ~1–2TB data before sharding becomes necessary
- Each shard is a full independent DB — write throughput scales linearly with shard count
- Resharding (changing shard count) is painful — consistent hashing minimizes but doesn't eliminate this
Real example: Discord shards message data by channel ID. All messages in a channel live on one shard.
Time-Series Database
Optimized for append-only writes of timestamped data. Efficient range queries and aggregations over time windows.
NFR triggers: Metrics, monitoring, IoT sensor data, financial tick data.
Why not regular DB: Time-series data is always appended (never updated), queried by time range, and often aggregated (avg, max, sum over 5-minute windows). Regular DBs aren't optimized for this pattern.
Real example: Prometheus stores metrics as time-series. Grafana queries it for dashboards. InfluxDB for IoT sensor readings.
Bloom Filter
Probabilistic data structure that answers "is this item in the set?" in O(1) with zero false negatives and a small false positive rate.
NFR triggers: Avoid expensive DB lookups for items that almost certainly don't exist.
How it works: Uses multiple hash functions and a bit array. add(x) sets several bits. contains(x) checks those bits — if any are 0, x is definitely not in the set. If all are 1, x is probably in the set (small chance of false positive).
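The mechanics above fit in a short sketch. This version derives its k positions by double hashing a single SHA-256 digest, a common implementation trick (the class is illustrative, not a specific library's API):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash positions over an m-bit array.
    contains() may return false positives, never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # double hashing: k positions derived from one digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        # any unset bit -> definitely absent; all set -> probably present
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note that `add` only ever sets bits, which is also why plain Bloom filters cannot delete: clearing a bit might break membership for other items that share it.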
Limits & sizing:
- 10 bits per element → ~1% false positive rate (rule of thumb)
- 1 billion elements at 1% FPR: ~1.2GB RAM
- 1 billion elements at 0.1% FPR: ~1.8GB RAM
- Cannot delete elements — use a Counting Bloom Filter if deletes are needed
- Zero false negatives — if it says "not in set", it's definitely not there
Real example: Google Chrome uses a Bloom filter to check if a URL is in the safe browsing list before hitting the network. Cassandra uses one to skip SSTable lookups for keys that don't exist.
Search
Strategies for full-text search — from staying in your existing database to running a dedicated search engine.
PostgreSQL Full-Text Search
Built into PostgreSQL with no extra infrastructure. Uses tsvector (a preprocessed document representation) and tsquery (a search query), indexed with GIN for fast lookup.
How it works:
-- Add a tsvector column (or use an expression index)
ALTER TABLE events ADD COLUMN search_vector tsvector;
-- Populate it (to_tsvector handles tokenization, stemming, stop words)
UPDATE events SET search_vector =
to_tsvector('english', coalesce(name, '') || ' ' || coalesce(description, ''));
-- GIN index makes queries O(log n) instead of full table scan
CREATE INDEX idx_events_search ON events USING GIN(search_vector);
-- Query: much faster than LIKE '%Taylor%'
SELECT * FROM events WHERE search_vector @@ to_tsquery('english', 'Taylor');
vs LIKE: LIKE '%Taylor%' forces a full table scan — O(n). tsvector + GIN is O(log n). At 10M rows, that's the difference between seconds and milliseconds.
Limitations:
- No fuzzy search (typo tolerance) — `Tayler` won't match `Taylor`
- Relevance ranking is basic — no machine learning, no popularity signals
- Slower to update than Elasticsearch — index must be rebuilt or kept in sync on writes
- Comfortable up to ~10M documents; degrades at higher volumes
When it's enough: Event search for moderately sized datasets where exact-word matching is acceptable and you don't want to run separate infrastructure.
Elasticsearch
A distributed search engine built on Apache Lucene. Excels at full-text search, fuzzy matching, faceted filtering, and aggregations at massive scale.
Core concept — inverted index:
Instead of storing documents and scanning them, Elasticsearch builds a map of every word → which documents contain it:
"Taylor" → [event:123, event:456, event:789]
"Swift" → [event:123, event:890]
"Concert" → [event:123, event:456, event:201]
A search for "Taylor Swift Concert" intersects these posting lists in microseconds — no scanning required. This is fundamentally different from a B-tree index, which works on column values, not free text.
Fuzzy search: Elasticsearch uses edit distance (Levenshtein) to match near-spellings. Tayler → matches Taylor because they differ by one character. Configurable: fuzziness: 1 allows one edit, fuzziness: 2 allows two. SQL databases cannot do this without custom extensions.
Built-in caching:
- Shard-level filter cache — caches results of filter queries (e.g. `status = 'active'`, date ranges). Reused across different queries that share the same filter.
- Request cache — caches full search responses at the shard level for aggregation-heavy queries. Particularly valuable for dashboards and analytics. Automatically invalidated when the underlying shard data changes.
These reduce query processing load on the cluster for repeated or similar queries — you get this without writing any caching code yourself.
Syncing PostgreSQL → Elasticsearch:
| Approach | How | Risk |
|---|---|---|
| Dual write | App writes to DB and ES in the same request | Partial failure — DB succeeds, ES fails → silent divergence. Not recommended. |
| Scheduled batch | Cron job queries DB for changes, bulk-indexes to ES | High latency (minutes), misses hard deletes cleanly |
| CDC (Debezium) | Reads PostgreSQL WAL, publishes change events, ES consumer indexes them | Best choice. Captures ALL changes — including from DB migrations, admin tools, and other services outside your app code. Near real-time (~seconds lag). |
| CDC + Kafka | WAL → Debezium → Kafka → ES consumer | Adds durability and replay capability. ES consumer can fall behind and catch up. Worth adding if Kafka is already in your stack. |
CDC is the best default because it's the only approach that captures changes originating outside your application code. If a DB migration bulk-updates event descriptions, CDC picks it up; dual-write and batch sync depend on your app being the only writer.
Search result caching on top of Elasticsearch:
For frequently repeated queries (e.g. search:keyword=Taylor+Swift&date=2024), cache results in Redis:
key: "search:keyword=Taylor Swift&start=2024-01-01&end=2024-12-31"
value: [event1, event2, event3]
TTL: 86400 (24 hours)
Build the cache key from all query parameters so different queries get different cache entries. Invalidation is the hard part — search result sets don't map neatly to a single entity, so tag-based invalidation or a moderate TTL (hours, not days) is the pragmatic choice.
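Building the key deterministically matters: the same logical query must always produce the same string regardless of parameter order. A minimal sketch (the function name is illustrative):

```python
def search_cache_key(params):
    """Build a deterministic cache key from query parameters: sorting
    the keys makes {'a': 1, 'b': 2} and {'b': 2, 'a': 1} map to the
    same cache entry."""
    parts = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return f"search:{parts}"
```

In practice you would also normalize values (lowercase the keyword, canonicalize date formats) before building the key, so trivially different spellings of the same query share one entry.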
CDN for non-personalized search: If search results are the same for all users (public event search, no personalization), cache API responses at the CDN edge. A Tokyo user gets results from a Tokyo edge node, not your Virginia origin. Only valid for non-personalized queries — personal recommendations or saved searches should never be CDN-cached.
Operational cost: Elasticsearch runs on JVM, requires a cluster (3+ nodes recommended), and has significant operational overhead. Worth it for complex search, faceted filtering, and log analytics at scale. For simpler use cases, PostgreSQL FTS or Typesense/Meilisearch are easier to run.
Reliability Patterns
Components and strategies that keep the system running — or recovering fast — when individual parts fail.
Circuit Breaker
Stops calling a failing downstream service and returns immediately, giving it time to recover instead of piling on more load.
NFR triggers: Fault tolerance, high availability.
States:
- Closed (normal) — requests pass through. Failures are counted.
- Open (tripped) — too many failures. Skip the call entirely, return fallback immediately (< 1ms). Wait N seconds before probing.
- Half-open (probe) — let one request through. Success → close. Failure → reopen.
Why it matters — cascade failure without circuit breaker:
Recommendation Service slow
→ Homepage threads all block waiting (30s timeout each)
→ Thread pool exhausts
→ Homepage Service crashes
→ Gateway times out waiting for Homepage
→ Total outage — caused by a non-critical service being slow
With circuit breaker: trips after N failures → fast-fail in < 1ms → fallback to generic recommendations → Homepage stays healthy.
What the user sees:
- Without circuit breaker: spinner for 30 seconds → 500 error. DB never recovers because load keeps hammering it.
- With circuit breaker + no fallback: fast 503 in milliseconds. DB recovers in ~30s.
- With circuit breaker + fallback (L1 cache, stale data): user sees slightly stale content. Nobody notices.
Implementation Option 1 — Application code (Resilience4j, pybreaker, gobreaker):
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open 30 seconds
    .slidingWindowSize(10)                             // measure last 10 calls
    .build();
CircuitBreaker breaker = CircuitBreaker.of("db", config);

User user = breaker.executeSupplier(() ->
    database.query("SELECT * FROM users WHERE id = ?", userId)
);
Advantage: business-logic aware. Can circuit break only writes but not reads. Can return a specific fallback value (L1 cache, default response) instead of an error.
Implementation Option 2 — Service mesh (Istio + Envoy), no application code:
outlierDetection:
consecutiveErrors: 5 # trip after 5 consecutive failures
interval: 30s # measured over 30 seconds
baseEjectionTime: 30s # eject the host for 30 seconds
Sidecar proxy (Envoy) is injected next to every service and intercepts all traffic. App code is completely unaware. Coarser — trips on connection failures and timeouts, but can't return a business-level fallback.
Implementation Option 3 — Infrastructure layer:
| Option | What It Does | Limitation |
|---|---|---|
| API Gateway (Kong, AWS API GW) | Circuit breaks external traffic at gateway | Only covers external calls, not internal service-to-service |
| Nginx / HAProxy upstream | Stops routing to dead backends | Detects dead hosts, not slow/struggling ones |
| RDS Proxy (AWS) | Pools DB connections, queues during failover | Solves connection exhaustion, not a full circuit breaker |
Best practice: Use both — service mesh for service-to-service protection, application-level for nuanced fallbacks ("return L1 cache when DB circuit is open").
Real example: Netflix Hystrix (now Resilience4j) wraps every downstream call. If the recommendations service goes down, the circuit opens and Netflix shows a generic list instead of erroring.
Rate Limiter
Controls how many requests a client can make in a time window. Protects the system from abuse and overload.
NFR triggers: Security, fault tolerance, scale.
Algorithms:
| Algorithm | How It Works | Characteristic |
|---|---|---|
| Token Bucket | Bucket refills at fixed rate. Each request consumes a token. | Allows bursts up to bucket size. Smooth average rate. |
| Leaky Bucket | Requests enter a queue (bucket). Processed at fixed rate. Overflow dropped. | Strict constant output rate. No bursts allowed. |
| Fixed Window | Count requests per time window (e.g. per minute). Reset at window boundary. | Simple. Edge case: 2× traffic at window boundaries. |
| Sliding Window | Rolling count over last N seconds. Smooths boundary problem. | Accurate. Slightly more memory. |
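Token bucket is the one most worth being able to sketch: capacity caps the burst, refill rate sets the sustained average. A minimal version with an injected clock so it can be tested deterministically (illustrative class, not a specific library):

```python
class TokenBucket:
    """Token bucket sketch: `capacity` allows bursts, `refill_rate`
    sets the sustained average rate (tokens per second)."""

    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = float(capacity)   # start full: bursts allowed immediately
        self.last = clock()

    def allow(self):
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill on each call (rather than a background timer) is the standard trick: the bucket only needs to know how much time passed since it was last asked.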
Where to implement: API Gateway (per user/IP before reaching app servers).
Redis implementation (Fixed Window): INCR is atomic — two clients can't interleave. This is the canonical Redis rate limit pattern:
INCR user:42:requests            // atomic increment — returns the new count
EXPIRE user:42:requests 60       // set the 60s TTL only when INCR returned 1 (window start)
// if count > limit → reject request
For sliding window, use a Redis Sorted Set: store each request as a member with timestamp as score, then ZREMRANGEBYSCORE to drop entries outside the window, ZCARD to count remaining.
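The sorted-set pattern translates directly to a per-user deque of timestamps: expire entries older than the window (ZREMRANGEBYSCORE), count what remains (ZCARD), reject if over the limit. An in-memory sketch of the same logic (not a Redis client; single-process only):

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory sketch of the Redis sorted-set sliding window."""

    def __init__(self, limit, window_s, clock):
        self.limit = limit
        self.window = window_s
        self.clock = clock
        self.hits = defaultdict(deque)  # user -> timestamps, oldest first

    def allow(self, user):
        now = self.clock()
        q = self.hits[user]
        while q and q[0] <= now - self.window:  # ZREMRANGEBYSCORE equivalent
            q.popleft()
        if len(q) >= self.limit:                # ZCARD equivalent
            return False
        q.append(now)                           # ZADD equivalent
        return True
```

Memory is the tradeoff the algorithms table mentions: one entry per request in the window, versus a single counter for fixed window.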
Real example: Twitter API limits 300 tweets per 3 hours per user. Stripe rate-limits API keys by requests/second.
Load Shedding
Intentionally rejects low-priority requests when the system is overloaded so high-priority work never fails.
Different from rate limiting, which caps per-client volume. Load shedding is a global capacity decision: the system is full, non-critical work goes first.
NFR triggers: Fault tolerance, availability, scale.
The priority ladder — define this before peak traffic, not during it:
| Priority | Examples | Policy |
|---|---|---|
| P0 — never shed | Checkout, payment, OTP, auth | Always fast-path — never touch |
| P1 — shed last | Inventory check, search, cart | Queue; shed only under extreme load |
| P2 — shed first | Recommendations, personalization, analytics | Drop immediately on overload |
Overload detection — use both signals together:
- Queue depth exceeds 80% of max capacity
- p99 latency exceeds SLA threshold (e.g. 2000ms)
At 80% queue depth you still have headroom. Waiting for 100% means you're already in trouble.
Bounded request queue — hold overflow before shedding: Rather than immediately 503ing when threads are busy, queue the request briefly with a TTL:
Request arrives → thread free? → process immediately
→ thread busy → queue (max 1000, TTL 30s, ordered by priority)
→ queue full → 503 + Retry-After
Priority queue (min-heap) ordered by P0/P1/P2, then arrival time. When a thread frees, it always picks the highest-priority waiting request. Requests that expire in queue are dropped — processing a 30-second-old request returns a response to nobody.
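The bounded priority queue above can be sketched with a min-heap keyed by (priority, arrival order), dropping TTL-expired entries at poll time (illustrative class; a clock is injected for testability):

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Sketch of the bounded request queue: min-heap ordered by
    (priority, arrival); entries past their TTL are dropped on poll."""

    def __init__(self, max_size, ttl_s, clock):
        self.max_size = max_size
        self.ttl = ttl_s
        self.clock = clock
        self.heap = []
        self.seq = itertools.count()  # FIFO tiebreak within one priority

    def offer(self, priority, request):
        if len(self.heap) >= self.max_size:
            return False  # caller responds 503 + Retry-After
        heapq.heappush(self.heap,
                       (priority, next(self.seq), self.clock(), request))
        return True

    def poll(self):
        while self.heap:
            priority, _, enqueued_at, request = heapq.heappop(self.heap)
            if self.clock() - enqueued_at <= self.ttl:
                return request
            # expired in queue: nobody is waiting for this response anymore
        return None
```

Dropping expired entries lazily at poll time keeps `offer` O(log n) and avoids a separate sweeper; the cost is that the queue can briefly hold dead entries.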
P2 fallback — don't queue, return a default instantly:
if (isOverloaded()) {
return Collections.emptyList(); // UI shows nothing, no crash
}
return recommendationService.fetch(userId);
This frees CPU immediately — the request is never processed at all.
Always include Retry-After on every 503:
HTTP 503 Service Unavailable
Retry-After: 10
Without it, every client retries immediately → thundering herd → traffic spike at exactly the wrong moment. With it, clients stagger retries across 10 seconds and load spreads out.
How load shedding, virtual queue, and circuit breaker compose:
Incoming request
→ Circuit breaker: is downstream healthy? NO → fast-fail (don't even try)
→ Load shedding: are we overloaded? P2 request? → 503 immediately
→ Virtual queue: thread free? YES → process. NO → queue → drain → process
Each layer handles a different failure mode. Circuit breaker: downstream is broken. Load shedding: we're at capacity, cut by priority. Virtual queue: burst is temporary, buffer and drain.
Real example: Black Friday at any major e-commerce company — recommendations and personalization are the first features disabled. Checkout and payment are the last to be touched, typically never.
Idempotency Keys
Ensures a request produces the same result no matter how many times it's sent. Critical for retries.
NFR triggers: Durability, consistency, payments.
How it works: Client generates a unique key (UUID) and sends it with the request. Server stores (idempotency_key → result). On duplicate request, returns stored result instead of re-executing.
Why it matters: Networks fail. Clients retry. Without idempotency, a payment retry = double charge.
Real example: Stripe requires an Idempotency-Key header on all payment requests. POST to /charges twice with the same key returns the same charge object, not two charges.
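The store-and-replay mechanic can be sketched in a few lines. Real systems keep the `(key → result)` mapping in a durable shared store (DB or Redis) with a TTL, not an in-process map as here:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch of server-side idempotency: execute once per key,
// return the stored result on any duplicate.
public class IdempotencyStore {
    private final ConcurrentHashMap<String, Object> results = new ConcurrentHashMap<>();

    /** Runs the operation at most once per idempotency key. */
    public Object execute(String idempotencyKey, Supplier<Object> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

A retried payment with the same key hits the cached result instead of charging twice.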
Consistent Hashing
Distributes data across nodes such that adding or removing a node only remaps a small fraction of keys.
NFR triggers: Scale (sharding, distributed cache), fault tolerance.
How it works: Nodes and keys are both hashed onto a circular ring (0 to 2³² − 1). A key is assigned to the first node clockwise from its position. When a node is added, it takes keys only from its clockwise neighbor. When removed, its keys move to the next node clockwise.
Virtual nodes: Each physical node has multiple positions on the ring. Distributes load more evenly and handles heterogeneous hardware.
Limits & behaviour:
- Adding a node: remaps only ~1/n of keys (n = total nodes) — compared to 100% remapping with naive modulo hashing
- Virtual nodes: 100–200 per physical node is typical for even distribution
- Overhead: O(log n) lookup on the ring — negligible
Real example: Cassandra and DynamoDB use consistent hashing to partition data across nodes. Memcached clients use it to distribute cache keys.
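A ring with virtual nodes fits naturally on a sorted map — `ceilingEntry` is the "first node clockwise" lookup. A sketch (the hash function here is a toy; real implementations use MD5, MurmurHash, or similar):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a consistent-hash ring with virtual nodes.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position -> physical node
    private final int vnodes;

    public HashRing(int vnodes) { this.vnodes = vnodes; }

    private long hash(String s) {
        long h = 1125899906842597L; // toy string hash, deterministic for the sketch
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }

    public void addNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.remove(hash(node + "#" + i));
    }

    /** First node clockwise from the key's position; wrap around to the first entry. */
    public String nodeFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

Removing a node only affects keys whose ceiling entry belonged to it — everything else keeps its assignment, which is the whole point.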
API Security
How clients prove who they are, how tokens are issued and stored, and how requests are protected from tampering and replay. Covers user-facing JWT auth and server-to-server HMAC signing.
JWT Auth — The Login Flow
The complete sequence from submitting credentials to making authenticated requests.
Step 1 — User submits credentials:
POST /auth/login
Body: { "email": "user@example.com", "password": "secret" }
Step 2 — Server validates and issues tokens:
- Look up user by email in the DB
- Hash the submitted password and compare to stored hash (bcrypt)
- If valid, create two tokens — an access token and a refresh token (see Two-Token Model below for what each is and why two)
- Return both tokens to the client
Response: 200 OK
{
"access_token": "eyJhbGci...", ← JWT, short-lived (15 min)
"refresh_token": "d8f3a9bc..." ← opaque string, long-lived (30–90 days)
}
Step 3 — Client stores the tokens:
| Platform | Access Token | Refresh Token |
|---|---|---|
| Web (browser) | Memory (JS variable) | HttpOnly cookie — browser sends it automatically, JS cannot read it |
| iOS | Memory | Keychain (hardware-backed encryption) |
| Android | Memory | Android Keystore |
Never store tokens in localStorage — any XSS script on the page can read it.
- localStorage is the browser's built-in key-value store (~5 MB per site, persists across tabs and browser restarts). Any JS running on the page has full read access — no restrictions.
- XSS (Cross-Site Scripting) is an attack where malicious JavaScript gets injected into your page — through a compromised third-party script, a comment field that wasn't sanitised, or a browser extension. That script can call localStorage.getItem("access_token") and send the token to an attacker's server. Storing tokens in memory (a JS variable) instead means they disappear when the tab closes and are not accessible outside your own code.
Step 4 — Client makes authenticated requests:
GET /api/orders
Authorization: Bearer eyJhbGci... ← access token in header
The server validates the JWT signature on every request — no DB call needed. If valid, the request goes through. If expired, the client gets a 401 and silently refreshes (see Two-Token Model below).
JWT Structure — What's Inside the Token
A JWT is three base64url-encoded parts separated by dots: header.payload.signature
Header — algorithm and token type:
{
"alg": "HS256",
"typ": "JWT"
}
Payload — claims about the user (not secret — anyone can decode this):
{
"sub": "user_123", ← subject: user ID
"email": "user@example.com",
"roles": ["user"],
"iat": 1705329751, ← issued at (Unix timestamp)
"exp": 1705330651 ← expires at (15 min later)
}
Signature — proves the token hasn't been tampered with:
HMAC_SHA256(base64(header) + "." + base64(payload), SERVER_SECRET_KEY)
The server re-runs the signature calculation on every request. If the payload was modified, the signature won't match — request rejected. The server never needs to look up the DB to validate; the signature is the proof.
What to put in the payload: User ID, email, roles. Nothing sensitive (passwords, card numbers) — the payload is just base64, not encrypted.
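The sign-and-verify cycle above is a short piece of code. A sketch of HS256 using the JDK's HMAC primitives — the JSON strings and secret are illustrative, and a production verifier would also check `exp` and use a constant-time comparison (`MessageDigest.isEqual`):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of HS256 JWT signing and verification — the same recomputation
// the server runs on every request.
public class JwtHs256 {
    static String b64url(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    static byte[] hmac(String input, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(input.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String sign(String headerJson, String payloadJson, byte[] secret) {
        String signingInput = b64url(headerJson.getBytes(StandardCharsets.UTF_8)) + "."
                            + b64url(payloadJson.getBytes(StandardCharsets.UTF_8));
        return signingInput + "." + b64url(hmac(signingInput, secret));
    }

    /** Recompute the signature over header.payload — a tampered payload fails the check. */
    static boolean verify(String jwt, byte[] secret) {
        int lastDot = jwt.lastIndexOf('.');
        String expected = b64url(hmac(jwt.substring(0, lastDot), secret));
        return expected.equals(jwt.substring(lastDot + 1));
    }
}
```

Note the payload is only encoded, never encrypted — anyone can decode it, which is why nothing sensitive belongs there.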
Two-Token Model
The two-token model is not an alternative to JWT — it uses JWT. The access token is a JWT (structured, signed, self-verifying). The refresh token is a plain opaque string stored in the DB. Together they solve the tension between stateless speed and the ability to log users out.
Why this model exists — the history:
The original approach was session-based auth (one token). Server issues a session ID after login, stores the full session in DB/Redis, client stores the session ID in a cookie. Every request the server looks up the session ID in the DB to find the user.
POST /auth/login → server creates session in DB → returns session_id cookie
GET /api/orders → server looks up session_id in DB → gets user → processes request
Simple and fully revocable — to log someone out, just delete the session row. But at scale (millions of users), every API call hits the DB. That becomes a bottleneck.
JWT solved the DB lookup — validation is stateless (just verify the signature). But that created a new problem: you can't revoke a JWT. If a user gets compromised and you set expiry to 7 days, that's a 7-day window of exposure with no way to cut it short.
The two-token model is the compromise:
| Model | Revocable | DB on every request | Exposure if stolen |
|---|---|---|---|
| Session-based (one token) | Yes — delete the row | Yes | Until revoked |
| Single long-lived JWT | No | No | Until expiry (could be days) |
| Two-token (JWT + refresh) | Yes (via refresh token) | No | 15 min max |
One token for speed, one for control.
- Access token — short-lived (15 min), sent with every API request, validated stateless by the server
- Refresh token — long-lived (30–90 days), sent only to /auth/refresh, stored in DB so it can be revoked
Why two tokens: A stolen access token expires in 15 minutes. The refresh token never goes to the resource API — it only touches the auth endpoint, which can check the DB and revoke it if needed.
Refresh is silent — the user never notices:
After 15 min the client automatically calls /auth/refresh in the background. The refresh token is in an HttpOnly cookie so the browser sends it automatically — no user action needed. The user keeps browsing without any interruption.
The user only has to log in again when:
- The refresh token itself expires (30–90 days) — the "remember me" window
- The user explicitly logs out (refresh token revoked in DB)
- An admin revokes the session (compromise response)
- The user clears their cookies manually
On a site you use daily, you might go months without seeing a login screen.
Refresh flow:
[Access token expires — user doesn't see this]
→ GET /api/orders with expired access token
← 401 Unauthorized
→ POST /auth/refresh
Cookie: refresh_token=d8f3a9bc... (sent automatically by browser)
← 200 OK: { "access_token": "eyJhbGci..." } (+ new refresh token)
→ Retry GET /api/orders with new access token
← 200 OK — user gets their data, never knew anything happened
Token rotation: Issue a new refresh token on every refresh and immediately invalidate the old one. If an attacker steals a refresh token and tries to reuse it after the legitimate user already refreshed, the DB detects the reuse and can revoke the entire session.
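Rotation with reuse detection can be sketched against an in-memory map standing in for the DB (class and method names are illustrative; a real store would also persist expiry timestamps):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Sketch of refresh-token rotation with reuse detection.
public class RefreshTokens {
    private final Map<String, String> active = new HashMap<>();  // token -> session id
    private final Map<String, String> retired = new HashMap<>(); // rotated-out tokens

    public String issue(String sessionId) {
        String token = UUID.randomUUID().toString();
        active.put(token, sessionId);
        return token;
    }

    /** Rotate on every refresh; seeing a retired token again means theft. */
    public Optional<String> refresh(String token) {
        String session = active.remove(token);
        if (session == null) {
            String stolenSession = retired.get(token);
            if (stolenSession != null) revokeSession(stolenSession); // reuse detected
            return Optional.empty();
        }
        retired.put(token, session);          // old token can never be used again
        return Optional.of(issue(session));   // hand back a fresh one
    }

    private void revokeSession(String sessionId) {
        active.values().removeIf(sessionId::equals); // kill every token in the session
    }
}
```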
Interview tip: Access token = stateless speed. Refresh token = revocable control. The split is the whole point.
HMAC Request Signing
Proves a request came from a legitimate server or merchant and hasn't been tampered with in transit. Used for server-to-server calls and webhooks (Stripe, Twilio, GitHub). Not for browser users — that's JWT.
NFR triggers: Security, B2B API authentication.
How it works:
At signup, generate two keys per merchant:
- API Key (pk_live_123) — public identifier, sent with every request
- Secret Key (sk_live_abc) — private, never sent over the wire, used only for signing
Per request, the caller builds a canonical string and signs it:
Canonical string = method + path + body + timestamp + nonce
Signature = HMAC_SHA256(secretKey, canonicalString)
Headers sent:
Authorization: Bearer pk_live_123
X-Request-Timestamp: 2024-01-15T14:22:31Z
X-Request-Nonce: a1b2c3
X-Signature: sha256=7f83b1657ff1fc53...
Server verifies: look up secret key via API key → recompute signature → compare → check timestamp freshness → validate nonce hasn't been used before.
API Key = who you are. Signature = proof you're legit. Timestamp + Nonce = proof it's fresh.
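The canonical-string-plus-HMAC step looks like this in code. A sketch: the newline delimiter between fields is an assumption (the source just concatenates), but whatever delimiter you choose must match byte-for-byte on caller and server:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch of building the canonical string and computing the signature header.
public class RequestSigner {
    static String sign(String method, String path, String body,
                       String timestamp, String nonce, String secretKey) {
        // Field order and delimiters must be identical on both sides.
        String canonical = String.join("\n", method, path, body, timestamp, nonce);
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : mac.doFinal(canonical.getBytes(StandardCharsets.UTF_8)))
                hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The server looks up the secret via the API key, reruns `sign` over the received request, and compares — any change to method, path, body, timestamp, or nonce produces a different signature.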
Nonce
A random value used exactly once per request to prevent replay attacks — an attacker who captures a valid signed request cannot replay it later.
How it works: Caller generates a UUID per request, includes it in the canonical string and headers. Server stores seen nonces short-term (5–10 minutes, matching the timestamp window). If a nonce appears twice, the second request is rejected — even within a valid timestamp window.
Nonce vs Idempotency Key:
| | Nonce | Idempotency Key |
|---|---|---|
| Purpose | Security — reject replayed requests | Correctness — safe retries, no double charge |
| On duplicate | Request rejected | Cached result returned |
| Stored for | 5–15 minutes | 12–24 hours |
| Used by | API security layer | Business logic layer |
Stripe relies on HTTPS to prevent replays and uses idempotency keys for correctness — no nonce. Systems with higher security requirements add signing + nonce on top of HTTPS for defence-in-depth.
HSM (Hardware Security Module)
Tamper-resistant hardware that generates, stores, and uses cryptographic keys. Private keys never leave the device in plaintext — even a full server compromise can't extract them.
NFR triggers: Security (PCI-DSS compliance), encryption key management for payment platforms.
How it works in a payment flow:
- Browser encrypts card data using your public key (inside a secure iframe served from your domain)
- Encrypted payload sent over HTTPS — plaintext card data never exists outside the browser
- Backend passes ciphertext to HSM
- HSM decrypts internally — plaintext returned to backend only
- Backend uses it briefly for processing, then discards
- Your backend never sees the private key
Why iframe isolation matters: Browser same-origin policy isolates the iframe (served from pay.platform.com) from the merchant's page. Merchant JS cannot read what the customer types. Card data never touches merchant servers.
What to store in HSM: Card encryption private key, Root Key / Key Encryption Key (KEK), tokenization keys. What NOT to store: Merchant API secrets, TLS certificates — lower trust tier, managed separately.
Cloud options: AWS CloudHSM, AWS KMS (managed), Google Cloud HSM, Azure Key Vault HSM.
Breach comparison:
- Without HSM: attacker extracts private key → decrypts all historical card data
- With HSM: key is unextractable → attacker may intercept active requests but cannot decrypt historical data
PCI-DSS requires or strongly recommends HSMs for payment platforms. Look for FIPS 140-2 / 140-3 certification.
App Attestation (Mobile Only)
Proves to your backend that the request is coming from a genuine, unmodified version of your app — not a script, emulator, or repackaged APK. JWT tells you who the user is; attestation tells you what the client is.
Apple App Attest (iOS 14+):
- App calls Apple's DeviceCheck API → receives a cryptographic assertion
- App sends assertion with each request
- Server sends assertion to Apple's validation service
- Apple confirms: genuine device + unmodified app binary
Google Play Integrity API (Android):
- Google issues a signed verdict: MEETS_DEVICE_INTEGRITY, MEETS_BASIC_INTEGRITY, or NO_INTEGRITY
- NO_INTEGRITY → rooted device, emulator, or repackaged APK → reject or challenge
Practical rules:
- Enforce on sensitive endpoints only (login, payment, signup) — attestation is slow (~200–500ms round trip)
- Cache the result server-side per device for 30 min — don't call Apple/Google on every request
- Fail open on attestation service errors — fall back to rate limiting
- Flag NO_INTEGRITY as high-risk, don't necessarily block — legitimate users on rooted devices exist
Common Attack Vectors
| Attack | What Happens | Defense |
|---|---|---|
| Credential stuffing | Attacker tries leaked username/password pairs at scale | Rate limit login endpoint, require MFA, flag anomalous login velocity |
| JWT theft | Access token extracted from insecure storage | Short TTL (15 min) + store in memory/Keychain/Keystore, never localStorage |
| Refresh token theft | Long-lived token stolen from cookie or storage | HttpOnly cookie (not readable by JS), token rotation with reuse detection |
| MITM | Attacker intercepts traffic on public WiFi | TLS everywhere + certificate pinning on mobile |
| Replay attack | Attacker captures and replays a valid signed request | Nonce + timestamp window on HMAC-signed APIs |
| IDOR | User A modifies request to access User B's resource | Server-side ownership check on every request — never trust client-provided user IDs |
| Repackaged APK | Attacker strips SSL pinning from app, redistributes | App attestation (Google Play Integrity rejects modified binaries) |
Certificate pinning tradeoff: Pinning prevents MITM but breaks on certificate rotation — requires a forced app update. Use Public Key Pinning (pin the key, not the cert) so cert rotation doesn't break the pin.
Auth Flow — Full Picture
How login, token storage, API requests, and refresh fit together end-to-end:
[Browser / Mobile App]
│
├── 1. POST /auth/login { email, password }
│ ↓
│ [Auth Service]
│ verify password hash (bcrypt)
│ issue access token (JWT, 15 min) ──────────────────────────────┐
│ issue refresh token (opaque, 90 days, stored in DB) ───────────┤
│ ↓ │
│ Response: access_token in body, refresh_token in HttpOnly cookie │
│ │
├── 2. GET /api/resource │
│ Authorization: Bearer <access_token> ◄────────────────────────────┘
│ ↓
│ [API Gateway]
│ validate JWT signature (stateless, no DB call)
│ extract user ID + roles from payload
│ ↓
│ [Backend Service]
│ check resource ownership (IDOR prevention)
│ return data
│
├── 3. Access token expires → 401 received
│
└── 4. POST /auth/refresh
Cookie: refresh_token=... (sent automatically)
↓
[Auth Service]
look up refresh token in DB
validate not revoked, not expired
issue new access token
rotate refresh token (invalidate old, issue new)
↓
Response: new access_token
→ retry original request
Distributed Coordination
How distributed nodes agree on who's in charge, who's alive, and what the current config is.
Coordination Services — etcd, Consul, ZooKeeper
All three solve the same problems: leader election, distributed locks, config management, service discovery. They differ in design philosophy and ecosystem.
| | etcd | Consul | ZooKeeper |
|---|---|---|---|
| Consensus | Raft | Raft | ZAB (similar to Paxos) |
| Primary use | Kubernetes cluster state, Postgres HA (Patroni) | Service discovery + health checks + service mesh | Older Kafka, HBase, Hadoop |
| Data model | Key-value | Key-value + service catalog | Hierarchical (filesystem-like znodes) |
| Relevance today | High — etcd is the modern default | High — strong in service mesh space | Declining — Kafka dropped it in v3.x |
| Consistency | Strong (CP) | Strong (CP) | Strong (CP) |
What they all provide:
- Leader election — only one node acts as primary at a time
- Distributed locks — only one worker processes a job at a time
- Watch / notify — clients watch a key and are notified instantly on change
- Health-based membership — nodes register themselves, dead nodes are removed
Consul in depth — built for microservices:
Service registration and discovery is Consul's primary job. Every service registers itself on startup with its name, address, port, and tags. Other services look up healthy instances by name — via DNS (my-service.service.consul) or HTTP API. No hardcoded IPs anywhere.
Service registration flow:
payment-service starts
→ registers with Consul: "I am payment-service at 10.0.1.5:8080"
→ Consul runs health check every 10s (HTTP /health endpoint)
order-service wants to call payment-service
→ queries Consul: "give me healthy instances of payment-service"
→ Consul returns [10.0.1.5:8080, 10.0.1.9:8080]
→ order-service picks one (or uses DNS round-robin)
payment-service instance crashes
→ health check fails → instance removed from pool automatically
→ order-service's next query gets only healthy instances
Service mesh (Consul Connect): Consul can inject a sidecar proxy next to each service that handles mTLS encryption and traffic policies — without any application code changes. Services talk to their local sidecar; the sidecar handles identity, encryption, and routing.
Multi-datacenter: Built-in. Consul federates across DCs natively. Services in US-East can discover services in EU-West through the same API.
etcd vs Consul for microservices:
| | etcd | Consul |
|---|---|---|
| Best for | Kubernetes cluster state, leader election | Microservice networking, service discovery |
| Health checking | No | Yes — active checks every N seconds |
| Service mesh | No | Yes (Connect sidecar) |
| Multi-DC | No | Yes — native federation |
| Discovery API | Manual (just a KV store) | First-class (DNS + HTTP) |
Kubernetes shops tend to use etcd (it's already there) + kube-dns for discovery. Non-Kubernetes microservice shops reach for Consul.
ZooKeeper — legacy context:
Built at Yahoo in 2007, open-sourced in 2008 — the original distributed coordination tool. Uses ZAB (ZooKeeper Atomic Broadcast), a custom consensus protocol similar to Paxos. Became the backbone for Kafka, HBase, and Hadoop.
Key reasons it's being replaced:
- Kafka dropped it in v3.x (KRaft mode): ZooKeeper was a scaling bottleneck — each partition change required a ZooKeeper round-trip. Kafka now has built-in Raft consensus.
- Heavier to operate: Requires a separate ZooKeeper cluster with its own JVM processes.
- More complex API: Hierarchical filesystem model (znodes) is harder to work with than etcd's flat key-value model.
You'd still see it in legacy Hadoop/HBase stacks or older Kafka deployments (pre-3.x). You wouldn't choose it for a new system today.
Eureka — Netflix OSS, eventual consistency:
Service discovery only — no locks, no KV store, no active health checking. Services self-report their health via periodic heartbeat. Eureka favors availability over consistency: if Eureka's replication network splits, each node keeps serving its own view rather than rejecting queries.
- If a service crashes without deregistering, it stays in the pool for ~90 seconds (heartbeat timeout) before eviction — you can get routed to a dead instance briefly
- Lives in the Spring Cloud / Netflix OSS ecosystem — common in Java microservice shops on AWS
- Simpler to run than Consul but weaker guarantees — acceptable when occasional routing to a dead instance is survivable and you have retry logic
In an interview: Say "I'd use etcd for leader election" or "Consul for service discovery in a microservices stack." ZooKeeper is worth knowing for context (Kafka history) but wouldn't appear in a new design today. Eureka is fine to mention if you're discussing Netflix-style eventual-consistency tradeoffs.
Gossip Protocol
How nodes in a cluster discover each other and spread state without a central coordinator. Used by Cassandra, Redis Cluster, DynamoDB, Consul.
How it works:
- Every node maintains a list of known peers with their state (alive/dead, metadata).
- Every N seconds, each node picks K random peers and exchanges its state list.
- New information spreads like a rumor — each exchange infects more nodes.
- Within O(log n) rounds, all nodes know about all changes.
Node A learns Node D joined
→ A tells B and C
→ B tells E and F, C tells G and H
→ All nodes know about D in log(n) rounds
What it's used for:
- Failure detection — if a node stops gossiping, peers mark it as suspect, then dead
- Membership — who's in the cluster right now?
- Metadata propagation — spread token ranges (Cassandra), shard assignments, config changes
Properties:
- Fully decentralised — no single point of failure
- Eventually consistent — all nodes converge on the same view, but not instantly
- Scales well — O(log n) convergence regardless of cluster size
- Not suitable for strong consistency — use etcd/Raft for that
Real examples: Cassandra uses gossip for all cluster membership. Redis Cluster uses it to propagate slot assignments. Consul uses it for node health state.
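The O(log n) convergence is easy to see in a toy simulation — not any library's protocol, just the "each round, every informed node tells K random peers" rule with arbitrary node and fanout counts:

```java
import java.util.Random;

// Toy gossip simulation: counts rounds until a rumor reaches every node.
public class GossipSim {
    /** Rounds until all nodes know, starting from node 0. Seeded for determinism. */
    static int roundsToConverge(int nodes, int fanout, long seed) {
        Random rnd = new Random(seed);
        boolean[] knows = new boolean[nodes];
        knows[0] = true;
        int rounds = 0;
        while (!allKnow(knows)) {
            boolean[] next = knows.clone();
            for (int n = 0; n < nodes; n++) {
                if (!knows[n]) continue;
                for (int k = 0; k < fanout; k++)
                    next[rnd.nextInt(nodes)] = true; // each informed node infects K random peers
            }
            knows = next;
            rounds++;
        }
        return rounds;
    }

    static boolean allKnow(boolean[] knows) {
        for (boolean b : knows) if (!b) return false;
        return true;
    }
}
```

With 64 nodes and fanout 3, convergence takes on the order of log(n) rounds — the informed set roughly multiplies each round until the last stragglers are picked up.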
Raft Consensus
The algorithm behind etcd, CockroachDB, Consul, and Kafka KRaft. Worth understanding conceptually — you'll reference it when explaining how leader election and replication work.
Core idea: A cluster of N nodes elects one leader. All writes go through the leader. Leader replicates to a majority (quorum) before committing. As long as a majority of nodes are alive, the cluster makes progress.
Three roles:
- Leader — handles all writes, sends heartbeats to followers
- Follower — accepts log entries from leader, votes in elections
- Candidate — node that has timed out waiting for heartbeats and is requesting votes
Leader election:
- Follower receives no heartbeat → becomes Candidate → increments term → requests votes
- First candidate to get majority votes becomes Leader
- Leader sends heartbeats — if followers stop hearing them, new election starts
Log replication:
- Client sends write to Leader
- Leader appends to its log, sends to all Followers
- Once majority confirm → Leader commits → responds to client
- Followers apply committed entries
Quorum = majority = ⌊n/2⌋ + 1. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures.
In an interview: "We'd run etcd with a 3-node cluster using Raft — tolerates one node failure, leader election completes in < 1 second."
Distributed Locks
Prevents two workers from processing the same job or two leaders from writing simultaneously.
Why not application-level locking?
The naive approach: each service instance tracks a set of "locked" resource IDs in its own memory. When it locks a resource, it starts a local timer. When the timer expires, it releases the lock.
This breaks down immediately in a multi-instance deployment:
- No coordination — two instances can both see a resource as unlocked simultaneously and both acquire it. Race condition is guaranteed.
- Crash orphans the lock — if an instance crashes while holding a lock, no other instance knows about it. The resource stays locked indefinitely.
- Scales inversely — the more instances you run, the higher the probability of a collision. Exactly the wrong direction.
Concrete example: Ride Matching Service with 10 instances. A rider requests a match. Instance A locks driver:42 and sends a request. Instance B (handling a different rider) also locks driver:42 at the same moment — neither instance knew the other had it. Driver gets two simultaneous ride requests.
The fix: move the lock out of instance memory into a shared, atomic, external store — which is exactly what Redis SETNX provides.
Why not just use a database transaction lock? DB transaction locks are designed for millisecond-duration operations — they hold for the life of a transaction. Distributed locks are for longer-duration holds, like reserving a concert ticket for 10 minutes while a user completes checkout. Using a DB transaction for that would block the table and kill throughput.
Real-world use cases:
- Ticketing / e-commerce checkout — lock a seat or limited-stock item for ~10 minutes while user completes payment; no one else can grab it
- Ride-share driver assignment — lock a nearby driver when a rider requests; prevents the driver being matched to two riders simultaneously
- Distributed cron jobs — ensure a scheduled task (e.g. daily report aggregation) runs on exactly one server even though many servers are eligible
- Auction bidding — briefly lock an item in the final seconds of bidding to process each bid sequentially and update the highest bid atomically
Locking granularity: You can lock a single resource (one ticket) or a group (all seats in a section). Finer granularity = more concurrency but more lock management overhead.
Option 1 — Redis SETNX (simple, fast):
SET lock:job_123 worker_A NX PX 30000
NX = only set if not exists. PX 30000 = expire in 30 seconds (prevents deadlock if worker crashes).
- Fast (~1ms)
- Single Redis node = single point of failure
- Clock drift can cause two nodes to hold the lock simultaneously
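The SETNX semantics and the ownership check on release can be sketched against an in-memory map standing in for Redis (the PX expiry is the one thing this mock omits — in Redis the TTL is what prevents a crashed holder from deadlocking everyone):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of SETNX-style lock semantics. The release mirrors the standard
// Lua-script pattern: only the holder that set the value may delete the key.
public class SimpleLock {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** SET key owner NX — true only if nobody else holds the lock. */
    public boolean acquire(String resource, String owner) {
        return store.putIfAbsent("lock:" + resource, owner) == null;
    }

    /** Delete only if we still own it — never free someone else's lock. */
    public boolean release(String resource, String owner) {
        return store.remove("lock:" + resource, owner); // atomic compare-and-delete
    }
}
```

The owner check on release matters: without it, a worker whose lock already expired could delete the lock a different worker now holds.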
Option 2 — Redlock (Redis, multiple nodes): Acquire lock on majority (3 of 5) Redis nodes within a time window. More robust than single-node but controversial — clock drift still a theoretical issue.
Option 3 — etcd lock (strongest guarantee): etcd uses Raft — lock acquisition is linearizable. No clock drift issues. Slower than Redis (~5–10ms) but correct.
The Redlock controversy: Martin Kleppmann (author of Designing Data-Intensive Applications) argued that Redlock is unsafe because a process can pause mid-execution (JVM GC pause, OS scheduling) after acquiring the lock. By the time it resumes, the TTL may have expired across all Redis nodes, and another process has acquired the same lock. Both now believe they hold it — two workers execute the critical section simultaneously. Redlock has no way to detect this.
Antirez (Redis creator) rebutted that the scenario requires extreme GC pauses unlikely in practice. The debate was never fully resolved. The practical takeaway: Redlock is fine for workloads where occasional double execution is survivable. For true safety, use etcd with fencing tokens.
Fencing tokens — the correct solution for strong safety:
Every etcd write returns a monotonically increasing revision number. You pass this number to the protected resource as a token. The resource rejects any request whose token is lower than the last one it saw — so even if a stale process wakes up after its lock expired, the resource rejects it.
Process A acquires lock → gets token 42
Process A pauses (GC)
Lock expires → Process B acquires lock → gets token 43 → writes with token 43
Process A wakes up → tries to write with token 42 → resource rejects (42 < 43) ✓
Redis has no equivalent — it cannot detect that a process holding a "valid" lock is actually stale. If you need this level of correctness, use etcd.
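The resource-side check is the whole mechanism — a few lines. A sketch (the class name and fields are illustrative; in practice the "resource" is whatever storage the lock protects):

```java
// Sketch of a resource enforcing fencing tokens: any write carrying a token
// lower than the highest seen is from a stale lock holder — reject it.
public class FencedResource {
    private long highestToken = -1;
    private String value;

    public synchronized boolean write(long token, String newValue) {
        if (token < highestToken) return false; // stale holder woke up late
        highestToken = token;
        value = newValue;
        return true;
    }

    public synchronized String read() { return value; }
}
```

This reproduces the scenario above: token 42 writes, token 43 writes, then the GC-paused holder of token 42 retries and is rejected.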
When to use each:
- Redis SETNX — job deduplication, idempotency, rate limiting. OK if occasional double processing is survivable.
- etcd lock — leader election, critical section where correctness is mandatory (financial, inventory).
Deadlocks: Occur when two processes each hold one lock and wait for the other. Process A holds lock-X and wants lock-Y; Process B holds lock-Y and wants lock-X — both wait forever. Prevention: always acquire multiple locks in a consistent global order, and keep lock acquisition centralized in one place in code so the ordering is easy to enforce and audit.
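The "consistent global order" rule reduces to sorting resource IDs before acquisition — if every process acquires in the same order, no wait cycle can form. A trivial sketch:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch: sort resource IDs so every process acquires locks in the same
// global order — lock-X always before lock-Y, for everyone.
public class OrderedLocking {
    static List<String> acquisitionOrder(Collection<String> resourceIds) {
        List<String> ordered = new ArrayList<>(resourceIds);
        Collections.sort(ordered);
        return ordered;
    }
}
```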
Read-path complexity with Redis locks: When reservations live in Redis rather than the DB, reads (e.g. showing a seat map) need to know which seats are locked. Two patterns:
- Redis Set per event: Maintain an event:{eventId}:reserved Set alongside each lock. Seat map query does one Redis round-trip to get all reserved IDs — fast in practice.
- Write-through to DB: On lock acquire, write a "reserved" status to the DB row. Redis TTL remains the source of truth for expiry; a periodic sweep cleans up stale DB reservations. Seat map reads from DB normally — no extra Redis call.
Concurrency Control at the Database Layer
Distributed locks (Redis, etcd) handle the coarse-grained, longer-duration reservation window. But the actual write — when the ticket is finally booked — needs DB-level concurrency control as the last line of defence. Two mechanisms:
OCC — Optimistic Concurrency Control
Assumes conflicts are rare. Don't lock upfront — detect collisions at write time using a version field.
How it works:
- Read the row — it has version = 5
- Do your work (fill in payment details, build the update)
- Write with a version check: UPDATE ticket SET status='BOOKED', version=6 WHERE id=123 AND version=5
- If another transaction already incremented to version 6, your WHERE matches 0 rows → conflict detected → throw error → caller retries or issues refund
Hibernate / JPA makes this automatic:
@Entity
public class Ticket {
    @Id
    private Long id;

    @Version
    private int version; // Hibernate reads, checks, and increments this automatically
}
Hibernate generates the WHERE version = X check on every UPDATE. If the check fails, it throws OptimisticLockException. Works the same in any JPA provider (EclipseLink, etc.) and Spring Data JPA inherits it directly.
Best for: High-read, low-conflict workloads. Ticket booking (most users don't race), inventory, order creation. Fails loudly on conflict — caller must handle the retry or compensate (e.g. automatic refund).
Pessimistic locking — SELECT FOR UPDATE
Acquires a row-level lock at read time. No other transaction can write that row until you commit or roll back.
BEGIN;
SELECT * FROM ticket WHERE id = 123 FOR UPDATE; -- row locked now
-- do work
UPDATE ticket SET status = 'BOOKED' WHERE id = 123;
COMMIT; -- lock released
DB support:
| DB | Support | Notes |
|---|---|---|
| PostgreSQL | Yes | Full row-level locking |
| MySQL / MariaDB | Yes | InnoDB engine only |
| Oracle | Yes | Also supports SKIP LOCKED |
| SQL Server | Different syntax | WITH (UPDLOCK, ROWLOCK) hints |
| SQLite | No | File-level locking only |
| MongoDB | No | Use findAndModify / conditional writes |
| DynamoDB | No | Use ConditionExpression (same OCC idea, different API) |
| Cassandra | No | Lightweight transactions (IF clause) — expensive, avoid |
SKIP LOCKED variant (PostgreSQL, MySQL 8+, Oracle): Skips rows already locked by another transaction instead of blocking. Perfect for job queues — multiple workers each grab a different job row without blocking each other.
SELECT * FROM jobs WHERE status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 1;
Best for: Short-duration critical sections where blocking is acceptable — final booking step, inventory decrement, job queue polling. Not for the 10-minute reservation window (blocks the row for too long, kills concurrency).
Coupling caveat: SELECT FOR UPDATE ties you to a relational DB that supports it. SQL Server uses different syntax (WITH (UPDLOCK, ROWLOCK) hints). SQLite, DynamoDB, MongoDB, and Cassandra don't support it at all — you'd have to rearchitect concurrency control entirely if you ever migrated. Even with an ORM, LockModeType.PESSIMISTIC_WRITE (Hibernate/JPA) abstracts the syntax but not the requirement for a relational DB with row locking.
OCC has no such coupling — the version check pattern (WHERE version = X) is implementable on any storage engine: relational DBs, DynamoDB (ConditionExpression), MongoDB (findAndModify with a version filter), even Redis with a Lua script. Prefer OCC by default; reach for SELECT FOR UPDATE only when you specifically need blocking semantics (e.g. SKIP LOCKED for job queues, where OCC would cause constant retry storms under high contention).
The layered model for ticketing (or any reservation system):
Layer 1 — Redis distributed lock (TTL ~10 min)
Holds the seat for one user during checkout
Fast, doesn't touch DB, releases on TTL or explicit unlock
Layer 2 — DB OCC (version check) or SELECT FOR UPDATE
Fires at the moment of final booking write
Safety net: if Redis lock fails or TTL races, DB ensures only one write wins
If both layers fail → idempotency key on the request as last resort
TTL-expires-mid-payment: If User A's lock TTL expires during payment, User B can grab it. When both try to write, OCC ensures only one succeeds. The loser gets a conflict error → issue automatic refund via the payment processor. Mitigation: set TTL generously (10 min for checkout) and extend the TTL when payment is initiated.
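A minimal in-memory sketch of the Layer 1 lock semantics: acquire with a TTL, token-checked release. In real Redis this is `SET key token NX EX ttl`, with a Lua script for the atomic token-checked release; the TTLLock class and seat key here are illustrative:

```python
# Toy reservation lock with TTL and per-holder token, simulated in memory.
# The token check prevents User A from releasing a lock that already
# expired and was re-acquired by User B.

import time

class TTLLock:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.locks = {}  # key -> (token, expires_at)

    def acquire(self, key, token, ttl_s):
        entry = self.locks.get(key)
        now = self.clock()
        if entry and entry[1] > now:
            return False  # held and not expired
        self.locks[key] = (token, now + ttl_s)
        return True

    def release(self, key, token):
        entry = self.locks.get(key)
        if entry and entry[0] == token:  # only the current holder may release
            del self.locks[key]
            return True
        return False

now = [0.0]
lock = TTLLock(clock=lambda: now[0])

assert lock.acquire("seat:12A", token="userA", ttl_s=600)
assert not lock.acquire("seat:12A", token="userB", ttl_s=600)  # seat held

now[0] += 601  # A's TTL expires mid-checkout
assert lock.acquire("seat:12A", token="userB", ttl_s=600)      # B grabs it
assert not lock.release("seat:12A", token="userA")             # A's stale release is a no-op
```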
Service Discovery
How services find each other's IP addresses in a dynamic environment where instances come and go.
The problem: Service A needs to call Service B. Service B runs on 10 dynamic instances whose IPs change on every deploy. Hardcoding IPs doesn't work.
Two patterns:
Client-side discovery:
Service A → query Consul/etcd → get list of Service B instances → pick one → call directly
Service A is responsible for load balancing. More control, more complexity in the client.
Server-side discovery:
Service A → call load balancer → LB queries registry → routes to Service B instance
Client is dumb — just hits one endpoint. LB handles discovery. Kubernetes Service works this way.
Health checks: Registry continuously health-checks registered instances. Unhealthy instances removed from the pool automatically. New instances register themselves on startup.
Real examples:
- Kubernetes: pods register with kube-dns, Services route via iptables rules
- AWS ECS + ALB: tasks register with target group, ALB discovers healthy tasks
- Consul: services self-register with health check URLs, clients query Consul DNS
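The client-side pattern can be sketched with a toy registry and a round-robin picker in the client (Registry and Client are illustrative stand-ins for Consul/etcd plus a client-side library):

```python
# Client-side discovery sketch: the registry tracks health, the client
# queries it and load-balances across healthy instances itself.

import itertools

class Registry:
    def __init__(self):
        self.instances = {}  # service -> {addr: healthy}

    def register(self, service, addr):
        self.instances.setdefault(service, {})[addr] = True

    def mark_unhealthy(self, service, addr):
        self.instances[service][addr] = False  # failed a health check

    def healthy(self, service):
        return [a for a, ok in self.instances.get(service, {}).items() if ok]

class Client:
    def __init__(self, registry, service):
        self.registry = registry
        self.service = service
        self.rr = itertools.count()

    def pick_instance(self):
        pool = self.registry.healthy(self.service)
        if not pool:
            raise RuntimeError("no healthy instances")
        return pool[next(self.rr) % len(pool)]  # round-robin in the client

reg = Registry()
for addr in ("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"):
    reg.register("service-b", addr)
reg.mark_unhealthy("service-b", "10.0.0.2:8080")  # removed from the pool

client = Client(reg, "service-b")
picked = {client.pick_instance() for _ in range(4)}
```

In server-side discovery, this picking logic moves out of the client and into the load balancer.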
Change Data Capture (CDC)
Streams every database insert, update, and delete to an immutable event log — without changing application code. The DB stays optimised for CRUD; the event stream provides audit durability, downstream sync, and analytics.
NFR triggers: Durability, auditability, compliance (PCI-DSS, SOX), event-driven architecture.
How it works:
[App] → [Primary DB] → [CDC Connector reads WAL/binlog] → [Kafka] → [Consumers]
(mutable, fast) (immutable, durable)
- App writes to DB normally — no code changes needed
- CDC connector (Debezium, AWS DMS) reads the database transaction log (WAL for Postgres, binlog for MySQL)
- Every committed change emitted as a before/after event to Kafka
- Consumers process events independently: audit service, analytics, reconciliation, webhook delivery
Example CDC events:
{ "op": "insert", "table": "payment_intents",
"after": { "id": "pi_123", "status": "created", "amount": 2500 } }
{ "op": "update", "table": "payment_intents",
"before": { "id": "pi_123", "status": "created" },
"after": { "id": "pi_123", "status": "authorized" } }
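A sketch of what a downstream consumer might do with these events, here building an append-only audit record per change (in production the events would arrive from Kafka; the field names follow the examples above):

```python
# Toy CDC consumer: turn before/after events into an append-only audit trail.

events = [
    {"op": "insert", "table": "payment_intents",
     "after": {"id": "pi_123", "status": "created", "amount": 2500}},
    {"op": "update", "table": "payment_intents",
     "before": {"id": "pi_123", "status": "created"},
     "after": {"id": "pi_123", "status": "authorized"}},
]

audit_log = []  # append-only: never updated, never deleted

for ev in events:
    after = ev.get("after") or {}
    before = ev.get("before") or {}
    # Record only the fields that actually changed in this event.
    changed = {k: v for k, v in after.items() if before.get(k) != v}
    audit_log.append({
        "table": ev["table"],
        "op": ev["op"],
        "row_id": after.get("id") or before.get("id"),
        "changed_fields": changed,
    })
```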
Key guarantee: If a change commits to the DB, it appears in the event stream. CDC operates at the database log level — no application code can accidentally skip it.
Common use cases:
- Audit trails for PCI-DSS, SOX, GDPR compliance
- Syncing operational DB to analytics / data warehouse
- Cache invalidation (update Redis when DB row changes)
- Cross-service replication without dual writes
- Reconciliation (correlate internal events with external payment network events)
Tools:
- Debezium — open source, supports Postgres/MySQL/MongoDB/Oracle/SQL Server. Two deployment modes (see below).
- AWS DMS — managed CDC to AWS targets. No Kafka required.
- Kafka Connect — framework that hosts Debezium connectors in the Kafka ecosystem.
- Fivetran / Airbyte — managed CDC for analytics pipelines.
Debezium deployment modes — Kafka is not required:
| Mode | Architecture | When to use |
|---|---|---|
| Debezium + Kafka Connect | WAL → Debezium → Kafka → consumers | Kafka already in your stack. Multiple downstream consumers (ES + cache + audit all read independently). Need replay if a consumer goes down. |
| Debezium Server | WAL → Debezium Server → sink directly | Don't want Kafka ops overhead. Single downstream target. Sinks: Kinesis, Pub/Sub, Redis Streams, HTTP webhook, Elasticsearch directly. |
What you lose without Kafka in the middle:
- Replay — if Elasticsearch goes down for 2 hours, with Kafka it catches up from its last offset on recovery. Without Kafka, those events are gone.
- Fan-out — with Kafka, multiple consumers read the same stream independently (ES + cache invalidation + audit). Without Kafka, you need separate Debezium instances per sink.
- Durability buffer — Kafka absorbs spikes if the sink is slow or temporarily down.
Interview guidance:
- If Kafka is already in your stack → Kafka Connect + Debezium. Fan-out and replay come free.
- If you only need one downstream target and want less infra → Debezium Server directly to that sink.
- On AWS → AWS DMS is a fully managed alternative: no Debezium to operate, integrates natively with RDS, Aurora, and DynamoDB as targets.
- Always mention the replication slot gotcha for PostgreSQL: Postgres holds WAL segments until Debezium confirms it has read them. If Debezium falls behind or stops, the WAL grows without bound and can fill your disk. Monitor replication slot lag as a first-class alert.
Retention: Hot retention on Kafka (7–30 days for fast replay). Archive to S3 with write-once policy for permanent compliance storage.
Tradeoffs:
- Adds infrastructure (Kafka, connectors)
- Eventual consistency between DB and event stream (~milliseconds)
- Schema changes require planning (adding columns can break consumers)
CDC as a SPOF concern: If the CDC connector stops, DB writes continue but events stop flowing — missed audit records downstream. Mitigations: run redundant CDC instances, aggressive lag monitoring (alert within seconds), maintain replay procedures from WAL offsets to backfill missed events.
Interview one-liner: "We use CDC to stream DB changes to an immutable Kafka log — audit durability without changing application code or impacting primary DB performance."
Object Storage & File Upload
Patterns for handling user-generated file uploads at scale — photos, videos, documents — without routing large payloads through your backend servers.
Terminology used in this section:
- File — the logical entity tracked in your database (has a fileId, metadata, status)
- Object — the physical storage unit in S3 (what S3 actually stores)
- Upload — the act of transferring bytes from client to S3
- Presigned URL — a short-lived, signed URL that grants temporary permission to read or write a specific S3 object
Simple Upload — Presigned URL Pattern
Never proxy file uploads through your backend. At scale, routing even 10MB photos through your API servers exhausts bandwidth, memory, and connection limits. Instead, generate a short-lived presigned URL and let the client upload directly to S3.
Flow:
1. App requests upload permission from backend
POST /uploads { trip_id, photo_type: "pickup", file_size, mime_type }
→ Backend validates: does this user own this trip?
2. Backend generates a presigned S3 URL
→ Scoped to exact key: trips/{trip_id}/pickup/{uuid}.jpg
→ Short TTL: 5–15 minutes
→ Returns { upload_url, key } to client
3. Client uploads directly to S3
PUT {upload_url} ← file bytes go here, not to your servers
→ No auth headers needed — permissions baked into the URL
4. S3 triggers async processing (event notification)
→ s3:ObjectCreated event → SQS/Lambda → processing pipeline
→ Do NOT rely on client notifying backend — client can crash, go offline, or lie
S3 key design matters: Scope the presigned URL to the exact key (trips/{trip_id}/pickup/{uuid}.jpg), not a prefix. Otherwise a malicious user could overwrite another user's files by constructing a different key. The key should include a UUID to prevent collisions.
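The backend's validate-then-scope step might look like this sketch. The size limit and allowed types are assumptions; the returned key is what a real implementation would pass to boto3's `generate_presigned_url("put_object", ...)`:

```python
# Sketch of request validation + exact-key construction before signing.
# Limits, type map, and function names are illustrative.

import uuid

ALLOWED_TYPES = {"image/jpeg": ".jpg", "image/png": ".png"}
MAX_SIZE = 10 * 1024 * 1024  # 10 MB, illustrative

def build_upload_key(trip_id, photo_type, file_size, mime_type, user_owns_trip):
    if not user_owns_trip:
        raise PermissionError("user does not own this trip")
    if mime_type not in ALLOWED_TYPES:
        raise ValueError("unsupported content type")
    if file_size > MAX_SIZE:
        raise ValueError("file too large")
    # UUID prevents collisions; scoping the signature to this exact key
    # prevents a malicious client from overwriting someone else's object.
    ext = ALLOWED_TYPES[mime_type]
    return f"trips/{trip_id}/{photo_type}/{uuid.uuid4()}{ext}"

key = build_upload_key("t42", "pickup", 2_000_000, "image/jpeg", user_owns_trip=True)
```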
For large files, a single PUT breaks down in four ways. That's where multipart upload comes in.
Multipart Upload — Large Files
For files larger than ~100MB, the same presigned URL pattern applies — but instead of a single PUT, the file is split into chunks and each chunk gets its own presigned URL.
Why a single upload fails for large files
A single-request upload of a large file fails in four ways:
- Timeouts — servers, load balancers, and clients have timeout limits (typically 30s–2 min). A 50GB file on a 100Mbps connection takes 1.1 hours (50GB × 8 / 100Mbps = 4,000s). The connection times out long before the upload finishes.
- Payload size limits — Amazon API Gateway hard-limits requests to 10MB with no override. Most infrastructure has similar caps. A 50GB file simply cannot be sent in a single request.
- Network interruptions — if the connection drops mid-upload, the entire transfer must restart from zero.
- No progress visibility — the user has no indication of how far along the upload is or whether it's working.
The fix: split the file into chunks and upload each independently.
Identifying files by content (fingerprinting)
Large file uploads introduce two new problems: how do you resume an interrupted upload, and how do you avoid uploading the same file twice? The answer is fingerprinting — computing a SHA-256 hash of the file's content before any upload begins.
You cannot identify a file by its name — two different users can upload files with the same name, and the same user can rename a file. Instead, compute a fingerprint: a SHA-256 hash of the file's content. Same content = same fingerprint, regardless of filename or who uploaded it.
Two fields in your metadata table serve different purposes:
FileMetadata table:
| fileId (UUID) | fingerprint (SHA-256) | ownerId | status |
|---------------|----------------------------|---------|-----------|
| f1 | a3f5c8... | user1 | uploaded |
| f2 | a3f5c8... | user2 | uploading |
- fileId — unique identifier for this file record (UUID, always unique)
- fingerprint — hash of the content, used for deduplication and resumability
The fingerprint answers two questions before any upload begins:
- "Have I uploaded this file before?" — check if a record with this fingerprint exists for this user
- "Is there an in-progress upload I can resume?" — check if a record with this fingerprint has status uploading
Chunk-level fingerprinting: Each chunk also gets its own SHA-256 hash. This lets you identify exactly which chunks have already been uploaded during a resume, without re-uploading or re-checking all of them.
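Fingerprinting is just hashing, whole file and per chunk. A sketch with hashlib (the chunk size here is tiny for illustration; production uses 5–10 MB):

```python
# Whole-file and per-chunk SHA-256 fingerprints, as used for dedup and resume.

import hashlib

def fingerprints(data, chunk_size):
    whole = hashlib.sha256(data).hexdigest()
    chunks = [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
              for i in range(0, len(data), chunk_size)]
    return whole, chunks

file_a = b"x" * 12 + b"y" * 12
file_b = b"x" * 12 + b"z" * 12  # same first half, different second half

whole_a, chunks_a = fingerprints(file_a, chunk_size=12)
whole_b, chunks_b = fingerprints(file_b, chunk_size=12)

# Whole-file hashes differ, but the shared first chunk hashes the same,
# so a resume or dedup check can skip re-uploading chunk 0.
```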
How S3 multipart upload works
S3 multipart upload has three API calls that tie everything together:
1. CreateMultipartUpload — your backend calls this before any chunk is uploaded. S3 returns an uploadId — a session identifier that ties all the parts together. Every presigned URL generated for a part must include this uploadId and the part's number.
2. Upload parts — the client uploads each chunk directly to S3 using its presigned URL (which embeds uploadId + partNumber). S3 stores each part separately and returns an ETag for each.
3. CompleteMultipartUpload — once all parts are uploaded, your backend calls this with the list of part numbers and their ETags. S3 assembles all parts into a single object. Only after this call does the object exist as a complete unit in S3.
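A toy in-memory version of the three calls, showing how uploadId, part numbers, and ETags tie together. FakeS3 is a stand-in, not the real API surface; its ETags are the MD5 of each part's bytes, which mimics S3's behaviour for plainly uploaded parts:

```python
# In-memory simulation of CreateMultipartUpload / UploadPart /
# CompleteMultipartUpload.

import hashlib
import uuid

class FakeS3:
    def __init__(self):
        self.sessions = {}  # uploadId -> {"key": ..., "parts": {n: (etag, bytes)}}
        self.objects = {}   # key -> assembled bytes

    def create_multipart_upload(self, key):
        upload_id = str(uuid.uuid4())
        self.sessions[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, body):
        etag = hashlib.md5(body).hexdigest()
        self.sessions[upload_id]["parts"][part_number] = (etag, body)
        return etag

    def complete_multipart_upload(self, upload_id, parts):
        # parts: list of (partNumber, etag) the backend believes were uploaded
        session = self.sessions.pop(upload_id)
        body = b""
        for number, etag in sorted(parts):
            stored_etag, data = session["parts"][number]
            assert stored_etag == etag, "ETag mismatch"
            body += data  # parts assembled in part-number order
        self.objects[session["key"]] = body

s3 = FakeS3()
uid = s3.create_multipart_upload("videos/v1.bin")
etags = [(n, s3.upload_part(uid, n, chunk))
         for n, chunk in [(1, b"AAAA"), (2, b"BBBB"), (3, b"CC")]]
s3.complete_multipart_upload(uid, etags)
```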
Full upload flow end to end
[Client]
1. Chunk file into 5–10 MB pieces
Compute SHA-256 fingerprint for the whole file
Compute SHA-256 fingerprint for each chunk
2. GET /files?fingerprint={hash}
→ Does a file with this fingerprint already exist for this user?
→ If status = "uploaded" → skip, file already exists, no upload needed
→ If status = "uploading" → fetch existing chunk statuses, resume from where it left off
→ If not found → proceed to initiate
3. POST /files/multipart-init
{ fingerprint, filename, totalChunks, chunkFingerprints[] }
→ Backend calls S3 CreateMultipartUpload → gets uploadId
→ Backend generates a presigned URL per chunk (each scoped to uploadId + partNumber)
→ Backend saves FileMetadata (status: "uploading") + FileChunks (all status: "pending")
← Returns { fileId, uploadId, presignedUrls[] }
4. For each chunk (in parallel):
PUT {presignedUrl} ← chunk bytes go directly to S3
← S3 returns ETag for that chunk
PATCH /files/{fileId}/chunks
{ chunkId, etag }
→ Backend calls S3 ListParts → verifies ETag actually exists in S3
→ Only if confirmed: update FileChunks status to "uploaded"
5. Once all chunks are "uploaded":
→ Backend calls S3 CompleteMultipartUpload with all partNumbers + ETags
→ S3 assembles all parts into one object
→ Backend updates FileMetadata status to "uploaded"
→ Processing pipeline triggered (virus scan, resize, etc.)
At this point:
- All chunks have been uploaded directly to S3
- Each ETag has been verified against S3's ListParts API
- The DB reflects confirmed chunk state
- Your backend has called CompleteMultipartUpload
- The object exists as a single unit in S3, ready for processing
Verifying chunk uploads with ETags
S3 and your backend are completely separate systems — when a client uploads a chunk to S3, your backend has no idea it happened. The client bridges the gap by sending a PATCH after each chunk upload.
PATCH (not POST or PUT) because you're partially updating an existing resource — just the status of one chunk, not creating or replacing the whole file.
Your DB tracks chunk state:
FileChunks table:
| fileId | chunkId | fingerprint | etag | status |
|--------|---------|-------------|---------------------|----------|
| f1 | chunk1 | 3d4e9a... | d41d8cd98f00b204... | uploaded |
| f1 | chunk2 | 7bc12f... | null | pending |
| f1 | chunk3 | 9fa33c... | null | pending |
The trust problem: Nothing stops a malicious user from sending a PATCH marking all chunks as uploaded without uploading anything. Your DB says complete; S3 has nothing. The file is broken and your system is in an inconsistent state.
The fix — ETag verification: ETags can't be faked — S3 generates them from the actual bytes received. Your backend verifies each ETag by calling S3's ListParts API, which returns all parts S3 actually holds for that uploadId. A fabricated ETag won't match — the fraud is caught before the DB is updated.
Why not use S3 event notifications? S3 only fires an ObjectCreated event when CompleteMultipartUpload is called — not for individual part uploads. You cannot use S3 events for per-chunk progress tracking. ListParts is the right API.
Trust but verify: The client drives progress updates for real-time UX. Your backend independently verifies against S3 before committing state to the DB.
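The verification step in miniature: only a client-reported ETag that matches what ListParts actually returned gets committed to the DB. The ListParts response shape is simulated here; the ETag values are illustrative:

```python
# Verify a client-reported chunk ETag against a (simulated) S3 ListParts
# response before marking the chunk "uploaded" in the DB.

def verify_chunk(client_etag, part_number, list_parts_response):
    # list_parts_response mimics ListParts: a list of {PartNumber, ETag}
    actual = {p["PartNumber"]: p["ETag"] for p in list_parts_response}
    return actual.get(part_number) == client_etag

s3_parts = [{"PartNumber": 1, "ETag": "d41d8cd98f00b204"},
            {"PartNumber": 2, "ETag": "9e107d9d372bb682"}]

honest = verify_chunk("9e107d9d372bb682", 2, s3_parts)      # matches S3
fabricated = verify_chunk("deadbeefdeadbeef", 3, s3_parts)  # part never uploaded
```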
Downloads don't need chunking
Once CompleteMultipartUpload is called, S3 assembles all parts into a single object. From that point on, downloads work like any normal file — the client gets one presigned URL (or CDN signed URL) and downloads the complete file. The original chunk boundaries are gone.
For very large files, HTTP natively supports Range requests: the client can request specific byte ranges in parallel or resume an interrupted download without starting over. The client doesn't need to know anything about the original chunk structure — S3 and HTTP handle it transparently.
At this point the object exists in S3 — but it hasn't been scanned, stripped of metadata, or validated. It is not yet safe to serve to users.
Processing Pipeline — Two-Bucket Strategy
Raw user uploads must never reach users directly. An uploaded object could contain malware, unstripped GPS metadata, illegal content, or simply be corrupt. The two-bucket pattern enforces a processing gate between upload and serving.
Architecture:
[Client]
↓ (presigned URL upload)
S3 Raw Bucket ← private, no public access, no CDN
↓ (s3:ObjectCreated event)
SQS / Lambda trigger
↓
Async Processing Workers:
├── Virus / malware scan ← reject and quarantine if infected
├── EXIF strip ← remove GPS coords, device ID, timestamps
├── Image validation + resize ← reject corrupt files, generate thumbnails
├── ML content moderation ← detect NSFW, violence, illegal content
└── Perceptual hash (dedup) ← detect re-uploaded duplicate content
↓ (only on pass)
S3 Processed Bucket ← CDN-accessible, public reads allowed
↓
CloudFront CDN → Users
At this point:
- The object has passed malware scanning, EXIF stripping, validation, and content moderation
- It has been copied to the processed bucket
- It is accessible via CloudFront CDN
- Users will only ever receive CDN URLs — never raw S3 URLs
Why each processing step matters:
- Virus scan — user uploads are untrusted. Serving an infected file to other users is a serious incident.
- EXIF strip — photos contain embedded metadata: GPS coordinates (exact home address), device model, serial number, timestamp. Strip before serving to protect user privacy.
- Validation + resize — reject zero-byte files, non-images claiming to be images, oversized files. Generate thumbnail variants (128px, 512px, 1080px) so clients don't download 12MP originals for a thumbnail slot.
- ML content moderation — detect NSFW or illegal content automatically. Flag for human review or auto-reject based on confidence threshold.
- Perceptual hashing — generate a hash of the image's visual content (not file bytes). Two identical photos re-encoded as different files have the same perceptual hash. Detect re-uploads, spam, and copyrighted content.
Users never get direct S3 URLs. They get CloudFront CDN URLs pointing to the processed bucket. Direct S3 URLs bypass CDN caching, expose your bucket structure, and can't be invalidated cleanly.
Trigger processing from S3, not from client: S3 event notifications (s3:ObjectCreated) trigger the processing pipeline directly. Don't depend on the client calling your backend to say "I finished uploading" — the client can crash, lose connectivity, or be malicious. If the S3 event triggers processing, you get guaranteed delivery regardless of client behaviour.
What-if: processing fails mid-pipeline?
If a worker crashes after malware scan but before EXIF strip, the file stays in the raw bucket — never promoted to the processed bucket, never visible to users. The SQS message re-queues and the worker retries from the start. Each processing step should be idempotent — re-running is safe. The file only moves to the processed bucket after all steps pass.
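The idempotency property can be sketched as a pipeline whose steps are safe to re-run: a simulated crash after the scan step leaves the file unpromoted, and the retry completes cleanly. Step names follow the diagram above; the crash flag is illustrative:

```python
# Idempotent processing pipeline sketch: retries always re-run from the
# start, and re-running completed steps is harmless.

def make_pipeline():
    state = {"scanned": False, "exif_stripped": False, "promoted": False, "runs": 0}

    def scan(s):
        s["scanned"] = True          # safe to re-run
    def strip_exif(s):
        s["exif_stripped"] = True    # safe to re-run
    def promote(s):
        s["promoted"] = True         # copy to processed bucket, only on full pass

    def run(crash_after_scan=False):
        state["runs"] += 1
        scan(state)
        if crash_after_scan:
            raise RuntimeError("worker died")  # SQS re-queues the message
        strip_exif(state)
        promote(state)

    return state, run

state, run = make_pipeline()
try:
    run(crash_after_scan=True)   # first attempt dies mid-pipeline
except RuntimeError:
    pass
assert not state["promoted"]     # file never reached the processed bucket
run()                            # retry from the start
```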
What-if: orphaned raw uploads?
A presigned URL expires after 15 minutes, but a file might be uploaded and then fail processing permanently (corrupt file that always fails validation). Apply an S3 lifecycle policy on the raw bucket: delete objects older than 24 hours that haven't been promoted. This keeps the raw bucket clean without manual intervention.
Interview answer: "Uploads go directly from the client to S3 via presigned URL — no bytes through our servers. On object creation, S3 fires an event to SQS, which triggers our async processing pipeline: malware scan, EXIF strip, resize, content moderation. Each step is idempotent — if a worker crashes it retries safely. Only on full pass does the object get copied to the processed bucket behind CloudFront. Users only ever receive CDN URLs, never raw S3 URLs. Orphaned objects in the raw bucket are cleaned up by a lifecycle policy after 24 hours."
Once a file passes all processing steps and lands in the processed bucket, it's ready to be served. The next question is how to control who can access it.
Download — Presigned URL Pattern
The upload pattern gets objects into S3. This pattern controls who can read them back out.
The problem: Your S3 bucket is private (no public access). You need to let an authorised user download their file — but you don't want to proxy the bytes through your backend, and you don't want to make the bucket public.
The solution: Generate a short-lived presigned GET URL server-side. The client uses it to fetch the object directly from S3. The URL expires after a set time — no valid URL, no access.
Flow:
1. Client requests a file
GET /files/{file_id}
Authorization: Bearer <access_token> ← your normal JWT auth
2. Backend checks permission
→ Does this user own this file? Can they access it?
→ If yes, generate a presigned S3 GET URL for the key
→ Short TTL: 5–15 minutes
→ Return { download_url } to client — do NOT return the raw S3 key
3. Client fetches the object directly from S3
GET {download_url} ← file bytes come from S3, not your servers
→ URL includes signature — S3 validates it, no auth header needed
→ After TTL expires, URL returns 403 — even if someone shares the link
Why not just make the bucket public?
- Anyone with any S3 URL can access any object — no per-user access control
- No way to revoke access to a specific file or user
- Exposes your entire bucket structure
TTL tradeoffs:
| TTL | Use case |
|---|---|
| 1–5 min | Sensitive documents, medical records, payment receipts |
| 15 min | Standard user files, photos |
| 1–24 hours | Downloads the user explicitly triggered (export, backup) |
| 7 days | Shared links the user sends to others |
Never return the S3 key to the client. Return only the presigned URL. If the client knows the key pattern (users/123/profile.jpg) they can guess other users' keys.
Presigned URLs vs CloudFront signed URLs:
| | S3 Presigned URL | CloudFront Signed URL |
|---|---|---|
| Served from | S3 directly | CDN edge node (fast, cached) |
| Use when | One-off downloads, low traffic | High-traffic files, media, anything needing CDN speed |
| Revocation | URL expires naturally | Can invalidate at CDN layer |
| Cost | S3 GET cost per request | CloudFront pricing, but fewer origin hits |
For a profile photo viewed thousands of times a day — use CloudFront signed URLs so the object is cached at the edge. For a tax document downloaded once a year — a plain S3 presigned URL is fine.
Generating a presigned URL requires knowing whether the requesting user actually has permission. That's handled by the access control layer.
File Access Control — Data Modelling
The download pattern checks "does this user have access?" before issuing a presigned URL. This section covers how you store and query those permissions efficiently.
Naive approach — sharelist embedded in file metadata:
Store a list of permitted user IDs directly on the file record:
Files table:
| fileId | ownerId | s3Key | sharedWith |
|--------|---------|-------------|---------------------|
| f1 | user1 | docs/f1.pdf | [user2, user3, ...] |
Works for small scale. Breaks quickly:
- To find all files shared with user2, you must scan every file row and inspect the list — no efficient query
- The list grows unbounded as more users are added
- Updating sharing permissions means rewriting the entire array
Normalised approach — SharedFiles mapping table:
Extract sharing into its own table. Each row is one (user, file) permission pair.
SharedFiles table:
| userId (Partition Key) | fileId (Sort Key) |
|------------------------|-------------------|
| user1 | fileId1 |
| user1 | fileId2 |
| user2 | fileId3 |
userId and fileId together form a composite primary key — the same user can have many files, the same file can be shared with many users. The file metadata no longer needs a sharelist.
Queries this enables:
| Query | How |
|---|---|
| All files shared with user2 | Query SharedFiles WHERE userId = user2 |
| All users who can access fileId1 | Query SharedFiles WHERE fileId = fileId1 |
| Does user2 have access to fileId1? | Point lookup: userId=user2, fileId=fileId1 |
| Revoke user2's access to fileId1 | Delete row: userId=user2, fileId=fileId1 |
All are efficient indexed lookups — no table scans, no array manipulation.
SQL vs DynamoDB:
| | SQL | DynamoDB |
|---|---|---|
| Primary key | Composite primary key on (userId, fileId) | userId as partition key, fileId as sort key |
| Query by user | SELECT * FROM SharedFiles WHERE userId = ? | Query on partition key — O(1) |
| Query by file | SELECT * FROM SharedFiles WHERE fileId = ? (needs index) | Requires a Global Secondary Index (GSI) on fileId |
In DynamoDB, querying by fileId (all users who can access a file) requires a GSI — partition key fileId, sort key userId. Add it if that query is needed (e.g. showing a file's collaborator list).
How it connects to the download flow:
GET /files/{fileId}
Authorization: Bearer <access_token> ← identifies userId
Backend:
1. Extract userId from JWT
2. Point lookup: SharedFiles WHERE userId=X AND fileId=Y
(or check ownerId on Files table — owner always has access)
3. If row exists → generate presigned S3 GET URL → return to client
4. If no row → 403 Forbidden
The permission check is a single indexed lookup before the presigned URL is issued. S3 itself stays fully private — it never makes access decisions, it just honours the signed URL your backend produced.
Owner access: The file owner doesn't need a row in SharedFiles. Check ownerId on the Files table separately, or insert an owner row at upload time — whichever keeps the query path simpler.
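The whole gate is a couple of lookups. A sketch with in-memory stand-ins for the Files and SharedFiles tables (status codes and the URL format are illustrative; a real implementation would sign the URL with boto3):

```python
# Permission check before issuing a presigned GET URL: owner check on
# Files, point lookup on SharedFiles, 403 otherwise.

files = {"f1": {"ownerId": "user1", "s3Key": "docs/f1.pdf"}}
shared = {("user2", "f1")}  # one row per (userId, fileId) permission pair

def get_download_url(user_id, file_id):
    file = files.get(file_id)
    if file is None:
        return 404, None
    is_owner = file["ownerId"] == user_id
    if not is_owner and (user_id, file_id) not in shared:
        return 403, None  # S3 never sees the request
    # Real code: boto3 generate_presigned_url("get_object", ..., ExpiresIn=900)
    return 200, f"https://s3.example/presigned/{file['s3Key']}"

owner_status, _ = get_download_url("user1", "f1")
shared_status, _ = get_download_url("user2", "f1")
denied_status, _ = get_download_url("user3", "f1")
```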
Interview answer: "Access control is a normalised SharedFiles table — one row per (userId, fileId) pair. Before issuing any presigned URL, we do a point lookup on that table. If no row exists, we return 403 — S3 never sees the request. The file owner is checked separately on the Files table. In DynamoDB, we use userId as the partition key and fileId as the sort key, with a GSI on fileId for listing a file's collaborators."
Access control determines who can request a URL. Security layers determine how the file itself is protected in storage and transit.
File Security
Four layers that work together to keep files secure.
Layer 1 — Encryption in transit (HTTPS)
All communication between client and server uses HTTPS (TLS). Data is encrypted in the network pipe — an attacker intercepting traffic sees only ciphertext. This applies to every request: login, presigned URL generation, and the direct S3 upload/download.
This is a baseline requirement — not a design decision. Every modern file storage system uses it.
Layer 2 — Encryption at rest (S3 server-side encryption)
Objects stored in S3 are encrypted at rest using AES-256. When an object is uploaded, S3 generates a unique encryption key for it, encrypts the object with that key, and stores the key separately. If an attacker gains physical access to the storage hardware, they get encrypted bytes with no key — the file is unreadable.
S3 offers three options:
| Option | Who manages the key | Use when |
|---|---|---|
| SSE-S3 | AWS manages everything | Default — no operational overhead |
| SSE-KMS | AWS KMS holds the key, you control access policy | Audit trail needed, fine-grained key access control |
| SSE-C | You provide the key on every request | Regulatory requirement to hold your own keys |
For most systems SSE-S3 is sufficient. SSE-KMS adds a CloudTrail audit log of every decryption — useful for compliance. SSE-C means you manage key rotation and storage yourself.
Layer 3 — Access control (who can request a URL)
The SharedFiles table (covered above) is your ACL. Before generating any presigned URL, check that the requesting user has a row in SharedFiles for that file (or is the owner). If not — 403, no URL issued.
This is the gate. Layers 1 and 2 protect the bytes in storage and transit. This layer controls who can even get a URL to start.
Layer 4 — Signed URLs: TTL, bearer token risk, and higher security options
A signed URL is a bearer token — anyone who holds a valid, unexpired URL can download the file. No auth header required. S3 doesn't know or care who is making the request; it only checks that the URL signature is valid and the TTL hasn't expired.
This means: if an authorised user intentionally or accidentally shares a presigned URL (posts it publicly, sends it to the wrong person), that URL works for anyone until it expires. A short TTL (5–15 min) limits the exposure window but doesn't fully prevent sharing.
How CloudFront signed URL validation works:
- Your backend generates a URL signed with your private key — incorporating the URL path, expiration timestamp, and optionally extra restrictions
- The signed URL is returned to the authorised user
- When the user (or anyone with the URL) hits CloudFront, CloudFront verifies the signature using your registered public key
- CloudFront checks: valid signature? not expired? restrictions match? If yes → serve content. If no → 403
The private key never leaves your backend. CloudFront only holds the public key. This means only your backend can generate valid signed URLs — S3 or CloudFront cannot be fooled with a forged URL.
Hardening options for higher-security scenarios:
| Option | How it works | Use when |
|---|---|---|
| Short TTL | URL expires in 5–15 min | Default — limits accidental sharing window |
| IP binding | Signed URL includes the requester's IP — only valid from that IP | Corporate internal tools where users have fixed IPs |
| Auth cookie alongside URL | Require a valid session cookie in addition to the signed URL | Streaming media — prevents URL sharing across users |
| One-time URLs | Backend marks URL as used after first access — subsequent requests rejected | Highly sensitive documents, payment receipts |
IP binding and auth cookies are CloudFront features. One-time URLs require your own tracking layer (Redis key marked used on first hit).
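One-time URLs need only a used-marker keyed by URL id. A sketch with an in-memory set standing in for Redis (in a real deployment the mark would be an atomic Redis SETNX with a TTL matching the URL's expiry):

```python
# One-time URL redemption: first hit marks the id used, later hits rejected.

used = set()

def redeem(url_id):
    if url_id in used:
        return False  # already consumed, reject with 403
    used.add(url_id)  # atomic SETNX in a real deployment
    return True

first = redeem("doc-abc-0001")
second = redeem("doc-abc-0001")
```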
Interview answer: "Security is four layers. HTTPS encrypts everything in transit — baseline, non-negotiable. S3 server-side encryption (SSE-S3 by default, SSE-KMS for audit trails) protects objects at rest. The SharedFiles table is our ACL — no presigned URL is issued without a permission check. Signed URLs are bearer tokens, so we keep TTLs short (5–15 min). For higher-security use cases we use CloudFront signed URLs with IP binding or require an auth cookie alongside the URL."
With uploads, processing, delivery, and security covered, the final challenge is keeping files consistent across multiple devices.
File Sync Across Devices
How files stay consistent across a user's devices — the core problem behind Dropbox, Google Drive, and iCloud. The remote server is the source of truth. Every device syncs to and from it.
There are two directions to handle:
Device A ──→ Remote Server ──→ Device B
(local change uploaded) (change pushed/pulled down)
Local → Remote (upload on change)
A client-side sync agent runs in the background on the device. It monitors the local sync folder for changes using OS-specific file system events:
- Windows: FileSystemWatcher
- macOS: FSEvents
- Linux: inotify
When a change is detected the agent:
- Queues the modified file locally (so changes survive a crash or network drop)
- Uploads the file to the server via the presigned URL upload pattern
- Sends updated metadata to the backend (filename, size, updatedAt, checksum)
Conflict resolution — last write wins: If two devices edit the same file and both upload, the server keeps whichever write arrived last (based on updatedAt timestamp). The earlier write is overwritten. Simple and predictable. More sophisticated systems (Google Docs) use operational transforms to merge concurrent edits — that is out of scope here.
The tradeoff is silent data loss — if two users edit the same file concurrently, the earlier write is permanently overwritten with no warning and no recovery path. This is acceptable for many use cases but should be an explicit product decision, not an accidental default.
Chunking for large files: Don't upload the whole file on every change. Split the file into chunks, compute a fingerprint per chunk, and only upload chunks that changed. This is delta sync — a 1GB file with a small edit uploads one or two chunks, not 1GB. See Content-Defined Chunking below for why fixed-size chunks don't work well for this.
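Delta sync in miniature: hash chunks on both sides and upload only the indices that differ. This sketch uses fixed-size chunks for simplicity, which is exactly the limitation the Content-Defined Chunking discussion addresses:

```python
# Delta sync sketch: upload only chunks whose fingerprints changed.

import hashlib

def chunk_hashes(data, size):
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def changed_chunks(local, remote_hashes, size):
    local_hashes = chunk_hashes(local, size)
    return [i for i, h in enumerate(local_hashes)
            if i >= len(remote_hashes) or remote_hashes[i] != h]

original = b"A" * 8 + b"B" * 8 + b"C" * 8
edited   = b"A" * 8 + b"X" * 8 + b"C" * 8   # small edit in the middle

server_has = chunk_hashes(original, 8)       # what the server already stores
to_upload = changed_chunks(edited, server_has, 8)  # only the edited chunk
```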
Remote → Local (receiving changes)
Each client needs to know when another device (or another user) changes a file so it can pull the update down. Two approaches:
Option 1 — Polling:
Client periodically calls GET /files/changes?since={lastSyncTimestamp}. Server queries the DB for any files this user has access to where updatedAt > lastSyncTimestamp and returns the diff.
- Simple to implement
- Guaranteed to catch every change (stateless, timestamp-based)
- Slow to detect changes (latency = polling interval)
- Wastes bandwidth when nothing has changed
Option 2 — WebSocket or SSE: Server maintains a persistent connection per device/session and pushes a notification the moment a file changes. Client receives the event and pulls the updated file immediately.
- Near real-time updates
- No wasted requests when nothing changes
- More complex — connections drop, messages can be missed, server must track open connections
Hybrid approach (best of both):
One WebSocket (or SSE) connection per device session for real-time push. Plus periodic polling as a safety net to catch anything missed during a dropped connection.
[Device]
│
├── WebSocket connection to server (always open)
│ Server pushes: { fileId, updatedAt, changeType }
│ Device pulls the changed file immediately on receipt
│
└── Polling every few minutes as fallback
GET /files/changes?since={lastSyncTimestamp}
Catches any events missed during WebSocket downtime
The polling interval is infrequent (every 2–5 minutes) — it's a safety net, not the primary channel. The WebSocket handles the real-time feel; polling guarantees eventual consistency.
updatedAt as the sync cursor: Every file record has an updatedAt timestamp. The client stores the timestamp of its last successful sync. On each poll it sends that timestamp and gets back only the files that changed since then — not the full file list.
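Both halves of that cursor exchange can be sketched with a hypothetical Change record standing in for the DB row:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical change-feed entry: one row per file the user can access.
class Change {
    String fileId;
    Instant updatedAt;
    Change(String fileId, Instant updatedAt) {
        this.fileId = fileId;
        this.updatedAt = updatedAt;
    }
}

class ChangeFeed {
    // Server side of GET /files/changes?since=... — return only entries
    // modified strictly after the client's last sync cursor.
    static List<Change> since(List<Change> all, Instant cursor) {
        List<Change> diff = new ArrayList<>();
        for (Change c : all) {
            if (c.updatedAt.isAfter(cursor)) diff.add(c);
        }
        return diff;
    }

    // Client side: after applying the diff, advance the cursor to the
    // newest updatedAt seen, so the next poll starts from there.
    static Instant advanceCursor(Instant cursor, List<Change> diff) {
        Instant next = cursor;
        for (Change c : diff) {
            if (c.updatedAt.isAfter(next)) next = c.updatedAt;
        }
        return next;
    }
}
```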
Full sync flow — end to end:
[Device A edits file.txt]
│
├── FSEvents / FileSystemWatcher detects change
├── Sync agent queues file locally
├── POST /uploads → gets presigned S3 URL
├── Uploads file directly to S3
├── PATCH /files/{fileId} → updates metadata (updatedAt, checksum, s3Key)
│
▼
[Server stores new metadata in DB]
│
├── Looks up all devices/sessions for this user (and shared users)
├── Pushes change notification over WebSocket: { fileId, updatedAt }
│
▼
[Device B receives WebSocket event]
│
├── Sees fileId it has locally
├── Requests presigned S3 GET URL from backend
├── Downloads updated file directly from S3
└── Replaces local copy, updates lastSyncTimestamp
Interview answer: "The sync agent monitors the local folder using OS file system events (FSEvents on macOS, FileSystemWatcher on Windows). On change, it queues the file locally and uploads directly to S3 via presigned URL — never through our backend. Each device holds one WebSocket connection; the server pushes change notifications in real time. Clients also poll every few minutes as a reliability fallback to catch anything missed during a dropped connection. Conflict resolution is last-write-wins on updatedAt — simple and predictable, but a deliberate product tradeoff since concurrent edits silently overwrite earlier writes. For large files we use CDC chunking and only upload changed chunks."
Upload, Download, and Sync Optimizations
How to make each of the three operations as fast as possible, beyond the basics.
| Operation | Already covered | Additional optimizations |
|---|---|---|
| Download | CDN caches file closer to user | — |
| Upload | Chunking + parallel parts maximize bandwidth | Compression, CDC |
| Sync | Only upload changed chunks (delta sync) | CDC makes delta sync actually work |
Content-Defined Chunking (CDC) — making delta sync effective
Fixed-size chunking (e.g. every 5MB) has a fatal flaw for delta sync: if you insert a single byte near the beginning of a file, every chunk boundary after that point shifts. All subsequent chunks produce different fingerprints — even though their content barely changed. Delta sync detects them all as "changed" and re-uploads them. For a 1GB file with a one-byte edit, you could end up re-uploading nearly the whole file.
Content-Defined Chunking (CDC) fixes this by deriving chunk boundaries from the file's content rather than its byte position. A rolling hash (classically Rabin fingerprinting) scans the file byte by byte and declares a boundary whenever the hash value matches a chosen pattern (e.g. its lowest N bits are all zero). Because boundaries are anchored to content, a small edit only affects the chunks immediately surrounding the change — the vast majority of chunks remain identical and are skipped.
Fixed-size chunking — insert 1 byte near start:
Before: [chunk1][chunk2][chunk3][chunk4][chunk5]
After: [chunk1'][chunk2'][chunk3'][chunk4'][chunk5'] ← all shifted, all "changed"
CDC — insert 1 byte near start:
Before: [chunk1][chunk2][chunk3][chunk4][chunk5]
After: [chunk1'][chunk2][chunk3][chunk4][chunk5] ← only chunk1 affected
This is how Dropbox achieves efficient delta sync in practice. Without CDC, chunking is still useful for parallel uploads and resumability — but it doesn't give you true delta sync.
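A toy chunker makes the re-sync behavior concrete. This sketch uses a simple polynomial rolling hash over a 16-byte window instead of true Rabin fingerprinting, and omits the min/max chunk-size limits a production chunker would add:

```java
import java.util.ArrayList;
import java.util.List;

// Toy content-defined chunker: a rolling hash over a 16-byte sliding window
// declares a boundary whenever the low 6 bits of the hash are zero
// (~64-byte average chunks). Illustrative only — real systems use Rabin
// fingerprinting plus minimum and maximum chunk sizes.
class Chunker {
    static final int WINDOW = 16;
    static final long BOUNDARY_MASK = 0x3F;

    static List<String> chunks(byte[] data) {
        List<String> out = new ArrayList<>();
        long pow = 1;
        for (int i = 0; i < WINDOW; i++) pow *= 31;   // 31^WINDOW
        long hash = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            hash = hash * 31 + (data[i] & 0xFF);
            if (i >= WINDOW) hash -= pow * (data[i - WINDOW] & 0xFF); // drop outgoing byte
            if ((hash & BOUNDARY_MASK) == 0 || i == data.length - 1) {
                out.add(fingerprint(data, start, i + 1));
                start = i + 1;
            }
        }
        return out;
    }

    // Cheap per-chunk fingerprint (a real system would use SHA-256).
    static String fingerprint(byte[] data, int from, int to) {
        long h = 1125899906842597L;
        for (int i = from; i < to; i++) h = 31 * h + data[i];
        return Long.toHexString(h);
    }
}
```

Inserting one byte near the start changes only the first chunk or two; every later boundary re-aligns, because each boundary decision depends only on the 16 bytes currently in the window.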
Client-side compression — fewer bytes, faster transfers
Since uploads go directly from the client to S3 (no bytes through your backend), compression must happen entirely on the client. The client compresses before uploading; S3 stores the compressed bytes as-is. On download, the client decompresses after retrieving. The backend stays completely out of the data path.
When compression is worth it:
Not all files compress well. Compression only helps if the transfer time saved outweighs the CPU time spent compressing and decompressing.
| File type | Compression ratio | Worth it? |
|---|---|---|
| Text, code, CSVs, logs | High — 5GB → ~1GB | Yes, usually |
| Office documents (docx, xlsx) | Moderate | Depends on size |
| Already-compressed media (JPEG, MP4, MP3) | Near zero — 1–3% | No |
| ZIP, RAR, already-compressed archives | Near zero | No |
A reasonable default: compress if the file is text-based (MIME type text/*, application/json, application/xml, text/csv) and larger than 1MB. Skip compression for JPEG, PNG, MP4, MP3, ZIP, and other already-compressed formats. On mobile, also skip compression if the measured upload bandwidth exceeds 10 Mbps — the CPU cost outweighs the transfer saving.
Always compress before encrypting — encryption introduces artificial randomness into the data, making it nearly incompressible. Compressing after encrypting achieves almost nothing. The correct order is always: compress → encrypt → upload.
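That default reduces to a single predicate. The thresholds and MIME list here mirror the heuristic above and are illustrative, not canonical:

```java
import java.util.Set;

// Heuristic: compress text-like files over ~1 MB; skip already-compressed
// formats; on mobile, skip when the link is fast enough that CPU cost
// outweighs the transfer saving. All thresholds are illustrative.
class CompressionPolicy {
    static final Set<String> TEXT_LIKE = Set.of(
        "application/json", "application/xml", "text/csv");
    static final long MIN_SIZE_BYTES = 1_000_000;   // ~1 MB
    static final double FAST_LINK_MBPS = 10.0;

    static boolean shouldCompress(String mimeType, long sizeBytes,
                                  boolean onMobile, double uploadMbps) {
        boolean textLike = mimeType.startsWith("text/") || TEXT_LIKE.contains(mimeType);
        if (!textLike) return false;                    // JPEG, MP4, ZIP, ...
        if (sizeBytes < MIN_SIZE_BYTES) return false;   // too small to bother
        if (onMobile && uploadMbps > FAST_LINK_MBPS) return false;
        return true;
    }
}
```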
Compression algorithms:
| Algorithm | Compression ratio | Speed | Best for |
|---|---|---|---|
| Gzip | Good | Moderate | Universal support everywhere |
| Brotli | Better than Gzip for text | Moderate | Web, modern browsers (Chrome, Firefox, Edge, Safari) |
| Zstandard (zstd) | Comparable to Gzip, tunable | Very fast | Client-side compression — fast enough to not block UX |
For a Dropbox-style system where compression runs on the client device, Zstandard is a strong default — it compresses and decompresses significantly faster than Gzip at comparable ratios, and its speed/ratio tradeoff is tunable via compression level.
Video Streaming Pipeline
How to store, process, and deliver video at scale — the core building blocks behind YouTube, Netflix, and TikTok.
The Core Problem
Video has three properties that make it fundamentally different from other content:
- Size — a 10-minute 1080p video is 1–2 GB as recorded (far larger uncompressed). You cannot serve the original file to users.
- Heterogeneous clients — a 4K TV and a 3G phone need different quality levels of the same video.
- Seek — users skip to arbitrary timestamps. You cannot stream one giant file; you need seekable chunks.
The answer to all three: transcode into multiple resolutions, split into small segments, serve from CDN.
Upload and Transcoding Pipeline
The full flow from a creator uploading a video to a viewer playing it.
[Creator Device]
│ direct upload via presigned URL (no bytes through backend)
▼
[S3 Raw Bucket] ← private, original file, no CDN
│ S3 event notification on object creation
▼
[SQS / Kafka] ← job queue, one message per upload
│ fan-out: one job per output resolution
▼
[Transcoding Workers] ← stateless fleet, auto-scales (EC2 Spot or AWS Elemental)
│ FFmpeg converts raw → multiple resolutions + formats
│ Splits each output into 6-second segments
│ Writes HLS manifest (.m3u8) per resolution
▼
[S3 Processed Bucket] ← CDN-accessible
│
▼
[CloudFront CDN] ← viewers never touch S3 directly
Why fan-out per resolution: Transcoding is CPU-bound and slow (a 1-hour video at 4K can take 30–60 minutes). Parallelizing across workers — one per resolution — cuts wall-clock time proportionally. All workers read from the same raw file in S3.
Output resolutions produced:
| Resolution | Bitrate | Target |
|---|---|---|
| 360p | ~0.5 Mbps | 3G mobile |
| 480p | ~1 Mbps | 4G mobile |
| 720p | ~2.5 Mbps | Laptop / WiFi |
| 1080p | ~5 Mbps | Desktop / Smart TV |
| 1440p / 4K | ~15 Mbps | Premium / large screen |
Interview note on storage: You store one raw file and N processed versions. At YouTube scale (500 hours uploaded per minute), this is significant — tiered storage (S3 Standard for hot, S3 Glacier for old videos) is necessary. Long-tail content (old videos rarely watched) can be archived and transcoded on-demand.
Adaptive Bitrate Streaming (ABR)
The mechanism that lets the player switch quality mid-stream without buffering or user action.
The two dominant protocols:
| | HLS (HTTP Live Streaming) | DASH (Dynamic Adaptive Streaming over HTTP) |
|---|---|---|
| Origin | Apple (2009) | MPEG standard (2012) |
| Supported by | Required on iOS; near-universal support | Chrome, Android, Smart TVs, all non-Apple |
| Manifest format | .m3u8 (M3U playlist) | .mpd (XML) |
| Segment format | .ts (MPEG-TS) or fMP4 | fMP4 |
| Segment length | 6–10 seconds typical | 2–10 seconds |
| Use when | Must support iOS natively | Non-Apple / cross-platform |
In practice: Most platforms produce both HLS and DASH from the same segments. The player picks the appropriate manifest.
How ABR works — the manifest structure:
master.m3u8 (master manifest — lists all quality levels)
│
├── 360p.m3u8 (playlist for 360p — lists all segments)
│ ├── seg_000.ts
│ ├── seg_001.ts
│ └── ...
│
├── 720p.m3u8
│ ├── seg_000.ts ← same timestamps, different quality
│ └── ...
│
└── 1080p.m3u8
└── ...
Playback sequence:
- Player fetches master manifest — gets list of all quality levels
- Player picks starting quality (based on current bandwidth estimate)
- Player fetches that quality's segment playlist, downloads segment 0, starts playing
- After each segment, player re-estimates bandwidth: if download was fast → step up quality; if slow → step down
- Quality switches happen at segment boundaries — seamless, no seek interruption
Why segments matter for seek: User jumps to 4:30 → player calculates which segment contains that timestamp → fetches only that segment from CDN. No need to download the whole video or re-establish a stream.
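The per-segment decision in steps 2–4 reduces to one lookup: pick the highest rendition whose bitrate fits under the bandwidth estimate. Bitrates mirror the resolution table above; the 0.8 safety factor is an assumption, not a standard:

```java
// Per-segment ABR quality selection: choose the highest rendition whose
// bitrate fits under the measured bandwidth, with a safety factor so one
// slow segment doesn't cause a stall.
class AbrSelector {
    static final double[] BITRATES_MBPS = {0.5, 1.0, 2.5, 5.0, 15.0}; // 360p..4K
    static final double SAFETY = 0.8;   // illustrative headroom factor

    // Returns the index into BITRATES_MBPS to use for the next segment.
    static int pickQuality(double measuredMbps) {
        int pick = 0;   // the lowest rendition is always playable
        for (int i = 0; i < BITRATES_MBPS.length; i++) {
            if (BITRATES_MBPS[i] <= measuredMbps * SAFETY) pick = i;
        }
        return pick;
    }

    // Bandwidth estimate after each segment: bits transferred / time taken.
    static double measureMbps(long segmentBytes, double downloadSeconds) {
        return (segmentBytes * 8 / 1_000_000.0) / downloadSeconds;
    }
}
```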
CDN Strategy for Video
Video is the highest-egress workload a CDN will serve. How you cache matters.
Pull CDN (default for most platforms):
- CDN edge node caches segments on first request
- Cold start: first viewer in a region causes a cache miss back to S3 (1 round trip penalty)
- Works well for long-tail content — no point pre-loading videos nobody will watch
Push CDN (for viral/live content):
- Proactively push segments to all edge nodes before viewers arrive
- Used when you know content will spike immediately (live events, major releases)
- Netflix does this for popular titles — pre-positions content at edge nodes in each region the night before a big release
Segment caching vs full-file:
- Always cache at the segment level, never the full video file
- Segments are small (1–5 MB), fit cleanly in edge cache
- Seek-driven requests are cache-efficient — only the requested segment is fetched
Popular vs long-tail — the 90/10 rule:
- Top ~10% of videos drive ~90% of traffic
- Keep popular content at edge nodes with high TTL (1–7 days)
- Long-tail content: pull CDN with aggressive cache eviction, or serve from a closer regional origin
Reducing CDN costs:
- CloudFront signed URLs for premium/paid content — prevents sharing the URL
- Origin Shield: single regional cache tier in front of S3, absorbs cache misses from all edge nodes, reduces S3 GET costs and request volume
Metadata and State
The video file pipeline handles bytes. A separate metadata service handles everything else.
[Metadata DB — PostgreSQL]
- video_id, title, description, creator_id
- status: UPLOADING → PROCESSING → READY → FAILED
- duration, resolution list, thumbnail URL
- view count, like count (denormalized counters)
[Search Index — Elasticsearch]
- title, description, tags, transcript (if captions generated)
- Synced from metadata DB via CDC
[Recommendation / Feed Service]
- Reads from separate engagement DB (views, watch time, likes)
- Not part of the video pipeline
Status transitions: The transcoding worker updates status in the metadata DB when it finishes each stage. Frontend polls or receives a webhook when status hits READY. The CDN URL is only surfaced to viewers once status = READY — never expose a partial or raw URL.
What-If Scenarios for Video
Transcoding worker crashes mid-job:
At-least-once delivery from SQS — job is re-queued after visibility timeout expires. Worker must be idempotent (re-processing same video produces same segments). Use a job-level status in DB: PENDING → IN_PROGRESS → DONE. If a worker picks up an IN_PROGRESS job older than 2× expected duration, treat it as crashed and restart.
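The stale-job check can be sketched as follows; field names and the 2× threshold mirror the description above and are illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// A worker picking up an IN_PROGRESS job older than 2× its expected
// duration treats the previous worker as crashed and restarts the job.
// Safe only because transcoding is idempotent: re-processing the same
// video produces the same segments.
class TranscodeJob {
    String videoId;
    String status;              // PENDING, IN_PROGRESS, DONE
    Instant startedAt;
    Duration expectedDuration;

    TranscodeJob(String videoId, String status, Instant startedAt, Duration expected) {
        this.videoId = videoId;
        this.status = status;
        this.startedAt = startedAt;
        this.expectedDuration = expected;
    }

    boolean isStale(Instant now) {
        if (!status.equals("IN_PROGRESS")) return false;
        Duration age = Duration.between(startedAt, now);
        return age.compareTo(expectedDuration.multipliedBy(2)) > 0;
    }
}
```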
Upload is corrupt or invalid: First stage of pipeline is a validation worker: check container format, codec, minimum duration. On failure, update status to FAILED, emit event, notify creator. Do not proceed to transcoding.
CDN cache stale after re-transcode: If a video is re-transcoded (quality fix, policy violation edit), old segments may be cached at edge nodes. Trigger a CDN invalidation on the affected path prefix. For HLS, the manifest file is the key — once the manifest is invalidated and re-fetched, players will pull new segments.
Storage costs blow up: Apply tiered storage: move videos with 0 views in 90 days to S3 Glacier. On-demand transcode from raw if a Glaciered video suddenly gets requested (rare, acceptable latency for cold content). Delete raw files after transcoding is confirmed — you can always re-transcode from the raw if needed, but raw files are 10–50× larger than processed output.
Interview one-liner: "Creator uploads directly to S3 via presigned URL. S3 triggers a transcoding job queue. Workers transcode in parallel to multiple resolutions, split into 6-second HLS segments, write to processed S3 behind CloudFront. The player fetches a manifest listing all quality levels and switches resolution per segment based on bandwidth — that's adaptive bitrate. Viewers never touch origin storage."
Financial & Payments Patterns
Domain-specific patterns for guaranteeing payment integrity, consistent state machines, and reliable event delivery to merchants.
PaymentIntent State Machine
A PaymentIntent is a high-level payment request that tracks the full lifecycle of a charge. It owns the state machine and enforces idempotency for retries. Transactions are the individual money movements it generates — one PaymentIntent can have many Transactions (retries, partial payments, refunds).
State machine:
created → authorized → captured
↘ cancelled
↘ failed
captured → refunded (partial or full)
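One way to read that diagram is as an explicit transition table — any transition not listed is rejected. The exact branch set here is an interpretation of the diagram, not a spec:

```java
import java.util.Map;
import java.util.Set;

// PaymentIntent state machine as a transition table. Rejecting unlisted
// transitions is what makes replayed or out-of-order updates safe.
class PaymentIntentStates {
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "created",    Set.of("authorized", "cancelled", "failed"),
        "authorized", Set.of("captured", "cancelled", "failed"),
        "captured",   Set.of("refunded"),
        "cancelled",  Set.of(),      // terminal
        "failed",     Set.of(),      // terminal
        "refunded",   Set.of()       // terminal
    );

    static boolean canTransition(String from, String to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```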
Why separate PaymentIntent from Transaction:
- PaymentIntent = merchant-facing, stable reference for the business operation
- Transaction = polymorphic money-movement record (charge, capture, refund, reversal, transfer)
- Separation simplifies reasoning about state, audits, and retries without diving into ledger details
Idempotency enforcement: PaymentIntent ID is the idempotency key for the charge. Retrying a payment attempt against the same PaymentIntent returns the existing result — no double charge.
Two-Phase Event Model
Production pattern (used by Stripe) for guaranteeing consistency between transaction state and downstream event consumers.
NFR triggers: Durability, consistency, financial integrity.
The two events per transaction:
Transaction Created Event — emitted when the transaction service begins processing, before the DB write completes. If emission fails → transaction enters locked/failed state, can be retried safely.
Transaction Completed Event — emitted after the DB write successfully completes. If emission fails → transaction enters locked state, further updates blocked until completion event is successfully emitted.
Why two phases enable smart retries:
- On retry, compare the "created" event data with actual DB state
- If DB write already occurred → only re-emit the completion event (no full transaction retry needed)
- Prevents double charges when the network drops between the DB write and the response
Production reality: Most missing "completed" events are due to external payment network timeouts before the DB write — not emission failures after successful writes. The two-phase model handles both cases cleanly.
Webhook Delivery
Server-to-server HTTP POST notifications sent to a merchant's configured endpoint when payment state changes. Different from WebSockets/SSE — webhooks are server-to-server, not server-to-browser.
NFR triggers: Availability (reliable delivery), fault tolerance (retry on failure).
How it works:
- CDC streams DB changes to Kafka
- Webhook service consumes events, checks if merchant has subscribed to that event type
- Signs payload with shared secret (HMAC) so merchant can verify authenticity
- POSTs to merchant's configured HTTPS endpoint
- Records delivery attempt with status
Retry strategy — exponential backoff:
Attempt 1: immediate
Attempt 2: 5 seconds
Attempt 3: 25 seconds
Attempt 4: 125 seconds
...up to ~1 hour max
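That schedule is a plain geometric progression — the delay before attempt n is 5^(n−1) seconds, capped at an hour. A minimal sketch:

```java
// Exponential backoff matching the schedule above: attempt 1 is immediate,
// then 5s, 25s, 125s, ... capped at one hour.
class WebhookBackoff {
    static final long CAP_SECONDS = 3600;

    static long delaySeconds(int attempt) {
        if (attempt <= 1) return 0;            // first attempt is immediate
        long delay = 1;
        for (int i = 1; i < attempt; i++) {
            delay *= 5;
            if (delay >= CAP_SECONDS) return CAP_SECONDS;
        }
        return delay;
    }
}
```

Production schedulers usually add random jitter on top so that a burst of failures doesn't retry in lockstep.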
Merchant side: Must return 2xx to acknowledge receipt. Any non-2xx → webhook service treats as failure and retries. Merchant must handle idempotency (same event may be delivered more than once on retry).
Payload signing: Webhook payload signed with HMAC using shared secret. Merchant verifies signature before processing — prevents spoofed webhook events.
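A minimal sketch of both sides of the signature check using the JDK's HmacSHA256. Header names and encoding details (hex vs base64, timestamp binding) vary by provider and are omitted:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Provider signs the raw payload bytes with the shared secret; the
// merchant recomputes the signature and compares before trusting the event.
class WebhookSigner {
    static byte[] sign(String payload, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);  // HmacSHA256 ships with every JDK
        }
    }

    // Merchant side: MessageDigest.isEqual is constant-time, which avoids
    // leaking how many leading bytes of a forged signature were correct.
    static boolean verify(String payload, String secret, byte[] signature) {
        return MessageDigest.isEqual(sign(payload, secret), signature);
    }
}
```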
Production considerations: Idempotency keys on delivery, webhook logs for debugging, merchant dashboard to replay failed webhooks, payload versioning, adaptive rate limiting to avoid overwhelming slow merchant endpoints.
Reconciliation
Periodically comparing your internal transaction records against the payment processor's records to detect discrepancies.
NFR triggers: Durability, consistency (payments, banking).
Why it's needed: Networks fail, bugs happen, systems crash mid-transaction. Your DB might say a payment succeeded while the processor says it failed, or vice versa.
How it works:
- At end of day (or hourly), fetch all transactions from payment processor API.
- Compare against your internal records row by row.
- Flag mismatches for investigation or auto-correction.
- Retry failed-but-charged transactions. Refund charged-but-not-delivered.
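The row-by-row comparison in step 2 is essentially a full outer join on transaction id, sketched here over in-memory maps of id → status (the record shape is illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Flag every transaction where the two sides disagree: different status,
// missing internally, or missing at the processor.
class Reconciler {
    static List<String> findMismatches(Map<String, String> internal,
                                       Map<String, String> processor) {
        List<String> flagged = new ArrayList<>();
        Map<String, String> all = new HashMap<>(internal);
        processor.forEach(all::putIfAbsent);        // union of both id sets
        for (String txId : all.keySet()) {
            String ours = internal.get(txId);       // null = missing on our side
            String theirs = processor.get(txId);    // null = missing at processor
            if (ours == null || theirs == null || !ours.equals(theirs)) {
                flagged.add(txId);
            }
        }
        return flagged;
    }
}
```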
Real example: Stripe, PayPal, and every bank runs nightly reconciliation jobs. Airline booking systems reconcile seat inventory against payment records after every batch.
Saga Pattern (Distributed Transactions)
Manages a multi-step transaction across services by breaking it into a sequence of local transactions, each with a compensating rollback action if a later step fails.
NFR triggers: Consistency across microservices, no distributed locks, durability in distributed systems.
Why not two-phase commit (2PC): 2PC locks resources across all participants until all agree — too slow and brittle at scale. One slow service blocks the entire transaction. Saga is async and each step is independent.
The Two Styles: Choreography vs Orchestration
These are the two ways to implement Saga. The choice determines whether you write the coordination logic yourself or hand it to a framework.
Choreography — services react to events, no central coordinator:
Each service listens to Kafka, does its job, publishes the next event.
No one is "in charge" — the workflow emerges from the event chain.
Order Service → publishes OrderCreated
Inventory Service → listens, reserves stock → publishes StockReserved
Payment Service → listens, charges card → publishes PaymentCharged
Shipping Svc → listens, creates shipment → publishes ShipmentCreated
If Payment fails:
Payment Service → publishes PaymentFailed
Inventory Svc → listens, releases stock → publishes StockReleased
Order Service → listens, cancels order → publishes OrderCancelled
You write this yourself — each service just has Kafka consumers and producers. No framework needed.
Orchestration — a central coordinator tells each service what to do:
Saga Orchestrator (your coordinator service or Temporal workflow):
Step 1: call Inventory Service → reserve stock
Step 2: call Payment Service → charge card
Step 3: call Shipping Service → create shipment
If Step 2 fails:
compensate Step 1: call Inventory Service → release stock
mark order as failed
The orchestrator owns the workflow state. Services are dumb — they just do what they're told and report back.
Do You Write the Code Yourself?
Choreography — yes, you write it yourself. Each service has a Kafka consumer for incoming events and a producer for outgoing events. The "saga" is just the sum of all these listeners.
// Inventory Service — choreography style
@KafkaListener(topics = "order-created")
public void onOrderCreated(OrderCreatedEvent event) {
boolean reserved = inventory.reserve(event.getSkuId(), event.getQty());
if (reserved) {
publisher.publish("stock-reserved",
new StockReservedEvent(event.getOrderId(), event.getSkuId()));
} else {
publisher.publish("stock-reservation-failed",
new StockFailedEvent(event.getOrderId(), "OUT_OF_STOCK"));
}
}
@KafkaListener(topics = "payment-failed")
public void onPaymentFailed(PaymentFailedEvent event) {
// compensating transaction — undo the reservation
inventory.release(event.getOrderId());
publisher.publish("stock-released",
new StockReleasedEvent(event.getOrderId()));
}
No framework. Just Kafka listeners and a clear rule: every service that does work must also listen for failure events and undo its work.
Orchestration — you either write a coordinator or use a framework.
Hand-rolled orchestrator:
// Your own coordinator service — tracks state in a DB table
public void runOrderSaga(Order order) {
SagaState state = sagaRepo.create(order.getId(), STARTED);
try {
inventoryClient.reserve(order);
state.update(INVENTORY_RESERVED);
paymentClient.charge(order);
state.update(PAYMENT_CHARGED);
shippingClient.createShipment(order);
state.update(COMPLETED);
} catch (PaymentException e) {
inventoryClient.release(order); // compensate
state.update(FAILED);
} catch (ShippingException e) {
paymentClient.refund(order); // compensate
inventoryClient.release(order); // compensate
state.update(FAILED);
}
}
This works but has a fatal flaw: if the coordinator crashes between steps, you have no idea where it left off. You need to rebuild state from the DB and figure out what to retry. This is exactly the problem Temporal solves — it persists every step automatically.
Compensating Transactions — What They Actually Are
A compensating transaction is not a rollback. It's a new forward action that undoes the business effect of a previous step.
Step done: Compensation:
───────────────────────────────────────────────────────
Reserve inventory Release inventory
Charge payment Issue refund
Create shipment Cancel shipment + notify carrier
Send confirmation email Send cancellation email
Notify driver (Uber) Un-notify driver, reassign ride
Key constraint: Compensations must be idempotent. The coordinator may call them more than once (crash + retry). release(orderId) called twice must be safe — second call is a no-op if stock is already released.
Failure Modes — What Can Go Wrong
1. Compensating transaction itself fails:
Payment charged → Shipping fails → try to refund → refund call times out
Fix: retry with backoff. Refunds are idempotent (refund twice = still one refund).
If retries exhaust, write to a dead-letter topic for manual review.
Never silently swallow a failed compensation — money is involved.
2. Service crashes mid-saga (choreography):
Inventory reserved → service crashes → PaymentCharged event never heard
Fix: outbox pattern on every step. Event is in the outbox before the crash,
worker publishes it on restart. Saga continues where it left off.
3. Out-of-order events:
StockReleased arrives before StockReserved (network reordering)
Fix: idempotency + state machine. Each service tracks its current state per orderId.
If StockReleased arrives and state is not RESERVED, ignore it.
4. Partial visibility to users:
Saga is mid-flight — inventory reserved but payment not yet charged.
User queries order status — what do you show?
Fix: show intermediate states explicitly. "Order processing — payment pending."
Never show CONFIRMED until the saga fully completes.
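The fix for failure mode 3 can be sketched as a per-order state guard. Event names mirror the choreography example above; the class itself is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Each service tracks its current state per orderId and ignores events
// that don't apply to that state — making reordered and duplicated
// events harmless.
class InventoryStateGuard {
    enum State { NONE, RESERVED, RELEASED }

    final Map<String, State> byOrder = new HashMap<>();

    // Returns true only when the event was actually applied.
    boolean onStockReserved(String orderId) {
        if (byOrder.getOrDefault(orderId, State.NONE) != State.NONE) return false;
        byOrder.put(orderId, State.RESERVED);
        return true;
    }

    boolean onStockReleased(String orderId) {
        // StockReleased before StockReserved (network reordering) → ignore.
        if (byOrder.getOrDefault(orderId, State.NONE) != State.RESERVED) return false;
        byOrder.put(orderId, State.RELEASED);
        return true;
    }
}
```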
Choreography vs Orchestration — When to Use Each
| | Choreography | Orchestration |
|---|---|---|
| Who coordinates | Nobody — services react to events | Central coordinator or framework |
| Code ownership | Each team owns their service's listeners | One team owns the orchestrator |
| Visibility | Hard — workflow is implicit in event chains | Easy — workflow is explicit in one place |
| Coupling | Services coupled to event schemas | Services coupled to orchestrator API |
| Failure tracing | Hard — must correlate events across services | Easy — orchestrator has full state |
| Best for | Simple 2–3 step workflows, small teams | Complex workflows, many services, long-running |
Rule of thumb: Start with choreography. Switch to orchestration (or Temporal) when you find yourself building a "where is this saga at?" dashboard.
Real Systems — Choreography vs Orchestration
Choreography fits when the flow is a linear chain and each service only needs to know what happened, not what to do next.
| System | Flow | Why choreography fits |
|---|---|---|
| E-commerce order pipeline (simple) | Order → Inventory → Payment → Shipping → Email | Linear chain, no branching, each service reacts to the previous event |
| LinkedIn / Twitter feed fanout | User posts → Fanout → Notification → Analytics | Each service listens independently, no one needs to know the others exist |
| IoT sensor pipeline | Sensor event → Validate → Aggregate → Store → Alert | High throughput, simple chain, no compensation needed |
| Audit / event logging | Action → audit log, analytics, search index each react | Classic pub/sub — multiple independent consumers, no coordination |
| Simple notification system | User action → Email service, Push service, SMS service each listen | No coordination needed between notification channels |
Orchestration fits when the flow has branching, timeouts, human delays, or you need exact visibility into where the workflow is.
| System | Flow | Why orchestration fits |
|---|---|---|
| Uber ride matching | Match driver → 10s timeout → try next → try next → give up | Timeout + fallback branching — Temporal was built for exactly this |
| Stripe payment processing | Charge → fraud check → settle → webhook | Exact state required at all times; dispute resolution has human delays (bank response) |
| Amazon order fulfillment | Reserve → pick → pack → ship → notify | Out-of-stock branches, address invalid branches, carrier failure — too complex for choreography |
| Airbnb booking | Reserve dates → charge → notify host → wait for host approval → confirm guest | Human delay up to 24h — choreography cannot model waiting for a human |
| DoorDash / food delivery | Place order → restaurant accepts (timeout) → assign driver (timeout) → pickup → delivery | Multiple human-in-the-loop timeouts at every step |
| Bank loan approval | Credit check → risk scoring → manual underwriter review → approval → disbursement | Days-long workflow, human steps, full audit trail required |
| Travel booking (flight + hotel + car) | All three must succeed or all roll back | Orchestrator coordinates all-or-nothing across three independent services |
| Healthcare claim processing | Submit → eligibility → prior auth → adjudication → payment | Complex branching at every step, weeks-long timeline |
The signal in an interview:
| Interviewer says... | Reach for... |
|---|---|
| "notify multiple services when X happens" | Choreography |
| "user has N seconds to do Y, then try Z, then give up" | Orchestration + Temporal |
| "all steps must succeed or all roll back" | Orchestration |
| "we need to know exactly what state this order is in" | Orchestration |
Frameworks Available
You do NOT have to hand-roll Saga in production. Mature frameworks exist.
Temporal (orchestration — most powerful):
// Workflow definition — Temporal handles crash recovery, retries, state
@WorkflowInterface
public interface OrderSaga {
@WorkflowMethod
void processOrder(Order order);
}
public class OrderSagaImpl implements OrderSaga {
private final InventoryActivities inventory = Workflow.newActivityStub(...);
private final PaymentActivities payment = Workflow.newActivityStub(...);
private final ShippingActivities shipping = Workflow.newActivityStub(...);
@Override
public void processOrder(Order order) {
try {
inventory.reserve(order);
payment.charge(order);
shipping.createShipment(order);
} catch (ActivityFailure e) {
inventory.release(order); // compensate
payment.refund(order); // compensate
throw e;
}
}
}
// Temporal persists every step. Crash between reserve and charge → resumes at charge.
// No lost saga state. No hand-rolled state table.
Used by: Uber, Netflix, Stripe, Airbnb.
AWS Step Functions (orchestration — managed, AWS-native):
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:ReserveInventory",
"Next": "ChargePayment",
"Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CompensateInventory" }]
},
"ChargePayment": { ... },
"CompensateInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:ReleaseInventory",
"Next": "FailOrder"
}
}
}
No code for the coordinator — it's JSON/YAML. AWS manages state persistence. Used by: Amazon order fulfillment, DoorDash.
Axon Framework (choreography + orchestration — Java-native):
- Built for Java/Spring. Handles both Saga styles via annotations: @SagaEventHandler for choreography, @CommandHandler for orchestration.
- Built-in saga state persistence to any DB.
- Used in enterprise Java shops (banking, insurance).
@Saga
public class OrderSaga {
@SagaEventHandler(associationProperty = "orderId")
public void on(OrderCreatedEvent event) {
commandGateway.send(new ReserveInventoryCommand(event.getOrderId()));
}
@SagaEventHandler(associationProperty = "orderId")
public void on(InventoryReservationFailedEvent event) {
commandGateway.send(new CancelOrderCommand(event.getOrderId()));
SagaLifecycle.end();
}
}
Eventuate Tram (choreography — lightweight):
- Open source, Spring-native.
- Handles Saga state machine, compensating transactions, and the outbox pattern automatically.
- Lower learning curve than Temporal for simple flows.
- Used in teams that want Saga without the Temporal operational overhead.
Conductor (orchestration — Netflix OSS):
- Netflix's open-source workflow engine.
- Define workflows as JSON, workers are microservices that poll for tasks.
- Good fit if you're already Netflix-stack (Java, Spring).
Framework decision:
Simple 2–3 step saga, small team → hand-roll choreography with Kafka
Complex workflow, crash recovery → Temporal
AWS-native, Lambda-based → Step Functions
Java/Spring enterprise shop → Axon Framework
Want outbox handled automatically → Eventuate Tram
Real examples:
- Uber — built Cadence (Temporal's predecessor) specifically because hand-rolled saga coordination broke under ride-matching complexity.
- Amazon — Step Functions orchestrate order fulfillment across Warehouse, Payment, Shipping, and Notification services.
- Stripe — Temporal for payment retry and dispute workflows where human delays (bank responses, manual review) make duration unpredictable.
Outbox Pattern
Guarantees that a database write and a Kafka publish always stay in sync — even if the service crashes between them.
NFR triggers: Consistency between DB state and event stream, at-least-once delivery, audit trail durability.
The problem — two systems, no shared transaction:
Without outbox:
Step 1: Write payment to ledger DB ← succeeds
Step 2: Publish to Kafka ← service crashes here
Payment recorded. Order Service never got the event. Order stays in limbo — user paid but order not confirmed.
Swap the order?
Step 1: Publish to Kafka ← succeeds
Step 2: Write payment to ledger DB ← service crashes here
Opposite problem — Order Service confirmed the order, no payment record exists. Worse: confirmed order, no audit trail.
Root cause: you cannot make a DB write and a Kafka publish atomic. They are two completely different systems.
The fix — convert the Kafka publish into a DB write:
Single DB transaction:
1. Write payment to ledger ← your real data
2. Write event to outbox table ← a "to-do" note for the worker
Both in the same transaction → both succeed or both fail → always consistent.
Then the worker handles Kafka separately:
Worker (runs every 5s):
Read unpublished outbox rows → publish to Kafka → mark published
If the worker crashes mid-publish, the outbox row stays at published=false and the next run retries. This is at-least-once delivery — you may publish twice, but you will always publish. Consumers must handle duplicates (idempotency key).
The full guarantee:
DB write succeeded → Kafka will eventually get the message (even after retries)
DB write failed → Kafka will never get the message (outbox row never created)
Payment processing sequence — correct order:
1. Receive payment request
2. Check idempotency key ← already processed? return cached result
3. Call payment gateway ← Visa / Mastercard / PayPal
4. Gateway returns APPROVED
5. Write to ledger DB ← record the outcome
6. Write to outbox table ← same transaction as step 5
7. Commit
8. Return success to caller
The ledger write happens AFTER the gateway responds — not before.
The gateway is the source of truth for whether money actually moved.
What if the service crashes between gateway response and ledger write (steps 4–5)?
The idempotency key rescues you. Next retry:
User retries with same idempotency key
→ Idempotency check: key not found (crashed before writing)
→ Call gateway again with same reference ID
→ Gateway returns "already processed, here's the result" ← gateways support this
→ Write to ledger + outbox, return success
All three gateway outcomes and why each matters:
- APPROVED → write AUTHORIZED to ledger + outbox atomically → cache idempotency key → return success
- DECLINED → write DECLINED to ledger (audit trail) → cache key (retry returns same result, no gateway hit) → return declined
- TIMEOUT → write PENDING to ledger → return error "do not retry with new key". A background reconciliation job queries the gateway by reference ID to resolve it later.
Never tell a user to retry on timeout with a new key — they may be charged twice.
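The three outcomes above can be sketched as a small decision function. This is an illustrative model, not a real gateway SDK — `GatewayResult`, `LedgerStatus`, and `Decision` are hypothetical names:

```java
// Sketch: map gateway outcomes to ledger status + idempotency/outbox behavior.
// All type names here are illustrative, not part of any real payment SDK.
public class GatewayOutcomeHandler {

    public enum GatewayResult { APPROVED, DECLINED, TIMEOUT }
    public enum LedgerStatus  { AUTHORIZED, DECLINED, PENDING }

    public static class Decision {
        public final LedgerStatus ledgerStatus;
        public final boolean cacheIdempotencyKey; // safe to short-circuit retries?
        public final boolean writeOutboxEvent;    // notify downstream consumers?
        Decision(LedgerStatus status, boolean cacheKey, boolean outbox) {
            this.ledgerStatus = status;
            this.cacheIdempotencyKey = cacheKey;
            this.writeOutboxEvent = outbox;
        }
    }

    public static Decision handle(GatewayResult result) {
        switch (result) {
            case APPROVED: // money moved — record it, publish it, cache the key
                return new Decision(LedgerStatus.AUTHORIZED, true, true);
            case DECLINED: // no money moved; keep the audit trail, cache the key
                return new Decision(LedgerStatus.DECLINED, true, false);
            default:       // TIMEOUT — outcome unknown; never cache the key,
                           // leave PENDING for the reconciliation job
                return new Decision(LedgerStatus.PENDING, false, false);
        }
    }
}
```

The key asymmetry: TIMEOUT is the only outcome where the idempotency key is not cached, because caching it would freeze an unknown result as if it were final.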
Implementation — polling variant (no CDC required):
```java
// Outbox table schema:
// id, topic, partition_key, payload, published, created_at

// Write side — same transaction as your domain write:
@Transactional
public void processPayment(PaymentRequest req) {
    ledger.save(new LedgerEntry(req.getPaymentId(), AUTHORIZED, req.getAmount()));
    outbox.save(new OutboxEvent("payment-events", req.getPaymentId(),
        toJson(new PaymentAuthorizedEvent(req.getPaymentId()))));
}

// Worker — polls every 5 seconds:
@Scheduled(fixedDelay = 5000)
public void publishOutbox() {
    List<OutboxEvent> pending = outbox.findByPublishedFalseOrderByCreatedAt();
    for (OutboxEvent event : pending) {
        try {
            producer.send(new ProducerRecord<>(event.getTopic(),
                event.getPartitionKey(), event.getPayload())).get(); // block until Kafka acks
            event.setPublished(true);
            outbox.save(event);
        } catch (Exception e) {
            log.error("Outbox publish failed, will retry: {}", event.getId(), e);
        }
    }
}
```
Implementation — CDC variant (Debezium, no polling worker):
DB write → Debezium reads binlog → publishes outbox row changes to Kafka
Pros: no polling worker to operate, lower latency (event-driven not scheduled)
Cons: Debezium is another piece of infrastructure to run and monitor
Use polling when: you want simplicity, Debezium isn't already in your stack
Use CDC when: you need sub-second latency, Debezium is already running
Why consumers must be idempotent:
The worker publishes → Kafka acks → worker crashes before marking published=true. Next run publishes again. The consumer receives the same event twice. Without idempotency, the order gets confirmed twice or the inventory decremented twice. With an idempotency key on the consumer side, the second delivery is a no-op.
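A minimal sketch of that consumer-side guard, assuming an in-memory set where production code would use a DB table or Redis set checked in the same transaction as the side effect:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: deduplicating consumer for at-least-once delivery.
// A HashSet stands in for the durable processed-event store.
public class IdempotentConsumer {
    private final Set<String> processedEventIds = new HashSet<>();
    private int ordersConfirmed = 0;

    // Returns true if the event was applied, false if it was a duplicate no-op.
    public boolean onPaymentAuthorized(String eventId) {
        if (!processedEventIds.add(eventId)) {
            return false; // already processed — redelivery is a no-op
        }
        ordersConfirmed++; // the real side effect (confirm order, decrement stock, ...)
        return true;
    }

    public int ordersConfirmed() { return ordersConfirmed; }
}
```

Delivering the same event ID twice confirms the order exactly once, which is what turns the outbox's at-least-once delivery into effectively-once processing.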
Outbox vs direct Kafka publish:
| | Direct Publish | Outbox Pattern |
|---|---|---|
| Atomicity | None — DB and Kafka are independent | Guaranteed — outbox write is part of DB transaction |
| Crash safety | Crash between DB write and publish = lost event | Crash after DB commit = event retried by worker |
| Delivery | At-most-once (lost if crash) | At-least-once (duplicates possible, handled by idempotency) |
| Complexity | Simple | Outbox table + worker (or Debezium) |
| When to use | Acceptable to lose events (metrics, logs) | Event loss is unacceptable (payments, orders, inventory) |
Real examples:
- Stripe — outbox pattern for every payment event; worker retries ensure webhook delivery and reconciliation services always receive events.
- Walmart — POS sale writes to inventory DB + outbox in one transaction; CDC variant with Debezium publishes inventory changes to Elasticsearch and cache invalidation topics.
Durable Execution — Temporal and AWS Step Functions
Manages long-running, multi-step workflows that must survive service crashes, human delays, and arbitrary timeouts — resuming exactly where they left off.
How it differs from Saga:
- Saga — handles rollback when a step fails. State lives in your DB + event bus. You write the compensation logic.
- Durable Execution — handles everything: timeouts, retries, state persistence, resume after crash. The framework owns the workflow state. You write business logic; the framework handles fault tolerance.
They're not mutually exclusive — a Temporal workflow can implement a Saga internally. But if you're reaching for Temporal, you likely don't need to hand-roll Saga logic.
The signal — human-in-the-loop processes:
Any time a human can introduce an unpredictable delay, you have a durable execution problem:
- Driver has 10 seconds to accept a ride request
- User has 30 minutes to confirm an email
- Manager has 48 hours to approve an expense
- Payment processor takes up to 5 minutes to respond
A standard request/response or even a Kafka-based flow can't cleanly handle "wait up to 10 seconds, then try the next driver, then try the next, then give up" without significant custom state management. Durable execution handles this natively.
How Temporal works:
Temporal persists every step of a workflow to its own database. If the worker running your workflow crashes mid-execution, another worker picks it up and replays from the last committed step — no state lost.
Temporal Workflow: MatchRide(rideRequest)
1. candidates = findNearbyDrivers(rideRequest.location) // Activity
2. for each driver in candidates:
result = sendRequestWithTimeout(driver, timeout=10s) // Activity + built-in timeout
if result == ACCEPTED:
return assignRide(rideRequest, driver) // Activity
// timeout or DECLINED → loop continues automatically
3. if no driver found: cancelRide(rideRequest) // Activity
If the worker crashes between steps 2 and 3, Temporal replays from step 2 — no dropped ride request, no double-assignment. The 10-second timeout is declared in code, not implemented with Redis TTL + a background job.
Temporal vs AWS Step Functions:
| | Temporal | AWS Step Functions |
|---|---|---|
| Hosting | Self-hosted or Temporal Cloud (managed) | Fully managed (AWS) |
| Workflow definition | Code (Go, Java, Python, TypeScript) | JSON/YAML state machine |
| Flexibility | High — full programming language | Limited — state machine model |
| Latency | Low (~ms per activity) | Higher (~100ms+ per state transition) |
| Pricing | Open source free / Temporal Cloud per action | AWS pay-per-state-transition |
| Best for | Complex business logic, human delays, long-running | Simple AWS-native orchestration, Lambda chains |
When Temporal is the right answer:
- Multi-step workflows with human delays (ride acceptance, checkout abandonment, approval flows)
- Workflows that run for minutes, hours, or days
- Complex branching and retry logic you'd otherwise build with cron jobs + state tables
- Mission-critical flows where dropped requests directly cost revenue (ride-sharing, payments, order fulfillment)
When it's overkill:
- Simple timeout → retry loop with no branching — Redis TTL + a background job is sufficient and far simpler
- Short-lived operations (< 1 second) — just use async queues
- Teams unfamiliar with the framework — Temporal has a learning curve; don't introduce it for trivial orchestration
The nuance for interviews: Redis TTL handles the lock expiry problem (driver didn't respond, lock auto-releases). But who retries with the next driver? If it's just "try the next driver in a list," a simple background job works. If the workflow has branching, multiple fallback levels, SLA tracking, and audit requirements — that's when you reach for Temporal. Name Redis first as the simpler solution, then offer Temporal as the escalation path if the interviewer pushes on fault tolerance.
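The "simple background job" path can be sketched as a plain sequential-offer loop. This is a toy model under stated assumptions: `SequentialOfferMatcher` is a hypothetical name, and a callback stands in for the Redis-TTL wait on each driver's response:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Sketch of the non-Temporal alternative: offer the ride to each driver in
// turn, give each a fixed window to accept, move on otherwise. In production
// the window would be a Redis key with a TTL and this loop a background job;
// here `offers` answers "did this driver accept within the timeout?".
public class SequentialOfferMatcher {

    public static Optional<String> match(List<String> drivers,
                                         long timeoutMillis,
                                         BiPredicate<String, Long> offers) {
        for (String driver : drivers) {
            if (offers.test(driver, timeoutMillis)) {
                return Optional.of(driver); // first acceptance wins
            }
            // timeout or decline → fall through to the next candidate
        }
        return Optional.empty(); // nobody accepted → cancel the ride
    }
}
```

The moment this loop needs crash recovery, branching fallbacks, or SLA tracking, the state you would bolt on is exactly what Temporal persists for you.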
Real examples:
- Uber — built Cadence (open-sourced 2019); Temporal forked it and is now the leading implementation. Used for ride matching, driver onboarding, and surge pricing workflows.
- Netflix — uses Temporal for content encoding workflows (multi-step, hours-long)
- Stripe — uses workflow orchestration for payment retry and dispute resolution flows
- AWS — Step Functions orchestrate Lambda-based order fulfillment, image processing pipelines
Two-Phase Commit (2PC)
Coordinates a transaction across multiple independent databases or services so they all commit or all abort — atomically, in one shot.
NFR triggers: Strong consistency across multiple databases, true atomicity across services.
The Two Phases
Phase 1 — Prepare (voting):
Coordinator → "Can you commit transaction T?" → Participant A
Coordinator → "Can you commit transaction T?" → Participant B
Coordinator → "Can you commit transaction T?" → Participant C
Each participant:
- Writes the transaction to its WAL (durable on disk)
- Acquires all necessary locks
- Replies YES or NO
At this point resources are LOCKED. No other transaction can touch them.
Phase 2 — Commit or Abort:
All said YES → Coordinator sends COMMIT to all
Any said NO → Coordinator sends ABORT to all
Each participant executes and releases locks.
Coordinator logs the final decision.
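The two phases above can be modeled as a toy in-process coordinator. This sketch shows only the voting and decision logic; real 2PC adds WAL writes, lock management, and network calls, and `Participant` is an illustrative interface, not a standard API:

```java
import java.util.List;

// Toy model of 2PC's decision logic with in-process participants.
public class TwoPhaseCommit {

    public interface Participant {
        boolean prepare(String txId); // phase 1: write WAL, take locks, vote YES/NO
        void commit(String txId);     // phase 2: apply and release locks
        void abort(String txId);      // phase 2: roll back and release locks
    }

    // Returns true if the transaction committed on all participants.
    public static boolean run(String txId, List<Participant> participants) {
        // Phase 1 — collect votes; a single NO dooms the transaction
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) {
                allYes = false;
            }
        }
        // Phase 2 — broadcast one decision to every participant
        for (Participant p : participants) {
            if (allYes) p.commit(txId); else p.abort(txId);
        }
        return allYes;
    }
}
```

Note where the blocking problem lives: if the coordinator dies between the two loops, every participant that voted YES is stuck holding locks with no decision — which is exactly the failure mode described next.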
Why It's Slow
Every message is a network round-trip. With 3 participants:
Phase 1: Coordinator → A, B, C (3 messages out)
A, B, C → Coordinator (3 votes back)
Phase 2: Coordinator → A, B, C (3 messages out)
A, B, C → Coordinator (3 acknowledgements back)
= 4 network round-trips minimum
= 10–100ms added latency per transaction
= locks held for that entire duration
While locks are held, every other transaction that touches the same rows is blocked. Under high load this causes a queue of waiting transactions — throughput collapses.
The Coordinator Crash Problem — Why 2PC Is "Blocking"
This is the fatal flaw:
Phase 1 completes — all participants voted YES and locked their resources.
Coordinator crashes before sending Phase 2.
Participants are now stuck:
- They voted YES so they cannot unilaterally abort
- They have not received COMMIT so they cannot commit
- Their locks are held — no other transaction can proceed
They wait. Until the coordinator recovers.
This is why 2PC is called a blocking protocol — it cannot make progress without the coordinator. A crashed or slow coordinator stalls the entire distributed transaction, and every row it touched is locked for the duration.
Where 2PC Is Actually Used
Despite the problems, it has a place:
| Use case | Why 2PC is acceptable |
|---|---|
| PostgreSQL cross-DB transactions (`PREPARE TRANSACTION`) | Same datacenter, low latency, coordinator crash is rare, strong consistency required |
| Banking — money moves between two DB instances | Atomicity is non-negotiable, transaction volume is manageable, 50ms overhead is acceptable |
| XA transactions (Java EE / JTA) | Legacy enterprise apps coordinating DB + message broker atomically in one transaction |
| Single datacenter, 2–3 participants, low volume | Network is fast and reliable, coordinator failure is rare, throughput is not a concern |
When NOT to Use 2PC
- Microservices across the internet — network is unreliable, latency spikes are common, coordinator crash is a real risk
- High-throughput systems — locks block concurrent transactions, throughput collapses under load
- Services you don't own — can't implement the 2PC protocol on a third-party API (Stripe, Twilio, etc.)
- Anything that must survive coordinator failure gracefully — use Saga instead
2PC vs Saga
| | 2PC | Saga |
|---|---|---|
| Atomicity | True atomic — all or nothing in one shot | Eventual — compensate after the fact |
| Locks | Holds locks across all participants for the full duration | No cross-service locks |
| Coordinator crash | Blocking — participants stuck waiting | No blocking — each step is independent |
| Intermediate state | Never visible — all or nothing | Visible mid-saga (partial completion) |
| Throughput | Low — locks + round trips under load | High — async, no blocking |
| Network requirement | Reliable, low-latency (same datacenter) | Tolerates unreliable network |
| Best for | Same-datacenter, 2–3 DBs, low volume, must be truly atomic | Microservices, high scale, eventual consistency acceptable |
Real examples:
- PostgreSQL — `PREPARE TRANSACTION` / `COMMIT PREPARED` implements 2PC natively for cross-database operations.
- MySQL — XA transactions for coordinating with a message broker (e.g. commit a DB write and a message enqueue atomically).
- Banking — intra-bank transfers between two internal DB instances where atomicity is legally required and volume is controlled.
- Avoid at scale — Uber, Amazon, Netflix explicitly avoid 2PC across services. They use Saga (with Temporal or Axon) because 2PC's blocking behavior under failure is incompatible with their availability requirements.
Limits Quick Reference
Key numbers to memorize — throughput, capacity, and scaling thresholds for every major component in this doc.
| Component | Key Numbers |
|---|---|
| Redis Cache | 100GB RAM/instance, 100K–1M ops/sec, 512MB max value size |
| Redis Sorted Sets | 4.3B members max, ~50–100 bytes/member overhead, O(log n) all ops |
| Redis Pub/Sub | Millions of msgs/sec, no persistence, no history for late subscribers |
| Redis Geo | Precision: 4 chars=20km, 6 chars=610m, 8 chars=19m. 4.3B member limit. |
| Kafka | Millions msgs/sec/broker, 1MB default msg size, ~4K partitions/broker, 5–30ms latency |
| WebSockets | 10K–100K connections/server, ~2–8KB overhead per connection |
| Read Replicas | 5–10 practical limit, < 100ms replication lag same AZ, seconds cross-region |
| DB Sharding | Consider sharding at ~5K–10K writes/sec or ~1–2TB data |
| Bloom Filter | 1B items at 1% FPR ≈ 1.2GB RAM. No false negatives. No deletes. |
| Consistent Hashing | 100–200 virtual nodes/physical node. Adding a node remaps only 1/n keys. |
| Geohashing | 4 chars=20km, 5 chars=2.4km, 6 chars=610m, 7 chars=76m, 8 chars=19m |
Quick Reference — NFR → Building Block
Given a problem or NFR, this maps you directly to the component that solves it.
| NFR / Problem | Reach For |
|---|---|
| Read-heavy, slow DB | Redis Cache (cache-aside), Read Replicas, CDN |
| Write-heavy, DB bottleneck | Kafka, DB Sharding, WAL-based async replication |
| Real-time push to clients | WebSockets (bidirectional), SSE (server→client), Redis Pub/Sub (fan-out) |
| Latency < 100ms | Redis Cache, CDN, Redis Geo (for location), precomputed results |
| Proximity / location search | Redis Geo, Geohashing, Quadtree |
| Fan-out (1 event → many consumers) | Kafka (consumer groups), Redis Pub/Sub |
| Prevent double payment / double action | Idempotency Keys |
| Multi-step transaction, microservices | Saga Pattern |
| Multi-step transaction, single team, strong consistency | 2PC |
| Payment discrepancy detection | Reconciliation |
| Slow/failing downstream service | Circuit Breaker |
| API abuse, DDoS protection | Rate Limiter (Token Bucket at API Gateway) |
| Scale cache/DB nodes without resharding everything | Consistent Hashing |
| Avoid DB lookup for non-existent keys | Bloom Filter |
| Metrics, sensor data, time-based queries | Time-Series DB |
| Crash recovery, replication | Write-Ahead Log (WAL) |
| Files, images, video at scale | Object Storage (S3) |
| User file upload without proxying through backend | Presigned S3 URL (client uploads direct) |
| Ensure uploaded files are safe before serving | Two-bucket strategy + async processing pipeline |
| Prove API request is authentic, prevent replay | HMAC request signing + Nonce |
| Store encryption private keys securely (PCI-DSS) | HSM (AWS CloudHSM / AWS KMS) |
| Audit trail without changing app code | CDC (Debezium → Kafka) |
| Guarantee each payment event processed exactly once | Kafka exactly-once semantics |
| Push payment status updates to merchant servers | Webhook delivery with exponential backoff |
| Track payment lifecycle, enforce idempotency | PaymentIntent state machine |
| Consistency between transaction state and consumers | Two-phase event model |
| Leader election, distributed config | etcd (modern), Consul, ZooKeeper (legacy) |
| Cluster membership, failure detection | Gossip Protocol |
| How etcd / CockroachDB reach consensus | Raft |
| Prevent two workers processing same job | Distributed Lock (Redis SETNX or etcd) |
| Services finding each other dynamically | Service Discovery (Consul, Kubernetes Service) |