Fundamentals
Core infrastructure terms used across all system design docs. Read this first — every other doc assumes you know these.
Infrastructure Geography
Region — a geographic area with its own cluster of datacenters. Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), asia-pacific-1 (Tokyo). Data stays in the region unless you explicitly replicate it out. Choose region based on where your users are and data residency requirements (GDPR).
Availability Zone (AZ) — a physically separate datacenter (or small group of datacenters) within a region. AZs in the same region sit in roughly the same metro area, but in different buildings with independent power, cooling, and networking. AZs within a region are connected by low-latency private fiber (< 2ms). AWS us-east-1 has 6 AZs (us-east-1a through us-east-1f). GCP calls them "zones" (us-east1-a, us-east1-b, etc.).
Why AZs exist: A single datacenter can go down — power outage, cooling failure, flooding. If all your components live in one AZ, that's a single point of failure. Spreading across AZs protects against datacenter-level failures.
Multi-AZ — your components are spread across 2+ AZs within the same region. Protects against a single datacenter going down. Standard practice for anything targeting 99.9%+ availability.
Region: us-east-1
├── AZ: us-east-1a → Primary DB, App servers
└── AZ: us-east-1b → Standby DB replica, App servers
Multi-region — your system runs in 2+ separate geographic regions. Protects against a full regional outage and serves users from the nearest region. Much more complex and expensive than multi-AZ — only justified at 99.99%+ or global user bases.
us-east-1 (primary) ←──replication──► eu-west-1 (standby or active)
Multi-AZ vs Multi-region:
| | Multi-AZ | Multi-region |
|---|---|---|
| Protects against | Single datacenter failure | Full regional outage |
| Latency between nodes | < 2ms | 60–150ms |
| Complexity | Low | High |
| When to use | 99.9%+ SLA | 99.99%+ or global users |
Primary / Replica
Primary — the single node that accepts all writes. Also called leader or master. Every write goes here first.
Replica — a copy of the primary that serves reads. Also called standby, follower, or slave. Cannot accept writes directly. There can be many replicas — each is a full copy of the data.
          writes
Client ──────────► Primary ──── replicates ──► Replica 1
                                          └──► Replica 2
          reads
Client ──────────► Replica 1 or Replica 2
Why replicas exist:
- Scale reads — spread read traffic across multiple nodes instead of hammering the primary
- High availability — if the primary fails, promote a replica to become the new primary (failover)
- Redundancy — data exists on multiple machines, no single point of data loss
Failover — when the primary goes down, one replica is promoted to become the new primary. Takes ~30–60s with managed services (AWS RDS Multi-AZ, GCP Cloud SQL HA). The app reconnects to the new primary — handled automatically by the DB proxy or cluster endpoint.
Replication: Sync vs Async
How data gets from the primary to replicas — and what it costs.
Synchronous replication — primary waits for at least one replica to confirm the write before acknowledging success to the client. No data loss on primary failure (RPO = 0), but every write pays extra latency.
Client → Primary → write to disk → wait for replica ACK → reply to client
└──► Replica (must confirm before client gets OK)
- Extra latency: +5–20ms per write (waiting for replica in another AZ)
- Use when: payments, bank transactions — data loss is unacceptable
Asynchronous replication — primary acknowledges the write immediately, then replicates in the background. Faster writes, but a crash before replication completes loses that data (RPO = seconds).
Client → Primary → write to disk → reply to client immediately
└──► Replica (happens in background, no waiting)
- Extra latency: none — write acks immediately
- Use when: social feeds, user profiles — tiny data loss is acceptable
Replication lag — the delay between a write landing on the primary and becoming visible on replicas. Typically < 100ms same AZ, seconds cross-region (async). This means a user who just wrote something might read stale data if their read goes to a lagging replica. Fix: route a user's own reads back to the primary for a short window after their writes (sticky routing).
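A sticky-routing sketch in Python (illustrative only: the endpoint names and `record_write` helper are assumptions, and a real system would keep the timestamps in a shared store like Redis rather than process memory):

```python
import time

STICKY_WINDOW_SECONDS = 5.0  # should exceed worst-case replication lag

last_write_at: dict[str, float] = {}  # user_id -> time of last write

def record_write(user_id: str) -> None:
    """Call after every write on behalf of this user."""
    last_write_at[user_id] = time.monotonic()

def pick_endpoint(user_id: str) -> str:
    """Route a user's reads to the primary shortly after their own
    writes, so they never see their own write missing; all other
    reads go to a replica."""
    wrote_at = last_write_at.get(user_id)
    if wrote_at is not None and time.monotonic() - wrote_at < STICKY_WINDOW_SECONDS:
        return "primary"
    return "replica"

record_write("user_123")
print(pick_endpoint("user_123"))  # "primary" (inside sticky window)
print(pick_endpoint("user_456"))  # "replica" (no recent write)
```

The window only needs to cover replication lag, so it can be short; the cost is a small amount of extra read traffic on the primary.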
Sharding
Split data across multiple nodes so no single node holds everything. Each node holds a shard — a subset of the total data.
Why shard: A single DB node has limits — storage, write throughput, connection count. When you outgrow one node, sharding is the answer. It scales both storage and write throughput linearly with the number of shards.
Shard key — the field used to decide which shard a record belongs to. Example: shard by user_id. All data for a given user lives on one shard.
user_id 1–1M → Shard 1 (its own DB node)
user_id 1M–2M → Shard 2
user_id 2M–3M → Shard 3
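The shard-key lookup above can be sketched as a range-routing function (boundaries are illustrative and treated as exclusive upper bounds; real routers keep this mapping in a config service so shards can be split later):

```python
# Upper bound (exclusive) for each shard, mirroring the ranges above.
SHARD_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]

def shard_for(user_id: int) -> int:
    """Return the 1-indexed shard that holds this user's data."""
    for shard, upper in enumerate(SHARD_BOUNDARIES, start=1):
        if user_id < upper:
            return shard
    raise ValueError(f"user_id {user_id} beyond last shard boundary")

print(shard_for(42))         # 1
print(shard_for(1_500_000))  # 2
```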
The hard part — cross-shard queries: If you need data from multiple shards (e.g. "find all orders above $100 across all users"), you must query every shard and merge results. Avoid this by designing your shard key around your most common query pattern.
When to shard: At 10K–50K QPS or when data volume outgrows what fits on one host. Before that, read replicas and caching are simpler and sufficient.
| | Replicas | Sharding |
|---|---|---|
| Scales | Reads | Reads + Writes + Storage |
| Data per node | Full copy | Subset |
| When to reach for it | Read-heavy, < 10K QPS | Write-heavy or data too large for one node |
Stateless vs Stateful
Stateful — the server remembers something about the client between requests (session data, in-progress work). If that server dies, the state is lost. Client must reconnect to the same server.
Stateless — the server remembers nothing. Every request carries all the information needed to handle it (JWT token, request body). Any server can handle any request. If a server dies, the load balancer routes to another — client notices nothing.
Stateful: Client must always hit Server A (holds their session)
Stateless: Client hits Server A, B, or C — all identical, all can handle it
Why stateless matters for scale: Stateless servers can be added or removed freely behind a load balancer. Auto-scaling works because any new instance is identical to any existing one. This is why "make your app servers stateless" is one of the first scaling moves — push session state to Redis, not in-memory.
Hot Standby vs Warm Standby vs Cold Standby
Three levels of failover readiness — trading cost for recovery speed.
| | Hot Standby | Warm Standby | Cold Standby |
|---|---|---|---|
| State | Running and in sync, serving read traffic | Running but not serving traffic | Stopped or not yet provisioned |
| Failover time | Seconds (just redirect traffic) | 1–5 minutes (promote and configure) | 10–60 minutes (spin up, restore backup) |
| Cost | Highest (full duplicate running) | Medium (instance running, idle) | Lowest (pay only on failure) |
| Use when | 99.99%+ SLA, zero tolerance for downtime | 99.9% SLA, minutes of downtime OK | Dev/test, or batch systems where hours of downtime is fine |
Real examples:
- AWS RDS Multi-AZ → hot standby (standby is kept in sync via synchronous replication, failover in ~60s; note the classic Multi-AZ standby does not serve read traffic)
- A spare EC2 instance kept running but not in rotation → warm standby
- Restoring from an S3 snapshot to a new instance → cold standby
Hashing
A hash function takes any input (string, number, object) and maps it to a fixed-size output (the hash). Same input always produces the same output. Used everywhere in distributed systems to decide where data lives.
hash("user_123") → 482910
hash("user_456") → 91847
hash("user_789") → 374621
Modulo hashing — the simplest approach. Take the hash, mod by number of nodes to get the target node:
node = hash(key) % number_of_nodes
Works fine until you add or remove a node — then number_of_nodes changes and almost every key maps to a different node. For a cache, this means a cache miss storm. For a DB, it means a massive resharding operation.
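A quick simulation of the reshuffle, using `zlib.crc32` as a stable hash (Python's builtin `hash()` is salted per process, so it wouldn't give reproducible placements):

```python
import zlib

def node_for(key: str, num_nodes: int) -> int:
    """Modulo hashing: stable hash of the key, mod node count."""
    return zlib.crc32(key.encode()) % num_nodes

keys = [f"user_{i}" for i in range(10_000)]
# Count keys whose node assignment changes when going from 4 to 5 nodes.
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 nodes")
# Typically ~80% remap — hence the cache miss storm.
```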
Consistent Hashing
Solves the modulo hashing problem — when nodes are added or removed, only a small fraction of keys need to move.
How it works: Both nodes and keys are mapped onto a circular ring (positions 0 to 2³² − 1, wrapping back to 0) using the same hash function. A key is assigned to the first node clockwise from its position on the ring.
           0 / 2³² (same point, the ring wraps)
              /          \
         Node C          Node A
            |               |
             \             /
               \         /
                 Node B
When a node is added, it only takes keys from its immediate clockwise neighbour — roughly 1/n of total keys (n = number of nodes). When a node is removed, its keys move to the next node clockwise. Compare this to modulo hashing where adding one node reshuffles nearly everything.
Virtual nodes: Each physical node gets multiple positions on the ring (100–200 is typical). This distributes load more evenly and handles nodes with different hardware capacities.
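A minimal ring with virtual nodes, sketched in Python (hash choice and vnode count are illustrative; production clients such as ketama-based Memcached libraries differ in the details):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def _pos(self, key: str) -> int:
        """Map any string to a position on the 0..2^32-1 ring."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add(self, node: str) -> None:
        # Each physical node gets `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key: str) -> str:
        pos = self._pos(key)
        idx = bisect.bisect(self.ring, (pos, ""))  # first vnode clockwise
        return self.ring[idx % len(self.ring)][1]  # wrap past 2^32 to 0

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user_{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("node-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4 with 4 nodes
```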
Where it applies — not just DBs:
- Distributed cache (Redis Cluster, Memcached) — decide which cache node holds a key. Adding a cache node only invalidates ~1/n of cached keys instead of everything.
- DB sharding (Cassandra, DynamoDB) — partition data across DB nodes with minimal resharding on topology changes.
- Load balancing — route a client to the same backend server consistently (useful for stateful sessions).
Real examples: Cassandra and DynamoDB use consistent hashing for data partitioning. Memcached clients use it to distribute keys across cache nodes.
Hot Keys
A hot key is a single key (in a cache, shard, or queue) that receives a disproportionately large share of traffic — so much that the single node holding it becomes a bottleneck regardless of how many other nodes exist.
Why it happens: Traffic is never perfectly uniform. A celebrity user, a viral post, a popular product — one entity can generate millions of requests while everything else averages a few hundred.
The problem: Consistent hashing and sharding distribute keys evenly on average, but one extremely popular key still lands on one node. That node gets hammered while others sit idle.
Where hot keys appear:
- Cache — one Redis node serving 500K req/sec for a single cache key (e.g. Taylor Swift's profile)
- DB shard — one shard holding a celebrity's data getting 100× more queries than other shards
- Kafka partition — one partition receiving all messages for a popular topic because the partition key is too coarse
Fixes by layer:
| Layer | Fix | How |
|---|---|---|
| Cache | Key fanout | Store identical copies under multiple keys (key:1, key:2, … key:N). Readers pick one randomly. Spreads load across N nodes. |
| Cache | Local in-process cache | Cache the hot value in each app server's memory for 1–5s. Eliminates Redis entirely for that key. |
| DB shard | Shard key redesign | Add a suffix to the key (user_id + random 1–10) to spread across shards. Queries must aggregate. |
| Kafka | Partition key redesign | Use a more granular partition key (e.g. user_id instead of region) to spread messages. |
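The cache-layer key-fanout fix can be sketched like this (a plain dict stands in for the real cache client; key names and fanout count are illustrative):

```python
import random

FANOUT = 8  # number of identical copies; spreads load across ~8 nodes
cache: dict[str, str] = {}  # stand-in for a distributed cache client

def put_hot(key: str, value: str) -> None:
    """Write identical copies under key:0 .. key:N-1."""
    for i in range(FANOUT):
        cache[f"{key}:{i}"] = value

def get_hot(key: str):
    """Read a random copy, so no single node takes all the traffic."""
    return cache.get(f"{key}:{random.randrange(FANOUT)}")

put_hot("profile:taylor", "cached-profile-json")
print(get_hot("profile:taylor"))  # "cached-profile-json"
```

The trade-off: writes (and invalidations) cost N operations instead of one, which is fine precisely because hot keys are read-dominated.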
CAP Theorem
In a distributed system, during a network partition (nodes can't communicate), you must choose between:
- Consistency (C) — every read sees the most recent write. If you can't guarantee this, refuse the request.
- Availability (A) — every request gets a response, even if the data might be stale.
Partition tolerance (P) is not a choice — network partitions happen in any distributed system. So the real choice is always C vs A when a partition occurs.
CAP Triangle
                Consistency
                    /\
                   /  \
              CP  /    \  CA  ← CA = no partition tolerance
                 /      \        (single node, not distributed)
                /________\
      Partition     AP     Availability
      tolerance
| Choice | Behaviour during partition | Examples |
|---|---|---|
| CP (Consistent + Partition tolerant) | Refuse requests rather than serve stale data | PostgreSQL, Zookeeper, HBase |
| AP (Available + Partition tolerant) | Serve possibly stale data rather than refuse | Cassandra, DynamoDB, CouchDB |
In practice: Most systems aren't purely CP or AP — they let you tune consistency per operation. Cassandra's QUORUM read is CP-like; ONE read is AP-like.
Interview signal:
- "Users can't see stale data under any circumstance" → CP
- "Always respond, even if slightly stale" → AP
- Payment systems → CP. Social feeds → AP.
Thundering Herd
When a large number of clients or processes simultaneously wake up and try to access the same resource, overwhelming it. The name comes from a herd of animals all stampeding through the same gate at once.
Where it appears:
1. Cache expiry — a popular cache entry expires. Every concurrent request sees a miss and races to rebuild from the DB. At 100K req/sec, that's 100K identical DB queries in the same instant. The DB collapses under what is effectively a self-inflicted DDoS.
2. Server restart / cold start — a server restarts with an empty in-process cache. All traffic that would have been served from memory now hits Redis or the DB simultaneously.
3. Client reconnects — a WebSocket or SSE server crashes. All connected clients detect the drop and attempt to reconnect at the exact same moment, slamming the auth service and remaining servers.
4. Retry storms — a downstream service goes slow. All upstream clients hit their timeout at the same time and retry simultaneously, creating a second wave of traffic exactly when the downstream service is most stressed.
Fixes:
| Problem | Fix | How |
|---|---|---|
| Cache expiry | TTL jitter | TTL = base + random(0, 30s) — stagger expiry so keys don't all expire together |
| Cache expiry | Probabilistic early refresh | Start refreshing before expiry with increasing probability as TTL runs down |
| Cache expiry | Request coalescing | Only one request rebuilds the cache; others wait for that result |
| Client reconnects | Reconnect jitter | delay = base * 2^attempt + random(0, 1s) — spread reconnects over a window |
| Retry storms | Exponential backoff + jitter | Double the wait between retries, add randomness so clients don't retry in sync |
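Request coalescing from the table above can be sketched with a per-key single-flight lock (illustrative; Go's singleflight package implements the same idea):

```python
import threading

_lock = threading.Lock()
_inflight: dict[str, threading.Event] = {}  # key -> rebuild in progress
_cache: dict[str, str] = {}
rebuilds = 0  # counts how many times the "DB" was actually hit

def expensive_rebuild(key: str) -> str:
    global rebuilds
    rebuilds += 1  # stands in for the slow DB query
    return f"value-for-{key}"

def get(key: str) -> str:
    while True:
        with _lock:
            if key in _cache:
                return _cache[key]
            event = _inflight.get(key)
            if event is None:
                # We won the race: we are the single rebuilder.
                event = _inflight[key] = threading.Event()
                break
        event.wait()  # another thread is rebuilding; wait for its result
    value = expensive_rebuild(key)
    with _lock:
        _cache[key] = value
        del _inflight[key]
    event.set()  # wake all waiters; they re-check the cache and hit
    return value

threads = [threading.Thread(target=get, args=("feed:home",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(rebuilds)  # 1 — fifty concurrent misses, a single rebuild
```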
Backpressure
When a downstream component is overwhelmed, it signals upstream components to slow down rather than accepting more work than it can handle. The upstream component then slows its rate of sending — or rejects new requests at the entry point — instead of letting the overload cascade deeper into the system.
Without backpressure:
Producer → Queue (100K items backed up) → Consumer (can handle 1K/sec)
Result: Queue grows unboundedly → OOM → crash
With backpressure:
Consumer signals "I'm full" → Producer slows to 1K/sec → Queue stays bounded
How it's implemented:
| Mechanism | How | Used in |
|---|---|---|
| Bounded queue | Queue has a max size. When full, producer blocks or gets an error. | Java BlockingQueue, Kafka producer max.block.ms |
| TCP flow control | Receiver advertises a window size. Sender cannot exceed it. | Built into TCP — automatic |
| Rate limiting | API Gateway rejects requests above a threshold with 429. | Nginx, API Gateway |
| Load shedding | Service deliberately drops low-priority requests under load rather than queuing them. | Real-time systems where stale responses are worse than no response |
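The bounded-queue row can be demonstrated in a few lines (Python's `queue.Queue` with `maxsize` is a real bounded queue; rejecting at the edge is shown here, though blocking the producer is an equally valid policy):

```python
import queue

q: "queue.Queue[int]" = queue.Queue(maxsize=100)  # bounded: the key decision

accepted = rejected = 0
for item in range(1_000):  # fast producer, no consumer draining
    try:
        q.put_nowait(item)  # raises queue.Full instead of growing forever
        accepted += 1
    except queue.Full:
        rejected += 1  # shed load here, or block / slow the producer

print(accepted, rejected)  # 100 900 — the queue stays bounded at 100 items
```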
Interview signal: Whenever you add a queue, ask "what happens when the consumer falls behind?" If the queue is unbounded, you have a latent crash waiting to happen. Backpressure is the answer.
Fan-out
One write or event triggers delivery to many downstream consumers. The ratio of one-to-many is the core challenge — a single action can explode into millions of operations.
Two models:
Push fan-out (eager) — when a write happens, immediately compute and deliver to all consumers. High write cost, very fast reads.
User posts tweet → fan out to all 10M followers' feeds immediately
Read feed → instant (already precomputed)
Write cost: 10M operations per tweet
Pull fan-out (lazy) — store the write once. Each consumer fetches and merges on read. Low write cost, slower reads.
User posts tweet → store once in author's timeline
Read feed → merge timelines of all followed users at read time
Read cost: O(accounts followed) per feed load
Hybrid (Twitter's actual approach): Push fan-out for normal users (fast reads). Pull fan-out for celebrity accounts (too many followers to push to). Read path merges both.
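A toy sketch of the hybrid model (in-memory dicts stand in for real storage, and the celebrity threshold is an assumed cutoff; real systems pick it from follower-count distributions):

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"],
             "celebrity": [f"fan_{i}" for i in range(5)]}
author_timeline = defaultdict(list)  # pull side: one copy per author
feeds = defaultdict(list)            # push side: precomputed per follower

CELEBRITY_THRESHOLD = 3  # assumed cutoff for the hybrid split

def post(author: str, tweet: str) -> None:
    author_timeline[author].append(tweet)  # always store once
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for f in followers[author]:  # push: O(followers) writes now
            feeds[f].append(tweet)

def read_feed(user: str, follows: list[str]) -> list[str]:
    merged = list(feeds[user])  # precomputed (push) part
    for author in follows:      # merge in the celebrity (pull) part
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            merged.extend(author_timeline[author])
    return merged

post("alice", "hello")         # pushed to bob and carol immediately
post("celebrity", "big news")  # stored once, pulled at read time
print(read_feed("bob", ["alice", "celebrity"]))  # ['hello', 'big news']
```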
Where fan-out appears:
- Social feeds (posts → follower feeds)
- Notifications (one event → many device pushes)
- Kafka topics (one message → multiple consumer groups each getting a full copy)
- Redis Pub/Sub (one publish → all channel subscribers)
The scaling problem: Fan-out writes scale with the number of consumers, not the number of writes. A user with 10M followers generates 10M writes per post. Rate-limit posting, use async workers, and batch deliveries to manage this.
Head-of-Line Blocking
When a slow or stuck item at the front of a queue or connection blocks every item behind it, even if those items could be processed immediately. The queue can only move as fast as its slowest item.
HTTP/1.1 example:
Request 1 (slow, 2s) ──► Server ──► Response 1 (after 2s)
Request 2 (fast, 1ms) ──► BLOCKED until Request 1 completes
Request 3 (fast, 1ms) ──► BLOCKED until Request 2 completes
Browsers work around this by opening up to ~6 parallel connections per domain (a de facto browser limit, not part of the HTTP/1.1 spec), but that is still a hard cap.
HTTP/2 fix — multiplexing: Multiple requests share one connection as independent streams. A slow stream doesn't block other streams.
Stream 1 (slow) ──┐
Stream 2 (fast) ──┼──► shared connection ──► all processed independently
Stream 3 (fast) ──┘
Where HOL blocking appears:
| Context | How it manifests | Fix |
|---|---|---|
| HTTP/1.1 | Slow request blocks the connection | HTTP/2 multiplexing |
| DB connection pool | One long query holds a connection; others queue | Query timeouts, separate read/write pools |
| Kafka single partition | One slow consumer blocks all messages behind it | More partitions, consumer timeout config |
| Single-threaded queue | One expensive job blocks all others | Separate queues by job type/priority |
Interview signal: If you have a queue mixing fast and slow jobs (e.g. image resize + video transcode), a backed-up transcode job will delay all image resizes behind it. Separate queues by job type to eliminate HOL blocking between workload classes.
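The separate-queues fix can be demonstrated directly (job costs are simulated with `time.sleep`; a `None` sentinel shuts each worker down):

```python
import queue
import threading
import time

image_q: "queue.Queue" = queue.Queue()  # fast jobs
video_q: "queue.Queue" = queue.Queue()  # slow jobs
done: list[str] = []  # completion order (list.append is thread-safe)

def worker(q: "queue.Queue", cost: float) -> None:
    while True:
        job = q.get()
        if job is None:  # shutdown sentinel
            return
        time.sleep(cost)  # simulated work
        done.append(job)

video_q.put("transcode-1")  # slow job queued first
for i in range(3):
    image_q.put(f"resize-{i}")
video_q.put(None)
image_q.put(None)

threads = [threading.Thread(target=worker, args=(image_q, 0.001)),
           threading.Thread(target=worker, args=(video_q, 0.2))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(done)  # resizes finish first despite the transcode being queued earlier
```

With a single shared queue, all three resizes would have waited behind the 200ms transcode.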
Further Reading in These Docs
These concepts are fully covered elsewhere — no need to duplicate them here:
- Idempotency — full section in Non-Functional Requirements
- RPO / RTO — full section in Non-Functional Requirements
- Security terms (TLS, JWT, OAuth2, RBAC, PII) — full section in Non-Functional Requirements