Fundamentals
Core infrastructure terms used across all system design docs. Read this first — every other doc assumes you know these.
Infrastructure Geography
Region — a geographic area with its own cluster of datacenters. Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), asia-pacific-1 (Tokyo). Data stays in the region unless you explicitly replicate it out. Choose region based on where your users are and data residency requirements (GDPR).
Availability Zone (AZ) — a physically separate datacenter (or small group of datacenters) within a region. AZs in the same region sit in roughly the same metro area, but in different buildings with independent power, cooling, and networking. AZs within a region are connected by low-latency private fiber (< 2ms). AWS us-east-1 has 6 AZs (us-east-1a through us-east-1f). GCP calls them "zones" (us-east1-a, us-east1-b, etc.).
Why AZs exist: A single datacenter can go down — power outage, cooling failure, flooding. If all your components live in one AZ, that's a single point of failure. Spreading across AZs protects against datacenter-level failures.
Multi-AZ — your components are spread across 2+ AZs within the same region. Protects against a single datacenter going down. Standard practice for anything targeting 99.9%+ availability.
Region: us-east-1
├── AZ: us-east-1a → Primary DB, App servers
└── AZ: us-east-1b → Standby DB replica, App servers
Multi-region — your system runs in 2+ separate geographic regions. Protects against a full regional outage and serves users from the nearest region. Much more complex and expensive than multi-AZ — only justified at 99.99%+ or global user bases.
us-east-1 (primary) ←──replication──► eu-west-1 (standby or active)
Multi-AZ vs Multi-region:
| | Multi-AZ | Multi-region |
|---|---|---|
| Protects against | Single datacenter failure | Full regional outage |
| Latency between nodes | < 2ms | 60–150ms |
| Complexity | Low | High |
| When to use | 99.9%+ SLA | 99.99%+ or global users |
Primary / Replica
Primary — the single node that accepts all writes. Also called leader or master. Every write goes here first.
Replica — a copy of the primary that serves reads. Also called standby, follower, or slave. Cannot accept writes directly. There can be many replicas — each is a full copy of the data.
          writes
Client ──────────► Primary ──── replicates ──► Replica 1
                                          └──► Replica 2
          reads
Client ──────────► Replica 1 or Replica 2
Why replicas exist:
- Scale reads — spread read traffic across multiple nodes instead of hammering the primary
- High availability — if the primary fails, promote a replica to become the new primary (failover)
- Redundancy — data exists on multiple machines, no single point of data loss
Failover — when the primary goes down, one replica is promoted to become the new primary. Takes ~30–60s with managed services (AWS RDS Multi-AZ, GCP Cloud SQL HA). The app reconnects to the new primary — handled automatically by the DB proxy or cluster endpoint.
Replication: Sync vs Async
How data gets from the primary to replicas — and what it costs.
Synchronous replication — primary waits for at least one replica to confirm the write before acknowledging success to the client. No data loss on primary failure (RPO = 0), but every write pays extra latency.
Client → Primary → write to disk → wait for replica ACK → reply to client
└──► Replica (must confirm before client gets OK)
- Extra latency: +5–20ms per write (waiting for replica in another AZ)
- Use when: payments, bank transactions — data loss is unacceptable
Asynchronous replication — primary acknowledges the write immediately, then replicates in the background. Faster writes, but a crash before replication completes loses that data (RPO = seconds).
Client → Primary → write to disk → reply to client immediately
└──► Replica (happens in background, no waiting)
- Extra latency: none — write acks immediately
- Use when: social feeds, user profiles — tiny data loss is acceptable
Replication lag — the delay between a write landing on the primary and becoming visible on replicas. Typically < 100ms same AZ, seconds cross-region (async). This means a user who just wrote something might read stale data if their read goes to a lagging replica. Fix: route a user's own reads back to the primary for a short window after their writes (sticky routing).
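A sticky-routing sketch in Python (illustrative only: the endpoint names and `record_write` helper are assumptions, and a real system would keep the timestamps in a shared store like Redis rather than process memory):

```python
import time

STICKY_WINDOW_SECONDS = 5.0  # should exceed worst-case replication lag

last_write_at: dict[str, float] = {}  # user_id -> time of last write

def record_write(user_id: str) -> None:
    """Call after every write on behalf of this user."""
    last_write_at[user_id] = time.monotonic()

def pick_endpoint(user_id: str) -> str:
    """Route a user's reads to the primary shortly after their own
    writes, so they never see their own write missing; all other
    reads go to a replica."""
    wrote_at = last_write_at.get(user_id)
    if wrote_at is not None and time.monotonic() - wrote_at < STICKY_WINDOW_SECONDS:
        return "primary"
    return "replica"

record_write("user_123")
print(pick_endpoint("user_123"))  # "primary" (inside sticky window)
print(pick_endpoint("user_456"))  # "replica" (no recent write)
```

The window only needs to cover replication lag, so it can be short; the cost is a small amount of extra read traffic on the primary.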
Sharding
Split data across multiple nodes so no single node holds everything. Each node holds a shard — a subset of the total data.
Why shard: A single DB node has limits — storage, write throughput, connection count. When you outgrow one node, sharding is the answer. It scales both storage and write throughput linearly with the number of shards.
Shard key — the field used to decide which shard a record belongs to. Example: shard by user_id. All data for a given user lives on one shard.
user_id 1–1M → Shard 1 (its own DB node)
user_id 1M–2M → Shard 2
user_id 2M–3M → Shard 3
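The shard-key lookup above can be sketched as a range-routing function (boundaries are illustrative and treated as exclusive upper bounds; real routers keep this mapping in a config service so shards can be split later):

```python
# Upper bound (exclusive) for each shard, mirroring the ranges above.
SHARD_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]

def shard_for(user_id: int) -> int:
    """Return the 1-indexed shard that holds this user's data."""
    for shard, upper in enumerate(SHARD_BOUNDARIES, start=1):
        if user_id < upper:
            return shard
    raise ValueError(f"user_id {user_id} beyond last shard boundary")

print(shard_for(42))         # 1
print(shard_for(1_500_000))  # 2
```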
The hard part — cross-shard queries: If you need data from multiple shards (e.g. "find all orders above $100 across all users"), you must query every shard and merge results. Avoid this by designing your shard key around your most common query pattern.
When to shard: At 10K–50K QPS or when data volume outgrows what fits on one host. Before that, read replicas and caching are simpler and sufficient.
| | Replicas | Sharding |
|---|---|---|
| Scales | Reads | Reads + Writes + Storage |
| Data per node | Full copy | Subset |
| When to reach for it | Read-heavy, < 10K QPS | Write-heavy or data too large for one node |
Stateless vs Stateful
Stateful — the server remembers something about the client between requests (session data, in-progress work). If that server dies, the state is lost. Client must reconnect to the same server.
Stateless — the server remembers nothing. Every request carries all the information needed to handle it (JWT token, request body). Any server can handle any request. If a server dies, the load balancer routes to another — client notices nothing.
Stateful: Client must always hit Server A (holds their session)
Stateless: Client hits Server A, B, or C — all identical, all can handle it
Why stateless matters for scale: Stateless servers can be added or removed freely behind a load balancer. Auto-scaling works because any new instance is identical to any existing one. This is why "make your app servers stateless" is one of the first scaling moves — push session state to Redis, not in-memory.
Hot Standby vs Warm Standby vs Cold Standby
Three levels of failover readiness — trading cost for recovery speed.
| | Hot Standby | Warm Standby | Cold Standby |
|---|---|---|---|
| State | Running and in sync, serving read traffic | Running but not serving traffic | Stopped or not yet provisioned |
| Failover time | Seconds (just redirect traffic) | 1–5 minutes (promote and configure) | 10–60 minutes (spin up, restore backup) |
| Cost | Highest (full duplicate running) | Medium (instance running, idle) | Lowest (pay only on failure) |
| Use when | 99.99%+ SLA, zero tolerance for downtime | 99.9% SLA, minutes of downtime OK | Dev/test, or batch systems where hours of downtime is fine |
Real examples:
- AWS RDS Multi-AZ → hot standby (standby is kept in sync via synchronous replication, failover in ~60s; note the classic Multi-AZ standby does not serve read traffic)
- A spare EC2 instance kept running but not in rotation → warm standby
- Restoring from an S3 snapshot to a new instance → cold standby
Hashing
A hash function takes any input (string, number, object) and maps it to a fixed-size output (the hash). Same input always produces the same output. Used everywhere in distributed systems to decide where data lives.
hash("user_123") → 482910
hash("user_456") → 91847
hash("user_789") → 374621
Modulo hashing — the simplest approach. Take the hash, mod by number of nodes to get the target node:
node = hash(key) % number_of_nodes
Works fine until you add or remove a node — then number_of_nodes changes and almost every key maps to a different node. For a cache, this means a cache miss storm. For a DB, it means a massive resharding operation.
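A quick simulation of the reshuffle, using `zlib.crc32` as a stable hash (Python's builtin `hash()` is salted per process, so it wouldn't give reproducible placements):

```python
import zlib

def node_for(key: str, num_nodes: int) -> int:
    """Modulo hashing: stable hash of the key, mod node count."""
    return zlib.crc32(key.encode()) % num_nodes

keys = [f"user_{i}" for i in range(10_000)]
# Count keys whose node assignment changes when going from 4 to 5 nodes.
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 nodes")
# Typically ~80% remap — hence the cache miss storm.
```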
Consistent Hashing
Solves the modulo hashing problem — when nodes are added or removed, only a small fraction of keys need to move.
How it works: Both nodes and keys are mapped onto a circular ring (positions 0 to 2³² − 1, wrapping back to 0) using the same hash function. A key is assigned to the first node clockwise from its position on the ring.
           0 / 2³² (same point, the ring wraps)
              /          \
         Node C          Node A
            |               |
             \             /
               \         /
                 Node B
When a node is added, it only takes keys from its immediate clockwise neighbour — roughly 1/n of total keys (n = number of nodes). When a node is removed, its keys move to the next node clockwise. Compare this to modulo hashing where adding one node reshuffles nearly everything.
Virtual nodes: Each physical node gets multiple positions on the ring (100–200 is typical). This distributes load more evenly and handles nodes with different hardware capacities.
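A minimal ring with virtual nodes, sketched in Python (hash choice and vnode count are illustrative; production clients such as ketama-based Memcached libraries differ in the details):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def _pos(self, key: str) -> int:
        """Map any string to a position on the 0..2^32-1 ring."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add(self, node: str) -> None:
        # Each physical node gets `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key: str) -> str:
        pos = self._pos(key)
        idx = bisect.bisect(self.ring, (pos, ""))  # first vnode clockwise
        return self.ring[idx % len(self.ring)][1]  # wrap past 2^32 to 0

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user_{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("node-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4 with 4 nodes
```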
Where it applies — not just DBs:
- Distributed cache (Redis Cluster, Memcached) — decide which cache node holds a key. Adding a cache node only invalidates ~1/n of cached keys instead of everything.
- DB sharding (Cassandra, DynamoDB) — partition data across DB nodes with minimal resharding on topology changes.
- Load balancing — route a client to the same backend server consistently (useful for stateful sessions).
Real examples: Cassandra and DynamoDB use consistent hashing for data partitioning. Memcached clients use it to distribute keys across cache nodes.
Hot Keys
A hot key is a single key (in a cache, shard, or queue) that receives a disproportionately large share of traffic — so much that the single node holding it becomes a bottleneck regardless of how many other nodes exist.
Why it happens: Traffic is never perfectly uniform. A celebrity user, a viral post, a popular product — one entity can generate millions of requests while everything else averages a few hundred.
The problem: Consistent hashing and sharding distribute keys evenly on average, but one extremely popular key still lands on one node. That node gets hammered while others sit idle.
Where hot keys appear:
- Cache — one Redis node serving 500K req/sec for a single cache key (e.g. Taylor Swift's profile)
- DB shard — one shard holding a celebrity's data getting 100× more queries than other shards
- Kafka partition — one partition receiving all messages for a popular topic because the partition key is too coarse
Fixes by layer:
| Layer | Fix | How |
|---|---|---|
| Cache | Key fanout | Store identical copies under multiple keys (key:1, key:2, … key:N). Readers pick one randomly. Spreads load across N nodes. |
| Cache | Local in-process cache | Cache the hot value in each app server's memory for 1–5s. Eliminates Redis entirely for that key. |
| DB shard | Shard key redesign | Add a suffix to the key (user_id + random 1–10) to spread across shards. Queries must aggregate. |
| Kafka | Partition key redesign | Use a more granular partition key (e.g. user_id instead of region) to spread messages. |
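The cache-layer key-fanout fix can be sketched like this (a plain dict stands in for the real cache client; key names and fanout count are illustrative):

```python
import random

FANOUT = 8  # number of identical copies; spreads load across ~8 nodes
cache: dict[str, str] = {}  # stand-in for a distributed cache client

def put_hot(key: str, value: str) -> None:
    """Write identical copies under key:0 .. key:N-1."""
    for i in range(FANOUT):
        cache[f"{key}:{i}"] = value

def get_hot(key: str):
    """Read a random copy, so no single node takes all the traffic."""
    return cache.get(f"{key}:{random.randrange(FANOUT)}")

put_hot("profile:taylor", "cached-profile-json")
print(get_hot("profile:taylor"))  # "cached-profile-json"
```

The trade-off: writes (and invalidations) cost N operations instead of one, which is fine precisely because hot keys are read-dominated.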
CAP Theorem
In a distributed system, during a network partition (nodes can't communicate), you must choose between:
- Consistency (C) — every read sees the most recent write. If you can't guarantee this, refuse the request.
- Availability (A) — every request gets a response, even if the data might be stale.
Partition tolerance (P) is not a choice — network partitions happen in any distributed system. So the real choice is always C vs A when a partition occurs.
CAP Triangle
                Consistency
                    /\
                   /  \
              CP  /    \  CA  ← CA = no partition tolerance
                 /      \        (single node, not distributed)
                /________\
      Partition     AP     Availability
      tolerance
| Choice | Behaviour during partition | Examples |
|---|---|---|
| CP (Consistent + Partition tolerant) | Refuse requests rather than serve stale data | PostgreSQL, Zookeeper, HBase |
| AP (Available + Partition tolerant) | Serve possibly stale data rather than refuse | Cassandra, DynamoDB, CouchDB |
In practice: Most systems aren't purely CP or AP — they let you tune consistency per operation. Cassandra's QUORUM read is CP-like; ONE read is AP-like.
Interview signal:
- "Users can't see stale data under any circumstance" → CP
- "Always respond, even if slightly stale" → AP
- Payment systems → CP. Social feeds → AP.
Thundering Herd
When a large number of clients or processes simultaneously wake up and try to access the same resource, overwhelming it. The name comes from a herd of animals all stampeding through the same gate at once.
Where it appears:
1. Cache expiry — a popular cache entry expires. Every concurrent request sees a miss and races to rebuild from the DB. At 100K req/sec, that's 100K identical DB queries in the same instant. The DB collapses under what is effectively a self-inflicted DDoS.
2. Server restart / cold start — a server restarts with an empty in-process cache. All traffic that would have been served from memory now hits Redis or the DB simultaneously.
3. Client reconnects — a WebSocket or SSE server crashes. All connected clients detect the drop and attempt to reconnect at the exact same moment, slamming the auth service and remaining servers.
4. Retry storms — a downstream service goes slow. All upstream clients hit their timeout at the same time and retry simultaneously, creating a second wave of traffic exactly when the downstream service is most stressed.
Fixes:
| Problem | Fix | How |
|---|---|---|
| Cache expiry | TTL jitter | TTL = base + random(0, 30s) — stagger expiry so keys don't all expire together |
| Cache expiry | Probabilistic early refresh | Start refreshing before expiry with increasing probability as TTL runs down |
| Cache expiry | Request coalescing | Only one request rebuilds the cache; others wait for that result |
| Client reconnects | Reconnect jitter | delay = base * 2^attempt + random(0, 1s) — spread reconnects over a window |
| Retry storms | Exponential backoff + jitter | Double the wait between retries, add randomness so clients don't retry in sync |
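Request coalescing from the table above can be sketched with a per-key single-flight lock (illustrative; Go's singleflight package implements the same idea):

```python
import threading

_lock = threading.Lock()
_inflight: dict[str, threading.Event] = {}  # key -> rebuild in progress
_cache: dict[str, str] = {}
rebuilds = 0  # counts how many times the "DB" was actually hit

def expensive_rebuild(key: str) -> str:
    global rebuilds
    rebuilds += 1  # stands in for the slow DB query
    return f"value-for-{key}"

def get(key: str) -> str:
    while True:
        with _lock:
            if key in _cache:
                return _cache[key]
            event = _inflight.get(key)
            if event is None:
                # We won the race: we are the single rebuilder.
                event = _inflight[key] = threading.Event()
                break
        event.wait()  # another thread is rebuilding; wait for its result
    value = expensive_rebuild(key)
    with _lock:
        _cache[key] = value
        del _inflight[key]
    event.set()  # wake all waiters; they re-check the cache and hit
    return value

threads = [threading.Thread(target=get, args=("feed:home",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(rebuilds)  # 1 — fifty concurrent misses, a single rebuild
```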
Backpressure
When a downstream component is overwhelmed, it signals upstream components to slow down rather than accepting more work than it can handle. The upstream component then slows its rate of sending — or rejects new requests at the entry point — instead of letting the overload cascade deeper into the system.
Without backpressure:
Producer → Queue (100K items backed up) → Consumer (can handle 1K/sec)
Result: Queue grows unboundedly → OOM → crash
With backpressure:
Consumer signals "I'm full" → Producer slows to 1K/sec → Queue stays bounded
How it's implemented:
| Mechanism | How | Used in |
|---|---|---|
| Bounded queue | Queue has a max size. When full, producer blocks or gets an error. | Java BlockingQueue, Kafka producer max.block.ms |
| TCP flow control | Receiver advertises a window size. Sender cannot exceed it. | Built into TCP — automatic |
| Rate limiting | API Gateway rejects requests above a threshold with 429. | Nginx, API Gateway |
| Load shedding | Service deliberately drops low-priority requests under load rather than queuing them. | Real-time systems where stale responses are worse than no response |
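The bounded-queue row can be demonstrated in a few lines (Python's `queue.Queue` with `maxsize` is a real bounded queue; rejecting at the edge is shown here, though blocking the producer is an equally valid policy):

```python
import queue

q: "queue.Queue[int]" = queue.Queue(maxsize=100)  # bounded: the key decision

accepted = rejected = 0
for item in range(1_000):  # fast producer, no consumer draining
    try:
        q.put_nowait(item)  # raises queue.Full instead of growing forever
        accepted += 1
    except queue.Full:
        rejected += 1  # shed load here, or block / slow the producer

print(accepted, rejected)  # 100 900 — the queue stays bounded at 100 items
```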
Interview signal: Whenever you add a queue, ask "what happens when the consumer falls behind?" If the queue is unbounded, you have a latent crash waiting to happen. Backpressure is the answer.
Fan-out
One write or event triggers delivery to many downstream consumers. The ratio of one-to-many is the core challenge — a single action can explode into millions of operations.
Two models:
Push fan-out (eager) — when a write happens, immediately compute and deliver to all consumers. High write cost, very fast reads.
User posts tweet → fan out to all 10M followers' feeds immediately
Read feed → instant (already precomputed)
Write cost: 10M operations per tweet
Pull fan-out (lazy) — store the write once. Each consumer fetches and merges on read. Low write cost, slower reads.
User posts tweet → store once in author's timeline
Read feed → merge timelines of all followed users at read time
Read cost: O(accounts followed) per feed load
Hybrid (Twitter's actual approach): Push fan-out for normal users (fast reads). Pull fan-out for celebrity accounts (too many followers to push to). Read path merges both.
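A toy sketch of the hybrid model (in-memory dicts stand in for real storage, and the celebrity threshold is an assumed cutoff; real systems pick it from follower-count distributions):

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"],
             "celebrity": [f"fan_{i}" for i in range(5)]}
author_timeline = defaultdict(list)  # pull side: one copy per author
feeds = defaultdict(list)            # push side: precomputed per follower

CELEBRITY_THRESHOLD = 3  # assumed cutoff for the hybrid split

def post(author: str, tweet: str) -> None:
    author_timeline[author].append(tweet)  # always store once
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for f in followers[author]:  # push: O(followers) writes now
            feeds[f].append(tweet)

def read_feed(user: str, follows: list[str]) -> list[str]:
    merged = list(feeds[user])  # precomputed (push) part
    for author in follows:      # merge in the celebrity (pull) part
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            merged.extend(author_timeline[author])
    return merged

post("alice", "hello")         # pushed to bob and carol immediately
post("celebrity", "big news")  # stored once, pulled at read time
print(read_feed("bob", ["alice", "celebrity"]))  # ['hello', 'big news']
```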
Where fan-out appears:
- Social feeds (posts → follower feeds)
- Notifications (one event → many device pushes)
- Kafka topics (one message → multiple consumer groups each getting a full copy)
- Redis Pub/Sub (one publish → all channel subscribers)
The scaling problem: Fan-out writes scale with the number of consumers, not the number of writes. A user with 10M followers generates 10M writes per post. Rate-limit posting, use async workers, and batch deliveries to manage this.
Head-of-Line Blocking
When a slow or stuck item at the front of a queue or connection blocks every item behind it, even if those items could be processed immediately. The queue can only move as fast as its slowest item.
HTTP/1.1 example:
Request 1 (slow, 2s) ──► Server ──► Response 1 (after 2s)
Request 2 (fast, 1ms) ──► BLOCKED until Request 1 completes
Request 3 (fast, 1ms) ──► BLOCKED until Request 2 completes
Browsers work around this by opening up to ~6 parallel connections per domain (a de facto browser limit, not part of the HTTP/1.1 spec), but that is still a hard cap.
HTTP/2 fix — multiplexing: Multiple requests share one connection as independent streams. A slow stream doesn't block other streams.
Stream 1 (slow) ──┐
Stream 2 (fast) ──┼──► shared connection ──► all processed independently
Stream 3 (fast) ──┘
Where HOL blocking appears:
| Context | How it manifests | Fix |
|---|---|---|
| HTTP/1.1 | Slow request blocks the connection | HTTP/2 multiplexing |
| DB connection pool | One long query holds a connection; others queue | Query timeouts, separate read/write pools |
| Kafka single partition | One slow consumer blocks all messages behind it | More partitions, consumer timeout config |
| Single-threaded queue | One expensive job blocks all others | Separate queues by job type/priority |
Interview signal: If you have a queue mixing fast and slow jobs (e.g. image resize + video transcode), a backed-up transcode job will delay all image resizes behind it. Separate queues by job type to eliminate HOL blocking between workload classes.
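The separate-queues fix can be demonstrated directly (job costs are simulated with `time.sleep`; a `None` sentinel shuts each worker down):

```python
import queue
import threading
import time

image_q: "queue.Queue" = queue.Queue()  # fast jobs
video_q: "queue.Queue" = queue.Queue()  # slow jobs
done: list[str] = []  # completion order (list.append is thread-safe)

def worker(q: "queue.Queue", cost: float) -> None:
    while True:
        job = q.get()
        if job is None:  # shutdown sentinel
            return
        time.sleep(cost)  # simulated work
        done.append(job)

video_q.put("transcode-1")  # slow job queued first
for i in range(3):
    image_q.put(f"resize-{i}")
video_q.put(None)
image_q.put(None)

threads = [threading.Thread(target=worker, args=(image_q, 0.001)),
           threading.Thread(target=worker, args=(video_q, 0.2))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(done)  # resizes finish first despite the transcode being queued earlier
```

With a single shared queue, all three resizes would have waited behind the 200ms transcode.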
Further Reading in These Docs
These concepts are fully covered elsewhere — no need to duplicate them here:
- Idempotency — full section in Non-Functional Requirements
- RPO / RTO — full section in Non-Functional Requirements
- Security terms (TLS, JWT, OAuth2, RBAC, PII) — full section in Non-Functional Requirements