Non-Functional Requirements
Don't draw boxes until you know what the system demands. For each NFR this doc covers what it means, how the answer changes your architecture layer by layer, key terms, and which real systems make it their top priority. Pick the ones most relevant to the system and let them drive your design.
Scale
How big is the system, and where does the load actually hit? Scale affects every layer — not just the database.
Ask:
- How many daily active users?
- What's the read/write ratio?
- Any bursty traffic patterns (holidays, events)?
DAU → QPS conversion:
A day has 86,400 seconds. In interviews, round that to 100,000 — the 16% error is irrelevant at estimation scale and the mental math becomes trivial.
QPS = DAU × requests_per_user_per_day ÷ 100,000
Worked example with 1M DAU and 10 requests/user/day:
QPS = 1,000,000 × 10 ÷ 100,000
= 10,000,000 ÷ 100,000
= 100 QPS
For peak QPS, multiply by 2–3× (traffic isn't uniform — mornings and evenings are heavier):
Peak QPS = 100 × 3 = 300 QPS
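The conversion above is easy to script as a sanity check — a minimal sketch; the function and parameter names are illustrative:

```python
def estimate_qps(dau: int, requests_per_user_per_day: int,
                 peak_multiplier: float = 3.0) -> tuple[float, float]:
    """Back-of-envelope average and peak QPS, using the rounded 100,000-second day."""
    seconds_per_day = 100_000  # interview-friendly rounding of 86,400
    avg_qps = dau * requests_per_user_per_day / seconds_per_day
    return avg_qps, avg_qps * peak_multiplier

avg, peak = estimate_qps(1_000_000, 10)
# avg = 100.0, peak = 300.0 — matches the worked example above
```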
| DAU | QPS (est.) | Layer-by-Layer Impact |
|---|---|---|
| 10K | ~1 | Single server handles everything. No LB, no replicas, no cache needed. (AWS t3.micro, GCP e2-micro — handles thousands of QPS, overkill at 1 QPS) |
| 100K | ~10 | Add LB for redundancy (not load — 10 QPS is trivial). Redis cache if queries are expensive. CDN for static assets. (AWS: ALB + 2× t3.small. GCP: Cloud Load Balancing + 2× e2-small) |
| 1M | ~100 | Multiple app servers behind LB. DB read replicas (1–2). Connection pooler to avoid exhausting DB connections. (AWS: RDS PostgreSQL — up to 5 read replicas, RDS Proxy for connection pooling. GCP: Cloud SQL PostgreSQL — up to 10 read replicas, Cloud SQL Proxy for connection pooling) |
| 10M | ~1,000 | Kafka for async writes. Redis Cluster for cache. DB read replicas still sufficient for reads — don't shard yet. (AWS: Aurora — up to 15 read replicas, handles 100K+ reads/sec, Aurora Serverless v2 auto-scales. GCP: AlloyDB for PostgreSQL — up to 16 read pool nodes, auto-scales. Sharding threshold is 10K–50K QPS or when data volume outgrows a single host) |
| 100M+ | ~10,000+ | Multi-region everything. Global LB. DB sharding or distributed SQL. Connection pooling critical at this tier. CDN serves 80%+ of traffic. (AWS: Aurora Global Database, RDS Proxy, CloudFront + Route53 latency routing. GCP: Cloud Spanner for distributed SQL, Cloud SQL Proxy, Cloud CDN + global anycast Cloud Load Balancing. Cloud-agnostic: CockroachDB, Cloudflare CDN) |
How scale changes tech choices per layer:
- App servers: Single instance → horizontal auto-scaling group → stateless containers (AWS ECS/EKS, GCP Cloud Run/GKE, or self-managed K8s)
- Database: PostgreSQL single → read replicas → sharding → NoSQL or distributed SQL
- Cache: No cache → Redis single → Redis Cluster (partitioned across nodes)
- LB: Not needed → single regional LB (AWS ALB, GCP Cloud Load Balancing) → multi-AZ → global load balancing (AWS Route53 latency routing, GCP global anycast Cloud Load Balancing)
- CDN: Not needed → CDN for static (AWS CloudFront, GCP Cloud CDN, or Cloudflare) → full edge caching with dynamic content
Read/Write ratio shapes your architecture:
- Read-heavy (100:1) → cache aggressively (Redis, CDN), DB read replicas. Twitter feed, Reddit homepage.
- Write-heavy (1:10) → message queues (Kafka) to absorb bursts, append-only logs, async consumers. Consider CQRS (separate write model from read model) to prevent reads from competing with writes. Logging pipeline, analytics ingestion.
- Balanced → general-purpose horizontal scaling.
Burst traffic: If traffic spikes at predictable times (Black Friday, live events), design for auto-scaling and queue-based buffering, not steady-state peak capacity.
Storage estimate: DAU × avg_event_size × events_per_day × retention_days
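The storage formula scripted the same way (illustrative names; the result is raw application data, before replication and indexes):

```python
def estimate_storage_gb(dau: int, avg_event_size_bytes: int,
                        events_per_day: int, retention_days: int) -> float:
    """Raw storage from the formula above; excludes replication factor and indexes."""
    total_bytes = dau * avg_event_size_bytes * events_per_day * retention_days
    return total_bytes / 1e9  # decimal GB is fine at estimation scale

# 1M DAU, 1 KB events, 10 events/user/day, 1-year retention
gb = estimate_storage_gb(1_000_000, 1_000, 10, 365)
# 3,650 GB ≈ 3.65 TB raw — multiply by 3 for replication factor 3
```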
Most critical for: Twitter/X (read-heavy feed), YouTube (video storage + CDN), Uber (surge traffic), ticketing systems (flash sales).
Latency
How fast must the system respond? This determines where you place compute, what stays synchronous, and what you offload.
Ask:
- What's the acceptable p99 response time?
- Are there specific operations that must be fast?
What is p99? Sort all requests in a window by response time; p99 is the value at the 99th percentile — 99% of requests completed faster than it. P50 is the median. P99.9 catches the slowest 1 in 1,000. At 1,000 QPS, 10 requests every second are slower than your p99.
| Target | What It Means | Design Impact |
|---|---|---|
| < 10ms | Ultra-low. Real-time systems. | Data must live in-process memory or local Redis. No network hops. Compute co-located with data. |
| < 100ms | Feels instant to users. | Read from Redis cache (same DC, sub-millisecond). Serve static assets from CDN. No synchronous cross-region calls. |
| < 500ms | Interactive. Standard web UX. | Cache reads from Redis. Async writes (publish to queue, return 200). DB reads must hit indexes. |
| 1–5s | Tolerable for complex queries. | Background jobs for heavy computation. DB aggregations OK if indexed. Show loading states. |
| > 5s | Batch is fine. | Async processing, queues, offline jobs. No sync response needed. |
Why p99 and not average — average hides pain: 99 requests complete in 50ms. One hangs for 10,000ms.
Average = (99×50 + 10,000) / 100 = 149.5ms ← looks healthy
P99 = 10,000ms ← system is on fire
At Walmart's scale — 260 million customers a week — 1% is 2.6 million people experiencing that hang. Average would never surface it.
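Reproducing those numbers directly, using the simple sort-and-index definition of p99 (a throwaway sketch — real monitoring uses histograms, not raw sample lists):

```python
# 99 fast requests and one 10-second hang — the scenario above
latencies_ms = [50] * 99 + [10_000]

average = sum(latencies_ms) / len(latencies_ms)

ranked = sorted(latencies_ms)
p99 = ranked[int(0.99 * len(ranked))]  # index 99 → the hanging request

# average = 149.5 (looks healthy), p99 = 10000 (system is on fire)
```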
Why p99 specifically:
| Percentile | What it catches | Problem |
|---|---|---|
| P50 | Half your users | Too lenient — misses most of the tail |
| P95 | Most issues | Can miss slow-building degradation |
| P99 | Tail latency — industry standard | Sensitive without being noisy |
| P100 | Single worst request | Always an outlier — useless for decisions |
P99 is the SLA standard because it catches real problems early without false-alarming on one-off outliers.
P99 spike = early overload warning: When a system gets busy, it fails at the tail first — not uniformly.
| State | P50 | P99 |
|---|---|---|
| Healthy | 40ms | 80ms |
| Getting busy | 45ms | 400ms |
| Overloaded | 80ms | 2,500ms |
| Crashed | timeout | timeout |
By the time P50 degrades, you're already in serious trouble. P99 gives you the window to act — shed load, scale out, open a circuit — before the majority of users feel it.
Setting the threshold — from your SLA, not arbitrary: Set the shedding trigger at a fraction of your SLA, not after you've already breached it.
Inventory reservation SLA = 500ms → shed when p99 > 400ms (80% of SLA)
Homepage SLA = 2,000ms → shed when p99 > 1,600ms (80% of SLA)
The gap between threshold and SLA is your response window. Trigger too late and you're already breaking the promise.
How it's measured: Rolling window over the last N seconds (typically 10s), recomputed every 5 seconds. In practice, use Micrometer or Prometheus histograms rather than sorting raw request lists — same concept, far more efficient.
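The shedding trigger can be sketched as follows — a simplified, illustrative version that keeps raw samples in memory; a production service would read p99 from Prometheus or Micrometer histograms instead:

```python
import random
from collections import deque

class LoadShedder:
    """Shed load when rolling-window p99 exceeds a fraction of the SLA.
    A fixed-size deque stands in for the 10s rolling window."""

    def __init__(self, sla_ms: float, trigger_fraction: float = 0.8,
                 window_size: int = 1000):
        self.threshold_ms = sla_ms * trigger_fraction  # e.g. 500ms SLA → 400ms trigger
        self.samples = deque(maxlen=window_size)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p99(self) -> float:
        ranked = sorted(self.samples)
        return ranked[int(0.99 * len(ranked))] if ranked else 0.0

    def should_shed(self) -> bool:
        return self.p99() > self.threshold_ms

shedder = LoadShedder(sla_ms=500)               # inventory reservation example
for _ in range(990):
    shedder.record(random.uniform(20, 80))      # healthy traffic, tail well under 400ms
healthy = shedder.should_shed()                 # False
for _ in range(20):
    shedder.record(2_500)                       # tail latency blows past the trigger
overloaded = shedder.should_shed()              # True — start shedding before the SLA breaks
```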
Latency vs throughput: Latency is how fast one request completes. Throughput is how many requests per second the system handles. Optimizing for one can hurt the other — batching increases throughput but adds per-request latency. Know which the interviewer cares about.
p99 matters more than average. If the interviewer says "users complain it feels slow", think tail latency, not mean. One unindexed query or missing cache can blow your p99 while your average looks fine.
Per-component latency costs: For exact numbers per hop (LB, Redis, DB, Kafka, S3, etc.) and worked end-to-end request path examples, see the Latency Reference Table in System Design Layers.
Most critical for: Search/autocomplete (Yelp, Google — < 100ms), stock trading (HFT — microseconds), multiplayer gaming, ride-matching (Uber — driver must get request fast).
Availability
How much downtime is acceptable? This drives redundancy, replication topology, and failover strategy across all layers.
Ask:
- What's the uptime requirement?
- What happens to users if this goes down?
| SLA | Downtime/Year | Downtime/Month | What It Looks Like Across Layers |
|---|---|---|---|
| 99% | ~3.6 days | ~7.2 hours | Single region. Single DB. Basic health check restarts crashed app server. |
| 99.9% | ~8.7 hours | ~43 min | Multi-AZ LB. 2+ app server instances. DB with automatic failover replica (AWS RDS Multi-AZ ~60s switchover, GCP Cloud SQL HA ~60s switchover). Redis Sentinel for cache failover. |
| 99.99% | ~52 min | ~4.3 min | Active-active across 2 regions. DB replication across regions with < 1min failover. CDN absorbs traffic if origin partially down. App servers auto-scale and auto-replace. |
| 99.999% | ~5 min | ~26 sec | No single point of failure anywhere. LB: multiple active nodes. App: blue/green deploys with instant rollback. DB: synchronous multi-region replication (AWS Aurora Global, GCP Spanner, or CockroachDB). Cache: Redis Cluster across AZs. Queue: Kafka with replication factor 3. |
How each layer achieves redundancy:
- Load Balancer: Health checks drop unhealthy app servers from rotation in seconds. Multi-AZ deployment so one AZ outage doesn't take the LB down.
- App Servers: Stateless (no local state) so any instance can handle any request. Auto-scaling group replaces failed instances automatically.
- Cache (Redis): Redis Sentinel (monitors, auto-promotes replica on primary failure). Redis Cluster (shards + replicas, handles node loss).
- Database: Primary + read replica → automatic failover on primary crash. Multi-region → async or sync replication depending on consistency needs.
- Message Queue: Kafka replication factor ≥ 3 means 2 broker deaths don't lose messages.
- CDN: CDN providers are globally redundant by design (AWS CloudFront, GCP Cloud CDN, Cloudflare).
CAP Theorem tradeoff: During a network partition, choose availability (keep serving, possibly stale) or consistency (stop serving until consistent). Most consumer apps choose availability. Payment systems choose consistency.
Graceful degradation: Netflix shows cached thumbnails when recommendations service is down. Don't fail completely — fail partially.
SLI / SLO / SLA — know the difference:
- SLI (Service Level Indicator) — the actual measured metric. E.g., "97.8% of requests completed in < 200ms this week."
- SLO (Service Level Objective) — your internal target. E.g., "99.9% of requests must complete in < 200ms." This is what your team is held to.
- SLA (Service Level Agreement) — the contractual commitment to customers, with financial penalties for breach. Always looser than your SLO (you'd go out of business otherwise).
Error budget: 1 − SLO. A 99.9% SLO gives you ~43 min/month to spend on incidents and deploys. When the budget is gone, freeze non-critical changes until the window resets.
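The budget arithmetic as a one-line helper (illustrative):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over a 30-day month → ~43.2 minutes
```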
Each extra 9 is roughly 10× harder and more expensive. Push back if the requirement seems over-engineered.
Most critical for: Payment processors (Stripe, Visa — 99.999%), AWS infrastructure, healthcare systems, any system where downtime = revenue loss or safety risk.
Consistency
When a write happens, when do all nodes and users see it? This is the core CAP tradeoff in practice.
Ask:
- Can users see slightly stale data?
- If two users write at the same time, does it matter which one wins?
| Model | What It Means | When to Use | Real Example |
|---|---|---|---|
| Strong (Linearizable) | All reads see the latest write immediately across all nodes. Each operation appears to take effect at a single instant in time. | Payments, inventory, bank balances | PostgreSQL, Zookeeper, Spanner |
| Read-your-writes | You always see your own latest write. Other users may lag briefly. | Profile updates, settings | Most social apps for own data |
| Eventual | Writes propagate across nodes eventually. Briefly stale OK. | Social feeds, like counts, view counts | Cassandra, DynamoDB default, DNS |
| Causal | Causally related writes are seen in order by all nodes. Unrelated writes can appear in any order. | Comments/replies, collaborative editing, chat | MongoDB causally consistent sessions, Azure Cosmos DB session consistency |
ACID vs BASE: Relational databases give you ACID (Atomic, Consistent, Isolated, Durable) — all or nothing, always correct. Most NoSQL databases give you BASE (Basically Available, Soft state, Eventually consistent) — always up, eventually right. Choosing a DB is often choosing between these two philosophies.
Idempotency: A write operation is idempotent if calling it multiple times produces the same result as calling it once. Critical when clients retry on network failure — without it, a retried payment charges the user twice.
Causal consistency explained: If Alice posts "I'm going to the store" and Bob replies "I'll come with you", causal consistency guarantees Carol always sees Alice's post before Bob's reply — because Bob's reply causally depends on Alice's post. But Carol might see Alice's post before or after Dave's unrelated status update — that's fine, they're not causally linked.
This is stronger than eventual (which could show Bob's reply before Alice's post) but weaker than strong (which globally orders every single write). It's the right choice when order matters within a thread or conversation, but not globally.
Interview signal: "It's fine if the like count is off by a few seconds" → eventual consistency, scale horizontally. "Double-charging a user is unacceptable" → strong consistency, accept the latency cost.
Most critical for: Banking and payments (double-spend prevention), inventory systems (Amazon — can't oversell), booking systems (airline seats, hotel rooms).
Idempotency
If a client retries a request, will it cause duplicate side effects? This shapes how you design APIs and payment flows.
Ask:
- Can clients retry failed requests safely?
- Are there operations where duplicates are catastrophic (charges, transfers, order submissions)?
The problem: A client sends a payment request. The server processes it, but the response is lost in transit. The client retries. Without idempotency, the user gets charged twice.
The solution — idempotency keys: Client generates a unique key per logical operation (e.g., UUID) and sends it with the request. Server stores (idempotency_key → result) in Redis or DB. On duplicate request: return the stored result, skip re-execution.
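A minimal sketch of the key-store pattern, with an in-memory dict standing in for Redis or the DB (all names here are illustrative):

```python
import uuid

class PaymentService:
    """Idempotency-key handling: execute once, replay the stored result on retries."""

    def __init__(self):
        self._results: dict[str, dict] = {}  # idempotency_key → stored result
        self.charges_executed = 0

    def charge(self, idempotency_key: str, user_id: str, amount_cents: int) -> dict:
        # Duplicate request (client retry): return stored result, skip re-execution
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_executed += 1  # the real side effect happens exactly once
        result = {"status": "charged", "user": user_id, "amount": amount_cents}
        self._results[idempotency_key] = result
        return result

svc = PaymentService()
key = str(uuid.uuid4())                 # client generates one key per logical operation
first = svc.charge(key, "alice", 500)
retry = svc.charge(key, "alice", 500)   # network retry with the same key: no double charge
```

A production version would reserve the key atomically (e.g. Redis SET with NX) so two concurrent retries can't both execute, and expire stored results with a TTL.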
| Operation Type | Naturally Idempotent? | Fix |
|---|---|---|
| GET, DELETE | Yes (GET reads, DELETE on missing is no-op) | Nothing needed |
| PUT (replace entire resource) | Yes | Nothing needed |
| POST (create, charge, transfer) | No | Add idempotency key |
| Message consumer processing | No | Track processed message IDs in DB |
Idempotency in queues: A Kafka consumer that crashes mid-processing will re-receive the same message. Design consumers to be idempotent — check if the event was already processed (by storing the event ID) before acting on it.
Most critical for: Payment APIs (Stripe uses idempotency keys on every charge endpoint), order submission, booking systems, any POST that creates or transfers.
Durability
How much data loss is acceptable if the system crashes or a node goes down?
Ask:
- If we lose a server right now, what's the worst acceptable outcome?
- Can we replay events from a log?
RPO = Recovery Point Objective — how much data can we lose? Measured in time: an RPO of 0, 1s, or 1hr means you may lose up to that much recent data. RTO = Recovery Time Objective — how long can the system be down during recovery?
| Term | Definition | Design Impact | Latency Cost |
|---|---|---|---|
| RPO = 0 | Zero data loss. | Synchronous replication: primary waits for all replicas to confirm before acking the write. | +5–20ms per write at DB layer. +10–50ms if two-phase commit across services. |
| RPO = seconds | Tiny loss OK. | Async replication. WAL (write-ahead log) shipped to replica continuously. | No extra latency — write acks immediately, replication happens in background. |
| RPO = hours | Some loss tolerable. | Periodic snapshots or nightly backups. | No latency impact. |
| RTO = seconds | Must recover near-instantly. | Hot standby replica already running, promoted automatically on failure (~30–60s). | No latency impact on normal path. |
| RTO = minutes | Fast recovery needed. | Warm standby: replica exists but not serving traffic. Promoted manually or semi-auto. | — |
| RTO = hours | Slower recovery OK. | Restore from backup. Spin up new instance. | — |
RPO=0 adds latency at every layer that writes:
- DB: Synchronous replication adds 5–20ms (waiting for replica in another AZ to confirm).
- App layer: If using distributed transactions (two-phase commit), add 10–50ms.
- Queue: Kafka with acks=all (wait for all in-sync replicas) adds 2–5ms vs acks=1.
This is the direct tradeoff: stronger durability = higher write latency.
Real examples:
- Banking: RPO = 0. Every transaction written synchronously to multiple replicas before confirmation.
- Social media posts: RPO = seconds is fine. Async replication acceptable.
- Object storage: 11 nines of durability via cross-AZ redundant storage (AWS S3, GCP Cloud Storage).
Most critical for: Banking and financial systems, medical records (Epic, FHIR), legal document storage, payment transaction logs — any system where lost data = legal or financial liability.
Fault Tolerance
How well does the system handle partial failures without going fully down?
Ask:
- What happens when one server crashes?
- What happens when a whole datacenter goes down?
- What if a dependency is slow or unavailable?
| Failure Type | Strategy | Example |
|---|---|---|
| Single node crash | Redundant replicas, auto-failover | DB primary/replica, load balancer health checks |
| Slow dependency | Timeouts + circuit breaker | Stop calling a failing service; return fallback |
| Datacenter outage | Multi-AZ or multi-region active-active | Route traffic to surviving region |
| Data corruption | Checksums, write-ahead logs, point-in-time restore | Detect and roll back bad writes |
| Cascading failures | Bulkheads (isolate failure domains), rate limiting | Don't let one slow service take down everything |
Circuit breaker pattern: If a downstream service fails N times in a row, stop calling it for a period. Return a cached/default response. Let the dependency recover before retrying.
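A minimal illustrative implementation of the pattern — real systems typically use a library (resilience4j, pybreaker) that also adds half-open probing:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, serve a fallback while open,
    try the dependency again after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback          # open: don't touch the failing service
            self.opened_at = None        # cooldown elapsed: allow a retry
            self.failures = 0
        try:
            result = fn()
            self.failures = 0            # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback

def flaky():
    raise ConnectionError("downstream is down")

breaker = CircuitBreaker(failure_threshold=3)
results = [breaker.call(flaky, fallback="cached-default") for _ in range(5)]
# After the 3rd failure the breaker opens; calls 4–5 never invoke flaky() at all
```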
Retry with exponential backoff + jitter: When a request fails, wait before retrying — and double the wait each attempt (backoff). Add random jitter so all retrying clients don't slam the service at the same moment (thundering herd). A common sequence: retry after 1s, 2s, 4s, 8s with ±30% jitter, then give up and dead-letter.
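The retry schedule above can be sketched as (illustrative helper; the waits are random by design):

```python
import random

def backoff_schedule(max_attempts: int = 4, base_s: float = 1.0,
                     jitter: float = 0.3) -> list[float]:
    """Exponential backoff with ±30% jitter — the 1s/2s/4s/8s sequence above."""
    waits = []
    for attempt in range(max_attempts):
        base = base_s * (2 ** attempt)                    # 1, 2, 4, 8
        waits.append(base * random.uniform(1 - jitter, 1 + jitter))
    return waits

schedule = backoff_schedule()
# e.g. [0.84, 2.31, 3.92, 9.07] — doubling waits with randomness so retrying
# clients don't all slam the service at the same instant
```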
Dead Letter Queue (DLQ): When a message fails processing repeatedly (after N retries), route it to a DLQ instead of blocking the queue. The DLQ holds poisoned messages for inspection and manual replay. Without a DLQ, one bad message can stall an entire consumer group indefinitely. (AWS SQS dead-letter queues, GCP Pub/Sub dead-letter topics, or a separate Kafka topic).
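A toy sketch of the routing logic, with plain lists standing in for the queue and the DLQ (names are illustrative; real systems use SQS/Pub-Sub DLQs or a separate Kafka topic):

```python
def process_with_dlq(messages, handler, max_retries: int = 3):
    """Retry each message up to max_retries; park persistent failures in a
    dead-letter list instead of blocking the rest of the batch."""
    processed, dlq = [], []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                processed.append(handler(msg))
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    dlq.append((msg, str(exc)))  # poisoned message: park it, move on

    return processed, dlq

def handler(msg):
    if msg == "bad":
        raise ValueError("unparseable payload")
    return msg.upper()

processed, dlq = process_with_dlq(["a", "bad", "b"], handler)
# processed == ["A", "B"]; the poisoned message lands in the DLQ for inspection
```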
Most critical for: Microservices architectures (each service can fail independently), distributed databases, any system with SLA > 99.9%.
Security
What data does the system handle and who should access it? Drives auth, encryption, and regulatory design.
Ask:
- Does this handle PII, payments, or health data?
- Who are the users — public, internal, B2B?
Key terms:
- PII (Personally Identifiable Information) — any data that can identify a person: name, email, phone, SSN, IP address. Triggers GDPR/HIPAA obligations.
- TLS (Transport Layer Security) — encrypts data in transit (the "S" in HTTPS). Prevents interception.
- AES-256 (Advanced Encryption Standard) — standard algorithm for encrypting data at rest. Used in S3, databases, filesystems.
- JWT (JSON Web Token) — signed token the client sends with each request to prove identity. Stateless, server doesn't store sessions.
- OAuth2 — standard for delegated auth. "Sign in with Google" is OAuth2. Separates identity from your app.
- mTLS (Mutual TLS) — both sides verify certificates. Used for service-to-service auth inside your system.
- RBAC (Role-Based Access Control) — users get roles (admin, editor, viewer), roles get permissions. Simpler than per-user rules.
- ACL (Access Control List) — per-resource list of who can do what. More granular than RBAC (e.g. S3 bucket policies).
Security by layer:
| Layer | What Goes Here |
|---|---|
| CDN | DDoS protection, WAF (Web Application Firewall) blocks malicious requests before they reach origin |
| Load Balancer | TLS termination (decrypt HTTPS here, forward HTTP internally), IP whitelisting |
| API Gateway | Authentication (verify JWT/OAuth token), rate limiting (token bucket), request validation |
| App Server | Authorization (RBAC checks — "can this user do this action?"), input validation, business logic security |
| Cache (Redis) | Don't cache raw PII if avoidable. Redis AUTH password. Encrypt sensitive values if stored. |
| Database | AES-256 encryption at rest. Row-level security for multi-tenant data. Least-privilege DB users. Audit log here — append-only table logging who accessed what and when. |
| Object Storage | Signed URLs for private files (time-limited access). Bucket policies. Server-side encryption. |
Most critical for: Healthcare (HIPAA — audit every access to patient records), financial systems (PCI-DSS — card data tokenized immediately), auth systems (OAuth provider), any multi-tenant SaaS.
Compliance
Are there legal or regulatory constraints that shape the architecture?
Ask:
- What region are users in?
- Does this handle health, financial, or personal data?
Key terms:
- GDPR (General Data Protection Regulation) — EU law. Applies to any system with EU users, regardless of where the company is located.
- HIPAA (Health Insurance Portability and Accountability Act) — US law governing health data. Applies to any app handling patient records.
- PCI-DSS (Payment Card Industry Data Security Standard) — required for any system that stores, processes, or transmits card data.
- SOC 2 — US auditing standard for SaaS companies. Type I = point-in-time assessment. Type II = 6 months of continuous evidence. Required by enterprise buyers.
| Regulation | Who It Affects | Key Architecture Constraint |
|---|---|---|
| GDPR (EU) | Any system with EU users | Data residency in EU. Right to delete (complicates append-only logs). Breach notification in 72hrs. |
| HIPAA (US healthcare) | Medical records, health apps | Audit log every data access. Encryption in transit and at rest. Business associate agreements with vendors. |
| PCI-DSS (payments) | Any system touching card data | Card data never stored raw — tokenize immediately on receipt. Annual third-party audits. Network segmentation. |
| SOC 2 | B2B SaaS | Documented security controls. Access reviews. Incident response plan. |
GDPR complicates event-sourcing: Append-only logs make "right to delete" hard — you can't erase a past event. Solve with tombstone records or keep PII in a separate deletable store and only store user IDs in the event log.
Most critical for: Healthcare apps, payment processors, social platforms with EU users, any enterprise B2B SaaS sold to regulated industries.
Monitoring & Observability
How do you know the system is healthy in production? Drives logging, metrics, and alerting design.
Ask:
- Do you need real-time alerting?
- How quickly must the team detect and diagnose production issues?
| Signal | What It Covers | Tools |
|---|---|---|
| Metrics | QPS, latency, error rate, CPU/memory/disk | Prometheus, Datadog, AWS CloudWatch, GCP Cloud Monitoring |
| Logs | What happened and in what order | ELK stack, Splunk, AWS CloudWatch Logs, GCP Cloud Logging |
| Traces | Where time was spent across services | Jaeger, Zipkin, AWS X-Ray, GCP Cloud Trace |
| Alerts | Notify when SLA is breached | PagerDuty, Opsgenie |
The four golden signals (Google SRE): Latency, Traffic, Errors, Saturation. Build monitoring around these first.
- Latency — how long requests take (track p99, not average)
- Traffic — how much load the system is under (QPS, requests/sec)
- Errors — rate of failed requests (5xx errors, timeouts, exceptions)
- Saturation — how "full" a resource is. A CPU at 95% is saturated. A disk at 98% capacity is saturated. Saturation predicts future failure — a resource approaching 100% will soon become a bottleneck and cause latency spikes or crashes. Monitor: CPU %, memory %, disk I/O utilization, DB connection pool usage, queue depth.
Most critical for: Any system with a strict SLA, microservices (failures are hard to trace), Netflix-style chaos engineering, financial systems where bugs cost real money.
Environment Constraints
Are there non-standard constraints on the environment the system runs in?
Ask:
- Are clients on mobile or constrained devices?
- Are there low-bandwidth or offline scenarios to handle?
| Constraint | Design Impact |
|---|---|
| Mobile clients | Minimize payload size. Compress responses. Offline-first with local cache. |
| Low bandwidth (3G/rural) | Adaptive bitrate streaming (YouTube, Netflix). Delta sync instead of full sync. |
| Limited battery | Batch network calls. Avoid polling — use push (WebSockets, FCM). |
| Edge/IoT devices | Lightweight protocols (MQTT). Local processing before cloud sync. |
| Offline-first | Local DB (SQLite), sync on reconnect, conflict resolution strategy. |
Most critical for: Uber driver app (poor network in some cities), Google Maps offline, WhatsApp (works on 2G), IoT sensor pipelines, healthcare apps in hospitals with restricted networks.
Quick Reference — Which NFR Matters Most
| System | Top NFRs to Prioritize |
|---|---|
| Banking / payments | Consistency, Idempotency, Durability, Security, Compliance |
| Social feed (Twitter, Instagram) | Scale, Availability, Latency |
| Healthcare records | Durability, Security, Compliance, Availability |
| Search / autocomplete (Yelp, Google) | Latency, Scale |
| Ride-sharing (Uber) | Availability, Latency, Fault Tolerance, Environment |
| Video streaming (Netflix, YouTube) | Scale, Availability, Latency, Environment |
| Chat / messaging (WhatsApp) | Availability, Durability, Environment |
| Ticketing / booking (Airbnb, airlines) | Consistency, Availability, Scale |
| Enterprise SaaS | Security, Compliance, Availability |
| IoT / sensor pipeline | Scale, Fault Tolerance, Environment, Durability |