Technology Selection

Which technology to pick and why. Use this after NFRs tell you what the system needs — these are the decision criteria that map requirements to specific tools.


Database

Choose the right database for your access pattern, consistency requirements, and scale. Default to SQL unless you have a clear reason not to.

SQL vs NoSQL — Pick One First

The most important database decision — make this call before comparing specific products.

| Choose SQL when... | Choose NoSQL when... |
|---|---|
| Data is structured and relational | Data is unstructured, hierarchical, or variable schema |
| You need ACID transactions | You need massive write scale or eventual consistency is OK |
| Complex queries, joins, aggregations | Simple key-value or document lookups |
| Strong consistency required | Horizontal scale is the top priority |
| Data fits on one server or with replicas | Data must be sharded across many nodes |
| Team knows SQL | Column-family or document model fits the access pattern better |

Rule of thumb: Default to SQL (PostgreSQL). Only switch to NoSQL when you have a clear reason — scale, schema flexibility, or access pattern that SQL handles poorly.


SQL Databases

Relational databases with ACID guarantees — pick based on scale needs and whether you're on AWS or self-hosted.

| DB | Best For | Limits | Avoid When |
|---|---|---|---|
| PostgreSQL | General purpose. Complex queries, JSONB, full-text search, geospatial (PostGIS). | ~1–2 TB comfortable, ~5–10K writes/sec before sharding | Need extreme write throughput or massive horizontal scale |
| MySQL | High-read web apps. Slightly faster reads than Postgres on simple queries. Battle-tested. | Similar to PostgreSQL | Need advanced features (window functions, JSONB) — use Postgres instead |
| CockroachDB | Distributed SQL. Horizontal scale with ACID. Multi-region. | Higher latency than single-node Postgres (~5–10 ms vs 1–2 ms) | Latency is critical and data fits on one node |
| Amazon Aurora | Managed PostgreSQL/MySQL. Up to 5× the throughput of standard MySQL. Auto-scales storage. | AWS lock-in. More expensive. | On-prem or multi-cloud requirement |

NoSQL Databases

Databases optimized for specific access patterns — each trades ACID guarantees for a different combination of scale, flexibility, or query speed.

| DB | Model | Best For | Limits | Avoid When |
|---|---|---|---|---|
| Cassandra | Wide-column | High write throughput, time-series, event logs, messaging. Scales to PB. | Eventual consistency only. No joins. Query patterns must be known upfront — design tables around queries. | Strong consistency needed, or ad-hoc queries |
| DynamoDB | Key-value + document | Serverless scale, unpredictable traffic, AWS ecosystem. Auto-scales reads/writes. | AWS lock-in. Expensive at high scale. Limited query flexibility. | Complex queries or non-AWS stack |
| MongoDB | Document (JSON) | Flexible/nested data, content management, catalogs, user profiles. | Consistency weaker than SQL. Not great for relational data. | Highly relational data with lots of joins |
| Redis | Key-value + data structures | Cache, sessions, leaderboards, pub/sub, geo, rate limiting. Sub-ms reads. | Data must fit in RAM. Not a primary DB for large datasets. | Durability is critical and there's no backup strategy |
| InfluxDB / TimescaleDB | Time-series | Metrics, monitoring, IoT sensor data, financial tick data. | Not general purpose. Queries outside time-range patterns are awkward. | Data isn't time-series |

Database Decision Tree

Follow this path from requirements to a specific database choice in under 30 seconds.

Need ACID transactions across multiple tables?
├── Yes → SQL (PostgreSQL default)
│         ├── Need horizontal write scale? → CockroachDB or shard PostgreSQL
│         └── Managed + AWS? → Aurora
└── No → What's the access pattern?
          ├── High write throughput + time-ordered? → Cassandra
          ├── Key lookups + AWS + auto-scale? → DynamoDB
          ├── Flexible/nested documents? → MongoDB
          ├── Cache + data structures + sub-ms? → Redis
          └── Timestamped metrics/sensor data? → InfluxDB / TimescaleDB
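The tree above can be encoded as a tiny helper to make the branching explicit. This is an illustrative sketch only — the flag names and access-pattern labels are made up for this example, not a real library API:

```python
def pick_database(needs_acid: bool, access_pattern: str) -> str:
    """Walk the decision tree: ACID first, then access pattern."""
    if needs_acid:
        # SQL branch; CockroachDB for horizontal write scale, Aurora for managed AWS
        return "PostgreSQL"
    return {
        "write_heavy_time_ordered": "Cassandra",
        "key_lookup_aws_autoscale": "DynamoDB",
        "flexible_documents": "MongoDB",
        "cache_data_structures": "Redis",
        "time_series_metrics": "InfluxDB / TimescaleDB",
    }.get(access_pattern, "PostgreSQL")  # when in doubt, default to SQL
```

The fall-through default mirrors the rule of thumb: unless an access pattern clearly demands NoSQL, answer PostgreSQL.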

Message Queue

Choose the right queue or stream based on durability needs, throughput, routing complexity, and whether you want managed infrastructure.

Kafka vs SQS vs RabbitMQ vs Redis Pub/Sub

Four queue options compared across the dimensions that matter most for picking one in an interview.

| | Kafka | SQS | RabbitMQ | Redis Pub/Sub |
|---|---|---|---|---|
| Durability | Yes — messages persisted to disk | Yes — managed by AWS | Yes | No — fire and forget |
| Replay | Yes — consumers can replay from any offset | No — consumed messages are deleted | No | No |
| Multiple consumers | Yes — each consumer group gets all messages | No — each message goes to one consumer | Yes — via exchanges/routing | Yes — all subscribers get each message |
| Throughput | Millions/sec | Nearly unlimited (standard); ~3K/sec FIFO with batching | ~50K/sec | Millions/sec |
| Latency | 5–15 ms | ~1–10 ms | ~1–5 ms | < 1 ms |
| Ordering | Per partition | Best-effort (FIFO queues: strict but slower) | Per queue | Not guaranteed |
| Complexity | High — brokers, ZooKeeper/KRaft, partitions | Low — fully managed | Medium | Very low |

When to use each:

  • Kafka: High throughput event streaming. Multiple independent consumers. Replay needed (audit, ML pipelines). Event sourcing. Use for: Uber GPS, analytics pipelines, activity feeds.
  • SQS: Simple async task queue. AWS ecosystem. Don't want to manage infrastructure. Each task processed by one worker. Use for: email sending, image resizing, background jobs.
  • RabbitMQ: Complex routing rules. Priority queues. Message TTL. Fan-out with filtering. Use for: task routing across microservices, job prioritization.
  • Redis Pub/Sub: Real-time fan-out with no durability requirement. Extremely low latency. Use for: WebSocket fan-out, live notifications, chat room broadcasting.

Rule of thumb: Kafka for streams, SQS for tasks, RabbitMQ for routing, Redis Pub/Sub for real-time broadcast.
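The fire-and-forget semantics that separate Redis Pub/Sub from the durable options can be shown with a minimal in-memory stand-in — a sketch, not Redis itself; the class and method names are illustrative (though `publish` returns a receiver count like Redis's `PUBLISH`):

```python
from collections import defaultdict

class PubSub:
    """In-memory stand-in for Redis Pub/Sub semantics: every *current*
    subscriber receives every message; nothing is persisted, so a
    subscriber that connects later sees nothing (fire and forget)."""

    def __init__(self):
        self._subs = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def publish(self, channel, message):
        # Deliver to all current subscribers; like Redis PUBLISH,
        # return how many receivers got the message.
        for cb in self._subs[channel]:
            cb(message)
        return len(self._subs[channel])
```

A publish to a channel with zero subscribers returns 0 and the message is simply gone — exactly why Pub/Sub is wrong for tasks that must not be lost, and fine for live broadcast.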


Cache

Redis is the default choice for almost every caching use case — pick Memcached only if you need pure string throughput with no other features.

Redis vs Memcached

Feature-by-feature comparison — Redis wins in almost every column, but Memcached has lower overhead for pure get/set workloads.

| | Redis | Memcached |
|---|---|---|
| Data structures | Strings, Lists, Sets, Sorted Sets, Hashes, Geo, Streams | Strings only |
| Persistence | Optional (RDB snapshots, AOF log) | None |
| Pub/Sub | Yes | No |
| Cluster / sharding | Redis Cluster built-in | Client-side sharding only |
| Threads | Single-threaded commands (multi-threaded I/O since v6) | Multi-threaded |
| Throughput | ~100K–1M ops/sec | ~1M+ ops/sec for simple get/set |
| Memory overhead | Higher (richer data structures) | Lower |

When to choose Memcached: Pure simple string cache with extremely high get/set throughput and no need for any Redis features. Rare — Redis handles this fine for most systems.

When to choose Redis: Everything else. Sorted sets, pub/sub, geo, persistence, Lua scripts, cluster support. Redis is the default choice.
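Sorted sets are the canonical "Redis wins" feature — a leaderboard is one `ZADD` per score update and one `ZREVRANGE` per read. A toy in-memory stand-in, assuming scores fit in one process (a real system would use Redis itself; the method names mirror the Redis commands):

```python
class Leaderboard:
    """Toy stand-in for a Redis sorted set (ZADD / ZREVRANGE WITHSCORES)."""

    def __init__(self):
        self._scores = {}  # member -> score; ZADD overwrites the old score

    def zadd(self, member, score):
        self._scores[member] = score

    def zrevrange(self, start, stop):
        # Highest score first, inclusive stop index like Redis
        ranked = sorted(self._scores.items(), key=lambda kv: -kv[1])
        return ranked[start:stop + 1]
```

Memcached can only store a serialized blob of the whole leaderboard, which forces a read-modify-write cycle on every update — the structural reason Redis is the default here.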


Load Balancer

Distributes traffic across app servers — the choice depends on protocol, latency requirements, and whether you're on AWS or self-managed.

| | AWS ALB | AWS NLB | Nginx / HAProxy |
|---|---|---|---|
| Protocol | HTTP/HTTPS, WebSocket | TCP/UDP, TLS passthrough | HTTP, TCP, anything |
| Latency | ~1–5 ms | ~100 µs — extremely fast | ~1 ms self-managed |
| Routing | Path-based, host-based, header-based | IP + port only | Highly configurable |
| SSL termination | Yes | No (passthrough) or Yes (TLS listener) | Yes |
| Managed | Fully (AWS) | Fully (AWS) | Self-managed |
| Use when | Default for web apps, APIs, WebSocket | Gaming, VoIP, financial, static IP needed | Not on AWS, or need fine-grained control |

Rule of thumb: Default to AWS ALB for web and API traffic. Use NLB only when you need sub-millisecond latency, UDP, or a static IP. Use Nginx/HAProxy when you're not on AWS or need custom configuration.


API Gateway

A single entry point for all client requests. Its primary purpose is routing — directing requests to the right backend service. Middleware (auth, rate limiting, logging) is secondary. Clients don't need to know your internal service structure.

Request flow:

Request → Validate → Middleware → Route → Backend → Transform → Cache → Response

What it handles (middleware responsibilities):

  • Auth — JWT validation, API keys, OAuth token introspection
  • Rate limiting — per user, per IP, per endpoint (see building_blocks for algorithms)
  • SSL termination — decrypt HTTPS at gateway, plain HTTP internally (offloads CPU from backends)
  • Request/response transformation — HTTP ↔ gRPC protocol translation, header injection, body format conversion
  • Caching — cache full responses for non-user-specific endpoints (e.g. product catalog, public feeds). Never cache user-specific data.
  • Logging and distributed tracing — inject trace IDs, forward to Datadog/Prometheus
  • Circuit breaker — fail fast when a backend service is struggling instead of piling up requests (see Reliability Patterns in building_blocks)
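The middleware pipeline above can be sketched as a chain that either rejects a request or passes it on. This is an illustrative skeleton, not a real gateway: the token check, the fixed-window rate limiter, and the status-code returns are all simplified assumptions (production gateways use JWT validation and usually token-bucket or sliding-window limiting):

```python
import time

class RateLimiter:
    """Fixed-window limiter: at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit, window=60):
        self.limit, self.window = limit, window
        self.counts = {}  # key -> (window_start, count)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired — start a fresh one
        self.counts[key] = (start, count + 1)
        return count + 1 <= self.limit

def handle(request, limiter, valid_tokens):
    """Run the middleware chain; return an HTTP-style status code."""
    if request.get("token") not in valid_tokens:
        return 401  # auth middleware rejects before any backend work
    if not limiter.allow(request["user"]):
        return 429  # rate limit middleware rejects
    return 200  # request would be routed to the backend service here
```

Note the ordering: auth runs before rate limiting so unauthenticated traffic never consumes a user's quota, and both run before routing so rejected requests never touch a backend.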

Two LB layers in practice:

[Clients]
    ↓
[Load Balancer]       ← distributes across gateway instances (AWS ALB)
    ↓
[API Gateway cluster] ← stateless, scales horizontally
    ↓
[Backend Services]    ← gateway load-balances across service instances

The gateway is stateless — no session data stored in the gateway itself. This makes it trivially horizontally scalable: add more instances behind the ALB.

Routing example:

/users/*      →  user-service:8080
/orders/*     →  order-service:8081
/payments/*   →  payment-service:8082
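The routing table above reduces to a longest-prefix match. A minimal sketch — the service addresses are the examples from this section, not real hosts:

```python
# Prefix -> backend address, as in the routing example above
ROUTES = {
    "/users/": "user-service:8080",
    "/orders/": "order-service:8081",
    "/payments/": "payment-service:8082",
}

def route(path):
    # Longest matching prefix wins, so /orders/123/items -> order-service
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)] if matches else None
```

Longest-prefix wins so that a more specific route (say `/orders/exports/`) could later shadow a general one without reordering the table.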

Protocol translation: Gateway can accept HTTP from clients and call backends over gRPC. Backend services use the most efficient protocol without the client needing to know.

Global distribution: For global users, deploy gateway instances in multiple regions + GeoDNS to route each user to the nearest gateway — same strategy as CDN edge nodes.

Technology options:

| | AWS API Gateway | Kong | Nginx (as gateway) | No gateway |
|---|---|---|---|---|
| Auth / JWT | Built-in | Plugin | Lua scripts | App handles it |
| Rate limiting | Built-in | Plugin | limit_req module | App handles it |
| Cost | Per-request (~$3.50/million) | Infrastructure cost | Infrastructure cost | Zero |
| Latency added | ~10 ms | ~2–5 ms | ~1–2 ms | 0 |
| Managed | Fully | Self-managed | Self-managed | — |
| Protocol translation | HTTP/WebSocket only | HTTP, gRPC | HTTP, TCP | — |
| Use when | AWS ecosystem, serverless, public API | High volume, plugin ecosystem, gRPC | Already using Nginx as LB | Single service, internal API, low scale |

When you don't need one: Single-service apps, internal APIs, or when your app server already handles auth and rate limiting cleanly. A gateway adds a hop and complexity — only add it when cross-cutting logic would otherwise be duplicated across many services.

Real systems — what the gateway actually does:

| System | Scale | Key Gateway Responsibilities |
|---|---|---|
| Netflix | ~2B API req/day | Dynamic routing, A/B and canary testing, auth |
| Uber | 2000+ microservices | Multi-client routing (rider vs driver app), real-time WebSocket, geo-routing to regional services |
| Twitter | 500M+ tweets/day | OAuth auth, heavy timeline caching, public API rate limiting per key |
| E-commerce | Flash sale peaks | Rate limiting during flash sales, product catalog caching, `/products/*` `/orders/*` `/cart/*` routing |
| Chat (WhatsApp-style) | Millions of connections | WebSocket connection management, JWT auth, per-user message rate limiting |
| Ride sharing | Continuous location updates | Separate rider/driver routing, real-time WebSocket for location, geo-routing |

Interview tip: Say "I'll add an API Gateway for routing and middleware" then move on. Don't over-explain the gateway — it's not the interesting part of the design. You can draw the Load Balancer and API Gateway as a single entry-point box if the interviewer doesn't ask about them specifically.

Rule of thumb: Microservices or multiple client types → API Gateway. Single service or internal API → skip it.


Object Storage

For files, images, and video. All major clouds have a native object store — pick based on your cloud ecosystem, then consider cost and compliance.

By cloud provider:

| | AWS | GCP | Azure |
|---|---|---|---|
| Object storage | S3 | Cloud Storage (GCS) | Blob Storage |
| Presigned URLs | S3 Presigned URLs | GCS Signed URLs | SAS (Shared Access Signature) tokens |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| CDN signed URLs | CloudFront Signed URLs | Cloud CDN Signed URLs | Azure CDN token auth |
| Managed encryption keys | SSE-S3 (default) | Google-managed keys (default) | Azure Storage Service Encryption (default) |
| Customer-managed keys | SSE-KMS (AWS KMS) | Cloud KMS | Azure Key Vault |
| HSM | AWS CloudHSM | Cloud HSM | Azure Dedicated HSM |
| Serverless trigger on upload | S3 → Lambda | GCS → Cloud Functions | Blob Storage → Azure Functions |
| Message queue | SQS | Cloud Pub/Sub | Azure Service Bus |

The patterns covered in this guide — presigned URLs, two-bucket strategy, signed CDN URLs, multipart upload — are identical across all three. Only the API names differ.
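The core idea behind all three providers' signed URLs can be shown in a few lines of stdlib Python. This is a deliberately simplified sketch — real presigned URLs sign more fields (HTTP method, headers, region) and use provider-specific key-derivation schemes; the `SECRET` and parameter names here are stand-ins:

```python
import hashlib
import hmac
import time

SECRET = b"demo-signing-key"  # stand-in for the provider's credentials

def presign(path, expires_in=3600, now=None):
    """Embed an expiry and an HMAC in the URL so the server can verify
    access statelessly — no database lookup per download."""
    now = time.time() if now is None else now
    expires = int(now) + expires_in
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def verify(url, now=None):
    now = time.time() if now is None else now
    path, query = url.split("?", 1)
    params = dict(kv.split("=") for kv in query.split("&"))
    if now > int(params["expires"]):
        return False  # link has expired
    expected = hmac.new(SECRET, f"{path}:{params['expires']}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, params["sig"])
```

Because the signature covers the expiry, a client can't extend a link's lifetime by editing the `expires` parameter — any tampering invalidates the HMAC.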

Cross-cloud comparison — object storage itself:

| | Amazon S3 | Google Cloud Storage | Azure Blob Storage | Cloudflare R2 | Self-hosted (MinIO) |
|---|---|---|---|---|---|
| Egress cost | ~$0.09/GB | ~$0.08/GB | ~$0.087/GB | Free | Infrastructure only |
| Latency | 10–100 ms | 10–100 ms | 10–100 ms | Similar to S3 | Depends on hardware |
| Scale | Unlimited | Unlimited | Unlimited | Unlimited | Limited by hardware |
| Ecosystem | Massive | Strong (GCP native) | Strong (Azure native) | S3-compatible API | S3-compatible API |
| Use when | AWS ecosystem | GCP ecosystem | Azure ecosystem | Egress cost is a concern | On-prem / compliance |

Rule of thumb: Use whichever matches your cloud ecosystem — S3 on AWS, GCS on GCP, Blob Storage on Azure. Switch to Cloudflare R2 if egress costs are significant regardless of cloud. Self-host only for compliance or data residency requirements.


Search

Match your search solution to dataset size and operational tolerance — zero infra for small datasets, Elasticsearch for billions of documents.

| | PostgreSQL Full-Text | Elasticsearch | Typesense / Meilisearch |
|---|---|---|---|
| Setup | Zero — already in your DB | Heavy — separate cluster | Lightweight |
| Scale | Up to ~10M documents comfortably | Billions of documents | Millions of documents |
| Fuzzy / typo tolerance | No | Yes (edit distance) | Yes (built-in) |
| Relevance tuning | Basic | Highly configurable | Good defaults, less configurable |
| Latency | 50–200 ms | 5–50 ms | 1–10 ms |
| Built-in caching | No | Yes (filter cache + request cache) | Limited |
| Operational cost | None | High (JVM, cluster management) | Low |

When to use each:

  • PostgreSQL FTS: Search on < 10M rows, exact-word matching is acceptable, no typo tolerance needed. Already on Postgres — zero extra infra. Use tsvector + GIN index; LIKE '%keyword%' forces a full table scan and should never be used at scale.
  • Elasticsearch: Complex search, fuzzy matching, faceted filtering, log analytics, autocomplete at scale. Worth the ops cost.
  • Typesense / Meilisearch: Fast autocomplete/search for smaller datasets, typo tolerance out of the box, much simpler than Elasticsearch.
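The reason an index beats `LIKE '%keyword%'` is structural: a GIN index over a tsvector is conceptually an inverted index — a map from word to the rows containing it — so a query touches only matching rows instead of scanning every one. A naive sketch of that idea (lowercase-and-split tokenization only; Postgres additionally applies stemming and stop-word removal):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns word -> set of doc_ids (inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    # One dictionary lookup instead of scanning every document
    return index.get(word.lower(), set())
```

Lookup cost is proportional to the number of matching documents, not the table size — the same asymptotic win tsvector + GIN gives you over a sequential `LIKE` scan.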

Keeping Elasticsearch in sync with PostgreSQL:

The hardest part of adding Elasticsearch is keeping it consistent with your source-of-truth DB. Options ranked best to worst:

| Approach | Lag | Risk | Verdict |
|---|---|---|---|
| CDC (Debezium) → ES directly | ~seconds | Low — captures WAL-level changes, including non-app writes | Best default |
| CDC (Debezium) → Kafka → ES consumer | ~seconds | Very low — Kafka buffers; ES consumer can catch up after lag | Best if Kafka already in stack |
| Scheduled batch sync | Minutes | Misses hard deletes; high latency | Acceptable only for non-real-time search |
| Dual write (app writes both) | ~ms | Partial failure risk — DB write succeeds, ES write fails → silent divergence | Avoid |

CDC wins because it captures changes from anywhere — DB migrations, admin scripts, other services — not just your application code. Debezium reads the PostgreSQL WAL, publishes change events, and an ES consumer indexes them. Near real-time with no application code changes required.
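The consumer side of that pipeline is a small apply loop. A sketch assuming a simplified event shape — real Debezium events carry `before`/`after` payloads plus schema and source metadata, and the op codes `"c"`/`"u"`/`"d"` do come from Debezium's envelope:

```python
def apply_change(index, event):
    """Apply one Debezium-style change event to a search index (dict as a
    stand-in for an ES index keyed by primary key)."""
    op, key = event["op"], event["key"]
    if op in ("c", "u"):
        index[key] = event["after"]   # upsert the new row image
    elif op == "d":
        index.pop(key, None)          # hard deletes propagate too — unlike batch sync
```

Because the WAL already orders events per row, applying them in stream order keeps the index convergent with the database without any coordination in application code.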


Quick Reference — Decision Summary

One row per requirement — the fastest path from what you need to which technology to name in an interview.

| Need | Pick |
|---|---|
| General relational data, ACID, complex queries | PostgreSQL |
| Horizontal SQL scale, multi-region | CockroachDB |
| High write throughput, time-ordered, massive scale | Cassandra |
| Simple key-value, AWS, auto-scale | DynamoDB |
| Flexible nested documents | MongoDB |
| Cache + data structures + real-time | Redis |
| Time-series metrics / IoT | InfluxDB or TimescaleDB |
| High-throughput event stream, replay, multiple consumers | Kafka |
| Simple async tasks, managed, AWS | SQS |
| Complex message routing, priority | RabbitMQ |
| Real-time broadcast, no durability needed | Redis Pub/Sub |
| File / image / video storage | S3 (default) |
| Full-text search, large dataset | Elasticsearch |
| Fast autocomplete, simple setup | Typesense |
| Web / API traffic load balancing (AWS) | AWS ALB |
| Ultra-low latency, UDP, static IP | AWS NLB |
| Self-managed load balancing | Nginx or HAProxy |
| Public API, auth + rate limiting, AWS | AWS API Gateway |
| High request volume, plugin ecosystem | Kong |
| Auth + rate limiting, already using Nginx | Nginx (as gateway) |
| Long-running workflow, human delays, fault-tolerant | Temporal |
| Simple AWS-native workflow, Lambda orchestration | AWS Step Functions |