Technology Selection

Which technology to pick and why. Use this after NFRs tell you what the system needs — these are the decision criteria that map requirements to specific tools.


Database

Choose the right database for your access pattern, consistency requirements, and scale. Default to SQL unless you have a clear reason not to.

SQL vs NoSQL — Pick One First

The most important database decision — make this call before comparing specific products.

| Choose SQL when... | Choose NoSQL when... |
|---|---|
| Data is structured and relational | Data is unstructured, hierarchical, or variable schema |
| You need ACID transactions | You need massive write scale or eventual consistency is OK |
| Complex queries, joins, aggregations | Simple key-value or document lookups |
| Strong consistency required | Horizontal scale is the top priority |
| Data fits on one server or with replicas | Data must be sharded across many nodes |
| Team knows SQL | Column-family or document model fits the access pattern better |

Rule of thumb: Default to SQL (PostgreSQL). Only switch to NoSQL when you have a clear reason — scale, schema flexibility, or access pattern that SQL handles poorly.


SQL Databases

Relational databases with ACID guarantees — pick based on scale needs and whether you're on AWS or self-hosted.

| DB | Best For | Limits | Avoid When |
|---|---|---|---|
| PostgreSQL | General purpose. Complex queries, JSONB, full-text search, geospatial (PostGIS). | ~1–2 TB comfortable, ~5–10K writes/sec before sharding | Need extreme write throughput or massive horizontal scale |
| MySQL | High-read web apps. Slightly faster reads than Postgres on simple queries. Battle-tested. | Similar to PostgreSQL | Need advanced features (window functions, JSONB) — use Postgres instead |
| CockroachDB | Distributed SQL. Horizontal scale with ACID. Multi-region. | Higher latency than single-node Postgres (~5–10 ms vs 1–2 ms) | Latency is critical and data fits on one node |
| Amazon Aurora | Managed PostgreSQL/MySQL. Up to 5× the throughput of standard MySQL. Auto-scales storage. | AWS lock-in. More expensive. | On-prem or multi-cloud requirement |

NoSQL Databases

Databases optimized for specific access patterns — each trades ACID guarantees for a different combination of scale, flexibility, or query speed.

| DB | Model | Best For | Limits | Avoid When |
|---|---|---|---|---|
| Cassandra | Wide-column | High write throughput, time-series, event logs, messaging. Scales to PB. | Eventual consistency only. No joins. Query patterns must be known upfront — design tables around queries. | Strong consistency needed, or ad-hoc queries |
| DynamoDB | Key-value + document | Serverless scale, unpredictable traffic, AWS ecosystem. Auto-scales reads/writes. | AWS lock-in. Expensive at high scale. Limited query flexibility. | Complex queries or non-AWS stack |
| MongoDB | Document (JSON) | Flexible/nested data, content management, catalogs, user profiles. | Consistency weaker than SQL. Not great for relational data. | Highly relational data with lots of joins |
| Redis | Key-value + data structures | Cache, sessions, leaderboards, pub/sub, geo, rate limiting. Sub-ms reads. | Data must fit in RAM. Not a primary DB for large datasets. | Durability is critical and there's no backup strategy |
| InfluxDB / TimescaleDB | Time-series | Metrics, monitoring, IoT sensor data, financial tick data. | Not general purpose. Queries outside time-range patterns are awkward. | Data isn't time-series |

Database Decision Tree

Follow this path from requirements to a specific database choice in under 30 seconds.

Need ACID transactions across multiple tables?
├── Yes → SQL (PostgreSQL default)
│         ├── Need horizontal write scale? → CockroachDB or shard PostgreSQL
│         └── Managed + AWS? → Aurora
└── No → What's the access pattern?
          ├── High write throughput + time-ordered? → Cassandra
          ├── Key lookups + AWS + auto-scale? → DynamoDB
          ├── Flexible/nested documents? → MongoDB
          ├── Cache + data structures + sub-ms? → Redis
          └── Timestamped metrics/sensor data? → InfluxDB / TimescaleDB
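The tree above can be encoded as a tiny helper to make the branching explicit. This is an illustrative sketch only — the flag names and access-pattern labels are made up for this example, not a real library API:

```python
def pick_database(needs_acid: bool, access_pattern: str) -> str:
    """Walk the decision tree: ACID first, then access pattern."""
    if needs_acid:
        # SQL branch; CockroachDB for horizontal write scale, Aurora for managed AWS
        return "PostgreSQL"
    return {
        "write_heavy_time_ordered": "Cassandra",
        "key_lookup_aws_autoscale": "DynamoDB",
        "flexible_documents": "MongoDB",
        "cache_data_structures": "Redis",
        "time_series_metrics": "InfluxDB / TimescaleDB",
    }.get(access_pattern, "PostgreSQL")  # when in doubt, default to SQL
```

The fall-through default mirrors the rule of thumb: unless an access pattern clearly demands NoSQL, answer PostgreSQL.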

Message Queue

Choose the right queue or stream based on durability needs, throughput, routing complexity, and whether you want managed infrastructure.

Kafka vs SQS vs RabbitMQ vs Redis Pub/Sub

Four queue options compared across the dimensions that matter most for picking one in an interview.

| | Kafka | SQS | RabbitMQ | Redis Pub/Sub |
|---|---|---|---|---|
| Durability | Yes — messages persisted to disk | Yes — managed by AWS | Yes | No — fire and forget |
| Replay | Yes — consumers can replay from any offset | No — consumed messages are deleted | No | No |
| Multiple consumers | Yes — each consumer group gets all messages | No — each message goes to one consumer | Yes — via exchanges/routing | Yes — all subscribers get each message |
| Throughput | Millions/sec | Nearly unlimited (standard); ~3K/sec FIFO with batching | ~50K/sec | Millions/sec |
| Latency | 5–15 ms | ~1–10 ms | ~1–5 ms | < 1 ms |
| Ordering | Per partition | Best-effort (FIFO queues: strict but slower) | Per queue | Not guaranteed |
| Complexity | High — brokers, ZooKeeper/KRaft, partitions | Low — fully managed | Medium | Very low |

When to use each:

  • Kafka: High throughput event streaming. Multiple independent consumers. Replay needed (audit, ML pipelines). Event sourcing. Use for: Uber GPS, analytics pipelines, activity feeds.
  • SQS: Simple async task queue. AWS ecosystem. Don't want to manage infrastructure. Each task processed by one worker. Use for: email sending, image resizing, background jobs.
  • RabbitMQ: Complex routing rules. Priority queues. Message TTL. Fan-out with filtering. Use for: task routing across microservices, job prioritization.
  • Redis Pub/Sub: Real-time fan-out with no durability requirement. Extremely low latency. Use for: WebSocket fan-out, live notifications, chat room broadcasting.

Rule of thumb: Kafka for streams, SQS for tasks, RabbitMQ for routing, Redis Pub/Sub for real-time broadcast.
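The fire-and-forget semantics that separate Redis Pub/Sub from the durable options can be shown with a minimal in-memory stand-in — a sketch, not Redis itself; the class and method names are illustrative (though `publish` returns a receiver count like Redis's `PUBLISH`):

```python
from collections import defaultdict

class PubSub:
    """In-memory stand-in for Redis Pub/Sub semantics: every *current*
    subscriber receives every message; nothing is persisted, so a
    subscriber that connects later sees nothing (fire and forget)."""

    def __init__(self):
        self._subs = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def publish(self, channel, message):
        # Deliver to all current subscribers; like Redis PUBLISH,
        # return how many receivers got the message.
        for cb in self._subs[channel]:
            cb(message)
        return len(self._subs[channel])
```

A publish to a channel with zero subscribers returns 0 and the message is simply gone — exactly why Pub/Sub is wrong for tasks that must not be lost, and fine for live broadcast.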


Cache

Redis is the default choice for almost every caching use case — pick Memcached only if you need pure string throughput with no other features.

Redis vs Memcached

Feature-by-feature comparison — Redis wins in almost every column, but Memcached has lower overhead for pure get/set workloads.

| | Redis | Memcached |
|---|---|---|
| Data structures | Strings, Lists, Sets, Sorted Sets, Hashes, Geo, Streams | Strings only |
| Persistence | Optional (RDB snapshots, AOF log) | None |
| Pub/Sub | Yes | No |
| Cluster / sharding | Redis Cluster built-in | Client-side sharding only |
| Threads | Single-threaded commands (multi-threaded I/O since v6) | Multi-threaded |
| Throughput | ~100K–1M ops/sec | ~1M+ ops/sec for simple get/set |
| Memory overhead | Higher (richer data structures) | Lower |

When to choose Memcached: Pure simple string cache with extremely high get/set throughput and no need for any Redis features. Rare — Redis handles this fine for most systems.

When to choose Redis: Everything else. Sorted sets, pub/sub, geo, persistence, Lua scripts, cluster support. Redis is the default choice.
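Sorted sets are the canonical "Redis wins" feature — a leaderboard is one `ZADD` per score update and one `ZREVRANGE` per read. A toy in-memory stand-in, assuming scores fit in one process (a real system would use Redis itself; the method names mirror the Redis commands):

```python
class Leaderboard:
    """Toy stand-in for a Redis sorted set (ZADD / ZREVRANGE WITHSCORES)."""

    def __init__(self):
        self._scores = {}  # member -> score; ZADD overwrites the old score

    def zadd(self, member, score):
        self._scores[member] = score

    def zrevrange(self, start, stop):
        # Highest score first, inclusive stop index like Redis
        ranked = sorted(self._scores.items(), key=lambda kv: -kv[1])
        return ranked[start:stop + 1]
```

Memcached can only store a serialized blob of the whole leaderboard, which forces a read-modify-write cycle on every update — the structural reason Redis is the default here.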


Load Balancer

Distributes traffic across app servers — the choice depends on protocol, latency requirements, and whether you're on AWS or self-managed.

| | AWS ALB | AWS NLB | Nginx / HAProxy |
|---|---|---|---|
| Protocol | HTTP/HTTPS, WebSocket | TCP/UDP, TLS passthrough | HTTP, TCP, anything |
| Latency | ~1–5 ms | ~100 µs — extremely fast | ~1 ms self-managed |
| Routing | Path-based, host-based, header-based | IP + port only | Highly configurable |
| SSL termination | Yes | No (passthrough) or Yes (TLS listener) | Yes |
| Managed | Fully (AWS) | Fully (AWS) | Self-managed |
| Use when | Default for web apps, APIs, WebSocket | Gaming, VoIP, financial, static IP needed | Not on AWS, or need fine-grained control |

Rule of thumb: Default to AWS ALB for web and API traffic. Use NLB only when you need sub-millisecond latency, UDP, or a static IP. Use Nginx/HAProxy when you're not on AWS or need custom configuration.


API Gateway

A single entry point for all client requests. Its primary purpose is routing — directing requests to the right backend service. Middleware (auth, rate limiting, logging) is secondary. Clients don't need to know your internal service structure.

Request flow:

Request → Validate → Middleware → Route → Backend → Transform → Cache → Response

What it handles (middleware responsibilities):

  • Auth — JWT validation, API keys, OAuth token introspection
  • Rate limiting — per user, per IP, per endpoint (see building_blocks for algorithms)
  • SSL termination — decrypt HTTPS at gateway, plain HTTP internally (offloads CPU from backends)
  • Request/response transformation — HTTP ↔ gRPC protocol translation, header injection, body format conversion
  • Caching — cache full responses for non-user-specific endpoints (e.g. product catalog, public feeds). Never cache user-specific data.
  • Logging and distributed tracing — inject trace IDs, forward to Datadog/Prometheus
  • Circuit breaker — fail fast when a backend service is struggling instead of piling up requests (see Reliability Patterns in building_blocks)
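The middleware pipeline above can be sketched as a chain that either rejects a request or passes it on. This is an illustrative skeleton, not a real gateway: the token check, the fixed-window rate limiter, and the status-code returns are all simplified assumptions (production gateways use JWT validation and usually token-bucket or sliding-window limiting):

```python
import time

class RateLimiter:
    """Fixed-window limiter: at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit, window=60):
        self.limit, self.window = limit, window
        self.counts = {}  # key -> (window_start, count)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired — start a fresh one
        self.counts[key] = (start, count + 1)
        return count + 1 <= self.limit

def handle(request, limiter, valid_tokens):
    """Run the middleware chain; return an HTTP-style status code."""
    if request.get("token") not in valid_tokens:
        return 401  # auth middleware rejects before any backend work
    if not limiter.allow(request["user"]):
        return 429  # rate limit middleware rejects
    return 200  # request would be routed to the backend service here
```

Note the ordering: auth runs before rate limiting so unauthenticated traffic never consumes a user's quota, and both run before routing so rejected requests never touch a backend.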

Two LB layers in practice:

[Clients]
    ↓
[Load Balancer]       ← distributes across gateway instances (AWS ALB)
    ↓
[API Gateway cluster] ← stateless, scales horizontally
    ↓
[Backend Services]    ← gateway load-balances across service instances

The gateway is stateless — no session data stored in the gateway itself. This makes it trivially horizontally scalable: add more instances behind the ALB.

Routing example:

/users/*      →  user-service:8080
/orders/*     →  order-service:8081
/payments/*   →  payment-service:8082
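The routing table above reduces to a longest-prefix match. A minimal sketch — the service addresses are the examples from this section, not real hosts:

```python
# Prefix -> backend address, as in the routing example above
ROUTES = {
    "/users/": "user-service:8080",
    "/orders/": "order-service:8081",
    "/payments/": "payment-service:8082",
}

def route(path):
    # Longest matching prefix wins, so /orders/123/items -> order-service
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)] if matches else None
```

Longest-prefix wins so that a more specific route (say `/orders/exports/`) could later shadow a general one without reordering the table.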

Protocol translation: Gateway can accept HTTP from clients and call backends over gRPC. Backend services use the most efficient protocol without the client needing to know.

Global distribution: For global users, deploy gateway instances in multiple regions + GeoDNS to route each user to the nearest gateway — same strategy as CDN edge nodes.

Technology options:

| | AWS API Gateway | Kong | Nginx (as gateway) | No gateway |
|---|---|---|---|---|
| Auth / JWT | Built-in | Plugin | Lua scripts | App handles it |
| Rate limiting | Built-in | Plugin | limit_req module | App handles it |
| Cost | Per-request (~$3.50/million) | Infrastructure cost | Infrastructure cost | Zero |
| Latency added | ~10 ms | ~2–5 ms | ~1–2 ms | 0 |
| Managed | Fully | Self-managed | Self-managed | — |
| Protocol translation | HTTP/WebSocket only | HTTP, gRPC | HTTP, TCP | — |
| Use when | AWS ecosystem, serverless, public API | High volume, plugin ecosystem, gRPC | Already using Nginx as LB | Single service, internal API, low scale |

When you don't need one: Single-service apps, internal APIs, or when your app server already handles auth and rate limiting cleanly. A gateway adds a hop and complexity — only add it when cross-cutting logic would otherwise be duplicated across many services.

Real systems — what the gateway actually does:

| System | Scale | Key Gateway Responsibilities |
|---|---|---|
| Netflix | ~2B API req/day | Dynamic routing, A/B and canary testing, auth |
| Uber | 2000+ microservices | Multi-client routing (rider vs driver app), real-time WebSocket, geo-routing to regional services |
| Twitter | 500M+ tweets/day | OAuth auth, heavy timeline caching, public API rate limiting per key |
| E-commerce | Flash sale peaks | Rate limiting during flash sales, product catalog caching, `/products/*` `/orders/*` `/cart/*` routing |
| Chat (WhatsApp-style) | Millions of connections | WebSocket connection management, JWT auth, per-user message rate limiting |
| Ride sharing | Continuous location updates | Separate rider/driver routing, real-time WebSocket for location, geo-routing |

Interview tip: Say "I'll add an API Gateway for routing and middleware" then move on. Don't over-explain the gateway — it's not the interesting part of the design. You can draw the Load Balancer and API Gateway as a single entry-point box if the interviewer doesn't ask about them specifically.

Rule of thumb: Microservices or multiple client types → API Gateway. Single service or internal API → skip it.


Object Storage

For files, images, and video. All major clouds have a native object store — pick based on your cloud ecosystem, then consider cost and compliance.

By cloud provider:

| | AWS | GCP | Azure |
|---|---|---|---|
| Object storage | S3 | Cloud Storage (GCS) | Blob Storage |
| Presigned URLs | S3 Presigned URLs | GCS Signed URLs | SAS (Shared Access Signature) tokens |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| CDN signed URLs | CloudFront Signed URLs | Cloud CDN Signed URLs | Azure CDN token auth |
| Managed encryption keys | SSE-S3 (default) | Google-managed keys (default) | Azure Storage Service Encryption (default) |
| Customer-managed keys | SSE-KMS (AWS KMS) | Cloud KMS | Azure Key Vault |
| HSM | AWS CloudHSM | Cloud HSM | Azure Dedicated HSM |
| Serverless trigger on upload | S3 → Lambda | GCS → Cloud Functions | Blob Storage → Azure Functions |
| Message queue | SQS | Cloud Pub/Sub | Azure Service Bus |

The patterns covered in this guide — presigned URLs, two-bucket strategy, signed CDN URLs, multipart upload — are identical across all three. Only the API names differ.
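The core idea behind all three providers' signed URLs can be shown in a few lines of stdlib Python. This is a deliberately simplified sketch — real presigned URLs sign more fields (HTTP method, headers, region) and use provider-specific key-derivation schemes; the `SECRET` and parameter names here are stand-ins:

```python
import hashlib
import hmac
import time

SECRET = b"demo-signing-key"  # stand-in for the provider's credentials

def presign(path, expires_in=3600, now=None):
    """Embed an expiry and an HMAC in the URL so the server can verify
    access statelessly — no database lookup per download."""
    now = time.time() if now is None else now
    expires = int(now) + expires_in
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def verify(url, now=None):
    now = time.time() if now is None else now
    path, query = url.split("?", 1)
    params = dict(kv.split("=") for kv in query.split("&"))
    if now > int(params["expires"]):
        return False  # link has expired
    expected = hmac.new(SECRET, f"{path}:{params['expires']}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, params["sig"])
```

Because the signature covers the expiry, a client can't extend a link's lifetime by editing the `expires` parameter — any tampering invalidates the HMAC.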

Cross-cloud comparison — object storage itself:

| | Amazon S3 | Google Cloud Storage | Azure Blob Storage | Cloudflare R2 | Self-hosted (MinIO) |
|---|---|---|---|---|---|
| Egress cost | ~$0.09/GB | ~$0.08/GB | ~$0.087/GB | Free | Infrastructure only |
| Latency | 10–100 ms | 10–100 ms | 10–100 ms | Similar to S3 | Depends on hardware |
| Scale | Unlimited | Unlimited | Unlimited | Unlimited | Limited by hardware |
| Ecosystem | Massive | Strong (GCP native) | Strong (Azure native) | S3-compatible API | S3-compatible API |
| Use when | AWS ecosystem | GCP ecosystem | Azure ecosystem | Egress cost is a concern | On-prem / compliance |

Rule of thumb: Use whichever matches your cloud ecosystem — S3 on AWS, GCS on GCP, Blob Storage on Azure. Switch to Cloudflare R2 if egress costs are significant regardless of cloud. Self-host only for compliance or data residency requirements.


Search

Match your search solution to dataset size and operational tolerance — zero infra for small datasets, Elasticsearch for billions of documents.

| | PostgreSQL Full-Text | Elasticsearch | Typesense / Meilisearch |
|---|---|---|---|
| Setup | Zero — already in your DB | Heavy — separate cluster | Lightweight |
| Scale | Up to ~10M documents comfortably | Billions of documents | Millions of documents |
| Fuzzy / typo tolerance | No | Yes (edit distance) | Yes (built-in) |
| Relevance tuning | Basic | Highly configurable | Good defaults, less configurable |
| Latency | 50–200 ms | 5–50 ms | 1–10 ms |
| Built-in caching | No | Yes (filter cache + request cache) | Limited |
| Operational cost | None | High (JVM, cluster management) | Low |

When to use each:

  • PostgreSQL FTS: Search on < 10M rows, exact-word matching is acceptable, no typo tolerance needed. Already on Postgres — zero extra infra. Use tsvector + GIN index; LIKE '%keyword%' forces a full table scan and should never be used at scale.
  • Elasticsearch: Complex search, fuzzy matching, faceted filtering, log analytics, autocomplete at scale. Worth the ops cost.
  • Typesense / Meilisearch: Fast autocomplete/search for smaller datasets, typo tolerance out of the box, much simpler than Elasticsearch.
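The reason an index beats `LIKE '%keyword%'` is structural: a GIN index over a tsvector is conceptually an inverted index — a map from word to the rows containing it — so a query touches only matching rows instead of scanning every one. A naive sketch of that idea (lowercase-and-split tokenization only; Postgres additionally applies stemming and stop-word removal):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns word -> set of doc_ids (inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    # One dictionary lookup instead of scanning every document
    return index.get(word.lower(), set())
```

Lookup cost is proportional to the number of matching documents, not the table size — the same asymptotic win tsvector + GIN gives you over a sequential `LIKE` scan.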

Keeping Elasticsearch in sync with PostgreSQL:

The hardest part of adding Elasticsearch is keeping it consistent with your source-of-truth DB. Options ranked best to worst:

| Approach | Lag | Risk | Verdict |
|---|---|---|---|
| CDC (Debezium) → ES directly | ~seconds | Low — captures WAL-level changes, including non-app writes | Best default |
| CDC (Debezium) → Kafka → ES consumer | ~seconds | Very low — Kafka buffers; ES consumer can catch up after lag | Best if Kafka already in stack |
| Scheduled batch sync | Minutes | Misses hard deletes; high latency | Acceptable only for non-real-time search |
| Dual write (app writes both) | ~ms | Partial failure risk — DB write succeeds, ES write fails → silent divergence | Avoid |

CDC wins because it captures changes from anywhere — DB migrations, admin scripts, other services — not just your application code. Debezium reads the PostgreSQL WAL, publishes change events, and an ES consumer indexes them. Near real-time with no application code changes required.
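The consumer side of that pipeline is a small apply loop. A sketch assuming a simplified event shape — real Debezium events carry `before`/`after` payloads plus schema and source metadata, and the op codes `"c"`/`"u"`/`"d"` do come from Debezium's envelope:

```python
def apply_change(index, event):
    """Apply one Debezium-style change event to a search index (dict as a
    stand-in for an ES index keyed by primary key)."""
    op, key = event["op"], event["key"]
    if op in ("c", "u"):
        index[key] = event["after"]   # upsert the new row image
    elif op == "d":
        index.pop(key, None)          # hard deletes propagate too — unlike batch sync
```

Because the WAL already orders events per row, applying them in stream order keeps the index convergent with the database without any coordination in application code.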


Quick Reference — Decision Summary

One row per requirement — the fastest path from what you need to which technology to name in an interview.

| Need | Pick |
|---|---|
| General relational data, ACID, complex queries | PostgreSQL |
| Horizontal SQL scale, multi-region | CockroachDB |
| High write throughput, time-ordered, massive scale | Cassandra |
| Simple key-value, AWS, auto-scale | DynamoDB |
| Flexible nested documents | MongoDB |
| Cache + data structures + real-time | Redis |
| Time-series metrics / IoT | InfluxDB or TimescaleDB |
| High-throughput event stream, replay, multiple consumers | Kafka |
| Simple async tasks, managed, AWS | SQS |
| Complex message routing, priority | RabbitMQ |
| Real-time broadcast, no durability needed | Redis Pub/Sub |
| File / image / video storage | S3 (default) |
| Full-text search, large dataset | Elasticsearch |
| Fast autocomplete, simple setup | Typesense |
| Web / API traffic load balancing (AWS) | AWS ALB |
| Ultra-low latency, UDP, static IP | AWS NLB |
| Self-managed load balancing | Nginx or HAProxy |
| Public API, auth + rate limiting, AWS | AWS API Gateway |
| High request volume, plugin ecosystem | Kong |
| Auth + rate limiting, already using Nginx | Nginx (as gateway) |
| Long-running workflow, human delays, fault-tolerant | Temporal |
| Simple AWS-native workflow, Lambda orchestration | AWS Step Functions |