system-design

Category: Data Risk: Medium risk ★ 4.7 · Rating 4.7/5 (1434) wondelai/skills MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_accessfilesystem_access

name: system-design
description: 'Design scalable distributed systems using structured approaches for load balancing, caching, database scaling, and message queues. Use when the user mentions "system design", "scale this", "high availability", "rate limiter", "design a URL shortener", "system design interview", "capacity planning", or "distributed architecture". Also trigger when estimating infrastructure requirements, choosing between microservices and monoliths, or designing for millions of concurrent users. Covers common system designs and back-of-the-envelope estimation. For data fundamentals, see ddia-systems. For resilience, see release-it.'
license: MIT
metadata:
author: wondelai
version: "1.2.0"

System Design Framework

A structured approach to designing large-scale distributed systems. Apply these principles when architecting new services, reviewing designs, estimating capacity, or preparing for system design discussions.

Core Principle

Start with requirements, not solutions. Jumping to architecture before understanding constraints produces over- or under-engineered systems. Scalable systems are assembled from well-understood building blocks (load balancers, caches, queues, databases, CDNs) — the skill lies in choosing the right blocks, sizing them with estimates, and owning the tradeoffs each choice introduces.

Scoring

Goal: 10/10. Rate any system design 0-10: a 10/10 states requirements explicitly, includes back-of-the-envelope estimates, uses appropriate building blocks, addresses scaling and reliability, and acknowledges tradeoffs. Always state the current score and the specific improvements needed to reach 10/10.

The System Design Framework

Six areas for building reliable, scalable distributed systems:

1. The Four-Step Process

Core concept: Every design follows four stages: (1) understand the problem and establish scope, (2) propose a high-level design and get buy-in, (3) dive deep into critical components, (4) wrap up with tradeoffs and future improvements.

Why it works: Without structure, designs either stay too abstract or get lost in premature detail. The four steps invest time proportionally — broad strokes first, depth where it matters.

Key insights:

Step 1 (~5-10 min): clarifying questions, functional and non-functional requirements, agreed scale (DAU, QPS, storage)
Step 2 (~15-20 min): high-level diagram with APIs, services, data stores, data flow arrows
Step 3 (~15-20 min): design the 2-3 hardest or most critical components in detail
Step 4 (~5 min): tradeoffs, bottlenecks, future improvements
Never skip Step 1 — ambiguous scope wastes all downstream effort; get explicit agreement on assumptions

Code applications:

Context	Pattern	Example
New service kickoff	One-page design doc covering all four steps before coding	Requirements, API contract, data model, capacity estimate, then implementation
Architecture review	Walk reviewers through the steps sequentially	Scope, diagram, deep-dive on riskiest component, open questions
Incident postmortem	Trace the failure through the four-step lens	Which requirement was missed? Which block failed? What tradeoff bit us?

See: references/four-step-process.md

2. Back-of-the-Envelope Estimation

Core concept: Use powers of two, latency numbers, and simple arithmetic to estimate QPS, storage, bandwidth, and server count before committing to an architecture.

Why it works: Estimation prevents over-provisioning (wasted money) and under-provisioning (outages under load). A 2-minute calculation can save weeks of rework.

Key insights:

Powers of two: 2^10 ≈ 1 thousand, 2^20 ≈ 1 million, 2^30 ≈ 1 billion, 2^40 ≈ 1 trillion
Latency: memory read ~100 ns, SSD read ~100 us, disk seek ~10 ms, same-datacenter round trip ~0.5 ms, cross-continent ~150 ms
Availability nines: 99.9% = 8.77 hours downtime/year; 99.99% = 52.6 minutes/year
QPS: DAU x actions-per-day / 86,400 seconds; peak is typically 2-5x average
Storage: records-per-day x record-size x retention
Round aggressively — the goal is order of magnitude, not precision

Code applications:

Context	Pattern	Example
Capacity planning	Estimate QPS, multiply by growth factor	100M DAU x 5 actions / 86400 = ~5,800 QPS avg, ~30K peak
Storage budgeting	Per-record size x volume x retention	500M tweets/day x 300 bytes x 365 days = ~55 TB/year
SLA definition	Convert nines to allowed downtime	Four nines = ~52 minutes downtime per year

See: references/estimation-numbers.md

3. Building Blocks

Core concept: Scalable systems are assembled from a standard toolkit: DNS, CDN, load balancers, reverse proxies, application servers, caches, message queues, and consistent hashing.

Why it works: Each block solves a specific scaling or reliability problem. Knowing when and why to introduce each prevents both premature complexity and avoidable bottlenecks.

Key insights:

Load balancers: L4 (transport layer — fast, simple) vs L7 (application layer — content-aware routing)
Cache layers: client, CDN, web server, application (Redis/Memcached), database query cache
Cache strategies: cache-aside (app manages), read-through, write-through (synchronous), write-behind (asynchronous)
Message queues (Kafka, RabbitMQ, SQS): decouple producers from consumers, absorb spikes, enable async processing
Consistent hashing: distributes keys across nodes with minimal redistribution when nodes change

Code applications:

Context	Pattern	Example
Read-heavy workload	Cache-aside Redis in front of database	Cache user profiles with TTL; invalidate on write
Traffic spikes	Message queue between API and workers	Enqueue image-resize jobs; workers pull at their own pace
Global users	CDN for static assets	Serve JS/CSS/images from edge; origin serves only API
Uneven load	Consistent hashing for shard assignment	Adding a node moves only ~1/n keys

See: references/building-blocks.md

4. Database Design and Scaling

Core concept: Choose SQL vs NoSQL based on data shape and access patterns; scale vertically first, then horizontally (replication and sharding) when vertical limits are reached.

Why it works: The database is usually the first bottleneck. Understanding replication, sharding, and denormalization tradeoffs delays expensive re-architectures and makes growth deliberate.

Key insights:

Vertical scaling is simpler but has a ceiling; horizontal is harder but nearly unlimited
Replication: leader-follower (one writer, many readers) for read-heavy; multi-leader for multi-region writes
Sharding: hash-based (even distribution, hard range queries), range-based (easy ranges, hotspot risk), directory-based (flexible, extra lookup)
SQL for ACID transactions, joins, defined schema; NoSQL for flexible schema, horizontal scale, very high write throughput
Denormalization trades storage and write complexity for read speed — use when reads dominate and data changes rarely
Celebrity/hotspot problem: one hot shard needs secondary partitioning or a cache layer

Code applications:

Context	Pattern	Example
Read-heavy API	Leader-follower with read replicas	Reads to replicas, writes to leader; accept slight lag
User data at scale	Hash-based sharding on user_id	hash(user_id) % num_shards; even, independent shards
Analytics dashboard	Denormalized materialized views	Pre-join and aggregate nightly; serve from materialized table

See: references/database-scaling.md

5. Common System Designs

Core concept: Most systems are variations of a small set of well-known designs: URL shortener, rate limiter, notification system, news feed, chat, search autocomplete, web crawler, unique ID generator.

Why it works: A mental library of known designs lets you recognize which pattern a new problem resembles and adapt it, rather than inventing from scratch.

Key insights:

URL shortener: base62 encoding, key-value store, 301 vs 302 redirect tradeoff (caching vs analytics)
Rate limiter: token bucket or sliding window at the gateway; return 429 with Retry-After
News feed: fanout-on-write (push at post time) vs fanout-on-read (pull at read time); hybrid for celebrities
Chat: WebSocket for real-time bidirectional messages, queue for delivery guarantees, heartbeat presence service
Autocomplete: trie of top-k frequent queries; precompute and cache popular prefixes
Web crawler: BFS with URL frontier, politeness (robots.txt, per-domain rate limit), dedup via content hash
Unique IDs: UUID (simple, no coordination) vs Snowflake (64-bit, time-sortable, datacenter-aware)

Code applications:

Context	Pattern	Example
Short link service	Base62-encode auto-increment ID or hash	`https://short.ly/a1B2c3` maps to a key-value row
API protection	Token bucket at gateway	100 tokens/min per key; steady refill; reject with 429
Social feed	Hybrid fanout	Precompute feeds for <10K-follower accounts; merge celebrity posts at read time

See: references/common-designs.md

6. Reliability and Operations

Core concept: A system is only as good as its ability to stay up, recover, and be observed. Health checks, monitoring, logging, and deployment strategies are first-class design concerns, not afterthoughts.

Why it works: Production systems fail in ways diagrams never predict. Operational readiness — metrics, alerts, rollback plans, redundancy — determines whether a failure is a blip or an outage.

Key insights:

Health checks: liveness (is the process alive?) and readiness (can it serve traffic?) — Kubernetes uses both
Three pillars of observability: metrics (Prometheus, Datadog), logging (ELK, CloudWatch), tracing (Jaeger, Zipkin)
Deployments: rolling (gradual), blue-green (instant switch between identical environments), canary (small percentage first)
Disaster recovery: RPO (acceptable data loss) and RTO (acceptable recovery time) drive backup and failover strategy
Multi-datacenter: active-passive (failover) or active-active (requires data sync and conflict resolution)
Autoscaling: scale on CPU, memory, queue depth, or custom metrics; always set min and max counts

Code applications:

Context	Pattern	Example
Zero-downtime deploy	Blue-green with health check gates	Switch to green after checks pass; keep blue as instant rollback
Gradual rollout	Canary with metric comparison	5% traffic to new version; compare errors and latency; promote or rollback
Data safety	Define RPO/RTO, implement accordingly	RPO 1 hour = hourly backups; RTO 5 min = automated failover

See: references/reliability-operations.md

Common Mistakes

Mistake	Why It Fails	Fix
Architecture before requirements	Solves the wrong problem, misses constraints	Spend the first 5-10 minutes on scope: features, scale, SLA
No estimation	Provisioning off by orders of magnitude	Estimate QPS, storage, bandwidth before choosing components
Single point of failure	One component takes down the system	Redundancy at every layer: multi-server, multi-AZ, multi-region
Premature sharding	Huge operational complexity before it's needed	Vertical first, read replicas, cache aggressively, shard last
Caching without invalidation	Stale data causes bugs and confusion	Define TTL; cache-aside with explicit invalidation on writes
Synchronous calls everywhere	One slow service cascades latency to all callers	Queues for non-latency-critical paths; timeouts on sync calls
Ignoring hotspots	One shard or key hammered, others idle	Detect hot keys; add secondary partitioning or local caches
No monitoring or alerting	Users find failures before you do	Instrument metrics, logs, and traces from day one

Quick Diagnostic

Question	If No	Action
Are functional and non-functional requirements listed?	Design rests on assumptions	Write down features, DAU, QPS, storage, latency and availability SLAs
Is there a QPS and storage estimate?	Capacity is a guess	DAU x actions / 86400 for QPS; records x size x retention for storage
Is every component redundant?	Single points of failure	Add replicas, failover, or multi-AZ per component
Is the database scaling strategy defined?	You hit a wall under growth	Vertical first, then read replicas, then sharding with a clear shard key
Is there a cache for read-heavy paths?	Database takes unnecessary load	Redis/Memcached cache-aside with defined TTL
Are async paths using queues?	Tight coupling, cascading failures	Decouple with Kafka/SQS for jobs, notifications, analytics
Is there a monitoring and alerting plan?	Blind to production failures	Define metrics, log aggregation, tracing, alert thresholds
Is the deployment strategy defined?	Risky all-at-once releases	Rolling, blue-green, or canary with automated rollback

Reference Files

four-step-process.md: The complete four-step process with time allocation, example questions, and tips per stage
estimation-numbers.md: Powers of two, latency numbers, availability nines, QPS/storage/bandwidth worked examples
building-blocks.md: DNS, CDN, load balancers, caching strategies, message queues, consistent hashing
database-scaling.md: SQL vs NoSQL, replication, sharding strategies, denormalization, selection guide
common-designs.md: URL shortener, rate limiter, news feed, chat, autocomplete, web crawler, unique ID generator
reliability-operations.md: Health checks, monitoring, logging, deployment strategies, disaster recovery, autoscaling

About the Author

Alex Xu is a software engineer who previously worked at Twitter, Apple, and Oracle, and the creator of ByteByteGo. His two-volume System Design Interview series, with over 500,000 copies sold, turned system design into a learnable, repeatable skill through structured thinking, estimation, and clear communication.

system-design

System Design Framework

Core Principle

Scoring

The System Design Framework

1. The Four-Step Process

2. Back-of-the-Envelope Estimation

3. Building Blocks

4. Database Design and Scaling

5. Common System Designs

6. Reliability and Operations

Common Mistakes

Quick Diagnostic

Reference Files

Further Reading

About the Author