ddia-systems

Category: Coding Risk: Medium risk ★ 4.7 · Rating 4.7/5 (1434) wondelai/skills MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_accessfilesystem_access

name: ddia-systems
description: 'Design data systems by understanding storage engines, replication, partitioning, transactions, and consistency models. Use when the user mentions "database choice", "replication lag", "partitioning strategy", "consistency vs availability", "stream processing", "ACID transactions", "eventual consistency", or "LSM tree vs B-tree". Also trigger when choosing between SQL and NoSQL, designing data pipelines, or debugging distributed system consistency issues. Covers data models, batch/stream processing, and distributed consensus. For system design, see system-design. For resilience, see release-it.'
license: MIT
metadata:
author: wondelai
version: "1.2.0"

Designing Data-Intensive Applications Framework

A principled approach to building reliable, scalable, and maintainable data systems. Apply these principles when choosing databases, designing schemas, architecting distributed systems, or reasoning about consistency and fault tolerance.

Core Principle

Data outlives code. Applications are rewritten and frameworks come and go, but data persists for decades -- prioritize the long-term correctness, durability, and evolvability of the data layer. Most applications are data-intensive, not compute-intensive: the hard problems are data volume, complexity, and rate of change, and explicit consistency/availability/latency trade-offs separate robust systems from fragile ones.

Scoring

Goal: 10/10. Rate any data architecture 0-10 against the principles below: deliberate trade-off choices for data models, storage, replication, partitioning, transactions, and pipelines score high; accidental complexity and ignored failure modes score low. Report the current score and the improvements needed to reach 10/10.

The DDIA Framework

Seven domains for reasoning about data-intensive systems:

1. Data Models and Query Languages

Core concept: The data model shapes how you think about the problem. Relational, document, and graph models each impose different constraints and enable different query patterns.

Why it works: Choosing the wrong data model forces application code to compensate for representational mismatch, adding accidental complexity that compounds over time.

Key insights:

  • Relational models excel at many-to-many relationships and ad-hoc queries; document models at one-to-many relationships and locality; graph models at recursive traversals over interconnected data
  • Schema-on-write (relational) catches errors early; schema-on-read (document) offers flexibility
  • Polyglot persistence -- different stores for different access patterns -- is often the right answer
  • Object-relational impedance mismatch is a real cost; document models reduce it for self-contained aggregates

Code applications:

Context Pattern Example
User profiles with nested data Document model for self-contained aggregates Profile, addresses, and preferences in one MongoDB document
Social network connections Graph model for relationship traversal Neo4j Cypher: MATCH (a)-[:FOLLOWS*2]->(b) for friend-of-friend
Financial ledger with joins Relational model for referential integrity PostgreSQL foreign keys between accounts, transactions, entries

See: references/data-models.md for relational/document/graph trade-offs and query language comparisons.

2. Storage Engines

Core concept: Storage engines trade off read performance against write performance. Log-structured engines (LSM trees) optimize writes; page-oriented engines (B-trees) balance reads and writes.

Why it works: Understanding your database's storage engine lets you predict performance characteristics, choose appropriate indexes, and avoid pathological workloads.

Key insights:

  • LSM trees: append-only writes, periodic compaction, excellent write throughput, higher read amplification
  • B-trees: in-place updates, predictable read latency, write amplification from page splits
  • Write amplification (one logical write causing multiple physical writes) matters for SSDs with limited write cycles
  • Column-oriented storage dramatically improves analytical queries through compression and vectorized processing
  • In-memory databases are fast because they avoid encoding overhead, not because they avoid disk

Code applications:

Context Pattern Example
High write throughput LSM-tree engine Cassandra or RocksDB for time-series ingestion at 100K+ writes/sec
Mixed read/write OLTP B-tree engine PostgreSQL B-tree indexes for transactional point lookups
Analytical queries Column-oriented storage ClickHouse or Parquet for scanning billions of rows, few columns

See: references/storage-engines.md for LSM vs B-tree internals, compaction, and column storage.

3. Replication

Core concept: Replication keeps copies of data on multiple machines for fault tolerance, scalability, and latency reduction. The core challenge is handling changes consistently.

Why it works: Every replication strategy trades off consistency, availability, and latency. Making the trade-off explicit prevents subtle anomalies that surface only under load or failure.

Key insights:

  • Single-leader: simple, strong consistency possible, but the leader is a bottleneck and single point of failure
  • Multi-leader: better write availability across data centers, but complex conflict resolution
  • Leaderless: highest availability via quorum reads/writes, but needs careful conflict handling
  • Replication lag causes read-your-writes, monotonic-read, and causality violations
  • Synchronous replication guarantees durability but adds latency; asynchronous risks data loss on failover
  • CRDTs and last-writer-wins resolve conflicts with very different correctness guarantees

Code applications:

Context Pattern Example
Read-heavy web app Single-leader with read replicas PostgreSQL primary + read replicas behind pgBouncer
Multi-region writes Multi-leader replication CockroachDB or Spanner with bounded staleness
Shopping cart availability Leaderless with merge DynamoDB with last-writer-wins or application-level cart merge

See: references/replication.md for lag anomalies, conflict resolution, and CRDTs.

4. Partitioning

Core concept: Partitioning (sharding) distributes data across nodes so each handles a subset, enabling horizontal scaling beyond a single machine.

Why it works: Without partitioning, one node bottlenecks storage and throughput. Effective partitioning spreads load evenly and avoids hotspots.

Key insights:

  • Key-range partitioning supports efficient range scans but risks hotspots on sequential keys
  • Hash partitioning distributes load evenly but destroys sort order, making range queries expensive
  • Local secondary indexes require scatter-gather queries; global secondary indexes require cross-partition updates
  • Hotspots occur even with hashing when a single key is extremely popular (celebrity problem)
  • Rebalancing strategies: fixed partition count, dynamic splitting, or proportional to nodes

Code applications:

Context Pattern Example
Time-series data Key-range partitioning by time + source Partition by (sensor_id, date) to avoid current-day write hotspot
User data at scale Hash partitioning on user ID Cassandra consistent hashing on user_id for even distribution
Celebrity/hot-key problem Key splitting with random suffix Append random digit to hot key, fan out reads across 10 sub-partitions

See: references/partitioning.md for rebalancing, request routing, and secondary index strategies.

5. Transactions and Consistency

Core concept: Transactions provide safety guarantees (ACID) that simplify application code by letting you pretend failures and concurrency don't exist -- within the transaction's scope.

Why it works: Without transactions, every piece of application code must handle partial failures, races, and concurrent modification. Transactions move that complexity into the database, handled correctly once.

Key insights:

  • Isolation levels are a spectrum: read uncommitted, read committed, snapshot isolation, serializable
  • Most databases default to read committed or snapshot isolation -- NOT serializable -- so you must understand the anomalies this permits
  • Write skew: two transactions read the same data, decide, and write different records -- no row lock prevents it
  • Serializable snapshot isolation (SSI) gives full serializability optimistically: no blocking, but aborts on conflict; two-phase locking blocks and deadlocks under contention
  • Distributed transactions (two-phase commit) are expensive and fragile; design around single-partition operations instead

Code applications:

Context Pattern Example
Account balance transfer Serializable transaction BEGIN; UPDATE accounts ... -100 WHERE id=1; UPDATE accounts ... +100 WHERE id=2; COMMIT;
Inventory reservation SELECT FOR UPDATE to prevent write skew SELECT stock FROM items WHERE id = X FOR UPDATE before decrementing
Cross-service operations Saga instead of distributed transaction Charge card, reserve inventory; on failure, run compensating refund

See: references/transactions.md for isolation-level anomalies and serializability techniques.

6. Batch and Stream Processing

Core concept: Batch processing transforms bounded datasets in bulk; stream processing transforms unbounded event streams continuously. Both compute derived data.

Why it works: Separating the system of record from derived data (caches, indexes, materialized views) lets each be optimized independently and rebuilt from source when requirements change.

Key insights:

  • MapReduce is conceptually simple but operationally awkward; dataflow engines (Spark, Flink) generalize it with arbitrary DAGs
  • Change data capture (CDC) turns database writes into a stream downstream systems can consume
  • Stream-table duality: a stream is the changelog of a table; a table is the materialized state of a stream
  • Exactly-once semantics require idempotent operations or transactional output
  • Time windowing (tumbling, hopping, session) is essential for aggregating unbounded streams

Code applications:

Context Pattern Example
Daily analytics pipeline Batch processing with Spark Read day's events from S3, aggregate, write to warehouse
Real-time fraud detection Stream processing with Flink Kafka payment events, rules over 5-second tumbling windows
Syncing search index Change data capture Debezium captures PostgreSQL WAL, Kafka feeds Elasticsearch
Audit trail / event replay Event sourcing Store OrderPlaced, OrderShipped events; rebuild state by replaying

See: references/batch-stream.md for dataflow engines, CDC, and exactly-once semantics.

7. Reliability and Fault Tolerance

Core concept: Faults are inevitable; failures are not. A reliable system continues operating correctly even when individual components fail. Design for faults, not against them.

Why it works: Hardware fails, software has bugs, humans make mistakes. Systems that assume perfect operation are brittle; systems that expect faults are resilient.

Key insights:

  • A fault is one component deviating from spec; a failure is the whole system stopping -- fault tolerance prevents the former becoming the latter
  • Hardware faults are random and independent; software faults are correlated and systematic (more dangerous)
  • Human error is the leading cause of outages -- minimize opportunity for mistakes, maximize ability to recover
  • Timeouts are the fundamental fault detector, but tuning is hard: too short causes false positives, too long delays recovery
  • Safety properties (nothing bad happens) must always hold; liveness (something good eventually happens) may be temporarily violated
  • Byzantine fault tolerance is rarely needed outside blockchain; assume crash-stop or crash-recovery

Code applications:

Context Pattern Example
Service communication Timeouts + retries with backoff retry(max=3, backoff=exponential(base=1s, max=30s)) with jitter
Leader election Consensus algorithm (Raft/Paxos) etcd or ZooKeeper for distributed locks and leader election
Graceful degradation Circuit breaker Resilience4j: open circuit after 50% failures in 10-second window

See: references/fault-tolerance.md for consensus, timeout tuning, and safety/liveness guarantees.

Common Mistakes

Mistake Why It Fails Fix
Choosing a database by popularity Engines have fundamentally different trade-offs Match storage engine to actual read/write patterns
Ignoring replication lag Stale reads, phantom reads, lost updates Implement read-your-writes and monotonic-read guarantees
Distributed transactions everywhere 2PC is slow, fragile; coordinator is a SPOF Design single-partition operations; use sagas across services
Hash partitioning everything Destroys range query ability Key-range partitioning for time-series; composite keys for locality
Assuming serializable isolation Defaults are weaker; write skew appears in production Check the actual default; use explicit locking where needed
Conflating batch and stream Wrong tool adds latency or wasted complexity Match processing model to data boundedness and latency needs
Treating all faults as recoverable Corruption and Byzantine faults need different handling Classify faults; design a recovery strategy per class

Quick Diagnostic

Question If No Action
Can you explain why you chose this database over alternatives? Choice was familiarity, not requirements Evaluate data model fit, read/write ratio, consistency needs, scaling path
Do you know your database's default isolation level? Latent concurrency bugs Check docs; test for write skew and phantom reads
Is your replication strategy explicitly chosen? Implicit consistency/durability assumptions Document sync vs async, failover behavior, lag tolerance
Can your system handle a hot partition key? One popular entity can down the cluster Add key-splitting or load shedding for hot keys
Do you separate system of record from derived data? Every change requires migrating everything Introduce CDC or event sourcing to decouple
Are timeouts and retries tuned, not defaulted? Cascading failures or needless delays Measure p99; set timeouts above p99, below cascade threshold
Have you tested failover in production conditions? Recovery plan is theoretical Run chaos experiments: kill leaders, partition networks, fill disks

Reference Files

  • data-models.md: Relational vs document vs graph models, schema-on-read vs schema-on-write, query languages, polyglot persistence
  • storage-engines.md: LSM trees vs B-trees, write amplification, compaction, column-oriented storage, in-memory databases
  • replication.md: Single-leader, multi-leader, leaderless replication, replication lag, conflict resolution, CRDTs
  • partitioning.md: Key-range vs hash partitioning, secondary indexes, rebalancing, request routing, hotspots
  • transactions.md: ACID, isolation levels, write skew, two-phase locking, SSI, distributed transactions
  • batch-stream.md: MapReduce, dataflow engines, event sourcing, CDC, stream-table duality, exactly-once semantics
  • fault-tolerance.md: Faults vs failures, reliability metrics, timeouts, consensus, safety and liveness guarantees

Further Reading

For the complete treatment with detailed diagrams and research references:

About the Author

Martin Kleppmann is a distributed-systems researcher at the University of Cambridge and a former engineer at LinkedIn and Rapportive, known for his work on CRDTs and local-first software. His book Designing Data-Intensive Applications (2017) is the definitive reference for engineers building data systems, praised for making distributed-systems concepts accessible and practical.