release-it

Category: Design Risk: Medium risk ★ 4.7 · Rating 4.7/5 (1434) wondelai/skills MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_accessfilesystem_access

name: release-it
description: 'Build production-ready systems with stability patterns: circuit breakers, bulkheads, timeouts, and retry logic. Use when the user mentions "production outage", "circuit breaker", "timeout strategy", "deployment pipeline", "chaos engineering", "bulkhead pattern", "retry with backoff", or "health checks". Also trigger when designing resilient microservices, planning zero-downtime deployments, or investigating cascading failure scenarios. Covers capacity planning, health checks, and anti-fragility patterns. For data systems, see ddia-systems. For system architecture, see system-design.'
license: MIT
metadata:
author: wondelai
version: "1.2.0"

Release It! Framework

Framework for designing, deploying, and operating production-ready software. The software that passes QA is not the software that survives production — production is hostile, and systems must expect and handle failure at every level.

Core Principle

Every system will eventually be pushed beyond its design limits. The question is not whether failures happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct — it is resilient, observable, and operates through partial failures without human intervention.

Scoring

Goal: 10/10. When reviewing or creating production systems, rate them 0-10 against the principles below — 10/10 means full alignment, lower scores indicate gaps. Always give the current score and the specific improvements needed to reach 10/10.

The Release It! Framework

Six areas that determine whether software survives contact with production:

1. Stability Anti-Patterns

Core concept: Failures propagate through integration points and cascade across system boundaries. The most dangerous patterns are not bugs in your code — they are emergent behaviors when systems interact under stress.

Why it works: Every production outage traces back to one or more of these predictable, recurring patterns; recognizing them lets you eliminate the cracks before production traffic finds them.

Key insights:

Integration points are the number-one killer — every socket, HTTP call, or queue is a risk
Slow responses are worse than no response: they tie up threads, exhaust pools, and propagate delay up the call chain
Unbounded result sets turn a harmless query into an out-of-memory crash once data outgrows test assumptions
Users generate load no test predicts — bots, retry storms, flash crowds; self-denial attacks happen when your own marketing overwhelms your infrastructure
Blocked threads are the silent killer — deadlocks and contention show no errors until everything stops

Code applications:

Context	Pattern	Example
HTTP calls	Assume every remote call can fail, hang, or return garbage	Wrap all external calls with timeout + circuit breaker
Database queries	Enforce result set limits	Add `LIMIT`; paginate all list endpoints
Thread pools	Isolate pools per dependency	Separate pool for payment gateway vs. search
Marketing events	Coordinate launches with capacity planning	Pre-scale before Black Friday; queue coupon redemptions

See: references/anti-patterns.md for each anti-pattern with failure scenarios and detection strategies.

2. Stability Patterns

Core concept: Counter each anti-pattern with a stability pattern: circuit breakers stop cascades, bulkheads isolate blast radius, timeouts reclaim stuck resources. Together they make a system bend under load instead of breaking.

Why it works: These patterns accept failure as inevitable and design the response to it — a circuit breaker that trips is the system working correctly, protecting itself from a downstream failure.

Key insights:

Circuit Breaker: three states (closed, open, half-open) — trips after threshold failures, periodically tests recovery
Timeouts: every outbound call needs connect AND read timeouts, propagated up the call chain
Retry with exponential backoff + jitter prevents thundering herd on recovery
Fail Fast: reject requests you know will fail instead of wasting resources; Handshaking lets the server decline work before it's sent
Steady State: systems accumulate cruft (logs, sessions, temp files) — design automatic cleanup
Let It Crash: a clean restart often beats limping along in an unknown state

Code applications:

Context	Pattern	Example
Service calls	Circuit Breaker	Open after 5 failures in 60s; half-open after 30s
Resource isolation	Bulkhead	Dedicated connection pools for critical vs. non-critical
Network calls	Timeout with propagation	Connect 1s, read 5s; propagate deadline downstream
Retries	Backoff + jitter + budget	Base 100ms, max 3 retries, 20% fleet retry budget
Data cleanup	Steady State	Purge sessions >24h; rotate logs at 500MB

See: references/stability-patterns.md for state machines, threshold tuning, and pattern combinations.

3. Capacity and Availability

Core concept: Capacity is not one number — it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and threads. Capacity planning means knowing which resource bottlenecks first, and at what load.

Why it works: Untested systems fail at peak load — the worst possible moment. Knowing actual (not theoretical) limits lets you set realistic SLAs and scale before users hit the wall.

Key insights:

Test taxonomy: load test (expected traffic), stress test (beyond limits), soak test (sustained, catches leaks), spike test (sudden bursts)
Universal Scalability Law: throughput never scales linearly — contention and coherence costs cause diminishing returns
Pool exhaustion looks identical to a database outage from the application's perspective; size pools from measured concurrency, not defaults
"The cloud is infinitely scalable" is a myth — auto-scaling has lag, cold starts, and hard limits

Code applications:

Context	Pattern	Example
Load testing	Ramp to peak, then 2x, observe degradation	Increase RPS until latency exceeds SLO
Connection pools	Size from measured concurrency	Set pool to P99 active connections + 20% headroom
Soak testing	80% capacity for 24-72 hours	Catch memory/connection/file-handle leaks
Capacity model	Document bottleneck per service	"Service X is memory-bound at 2000 RPS; 4GB per instance"

See: references/capacity-planning.md for testing methodologies, pool management, and scalability modeling.

4. Deployment and Release

Core concept: Deployment (putting code on servers) and release (exposing it to users) are separate operations that should be decoupled — deploy without risk, release with confidence.

Why it works: Most outages are caused by changes. Decoupling lets you deploy to production, verify, and only then route traffic; if something breaks, you roll back the release, not the deployment.

Key insights:

Zero-downtime deployment is non-negotiable: rolling, blue-green, or canary
Feature flags dark-launch code and enable it independently of deployment
Database migrations must be backward-compatible — old and new code run simultaneously during deploys (expand-contract)
Immutable infrastructure: never patch a running server — build a new image, deploy, destroy the old
Rollback must be faster than roll-forward; if rollback takes 30 minutes, you will avoid deploying

Code applications:

Context	Pattern	Example
Deploys	Blue-green with health check gate	Deploy to green; smoke test; swap router
Progressive rollout	Canary with automated rollback	5% traffic to canary; auto-rollback if error rate >1%
Feature launch	Flags with emergency off switch	Ship behind flag; enable for 10%; monitor; ramp
Schema changes	Expand-contract migration	Add column; write both; backfill; drop old

See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code.

5. Health Checks and Observability

Core concept: You cannot operate what you cannot observe. Health checks, metrics, logs, and traces are the sensory organs of your system in production — a first-class design concern, not an afterthought.

Why it works: Production systems fail invisibly without instrumentation. Done right, observability answers questions about your system that you did not anticipate at design time.

Key insights:

Health checks come in two flavors: shallow (process alive) and deep (dependencies reachable, resources available)
Three pillars: structured logs (what happened), metrics (how much), distributed traces (where and how long)
RED method for services: Rate, Errors, Duration; USE method for resources: Utilization, Saturation, Errors
Define SLIs (measure user experience) → SLOs (targets) → SLAs (contracts), in that order
Alert on symptoms users feel (error rate, latency), not causes (CPU); dashboards should answer "is the system healthy?" within 5 seconds

Code applications:

Context	Pattern	Example
Health endpoints	Deep health check	`/health` reports DB, cache, queue, disk status
Service metrics	RED instrumentation	Rate, error rate, p50/p95/p99 latency per endpoint
Distributed tracing	Propagate trace context	Trace ID in headers; correlate logs across services
Alerting	SLO burn rate, not raw thresholds	"Error budget burning 10x" vs. "CPU > 80%"

See: references/observability.md for health check design, SLO frameworks, and alerting strategies.

6. Adaptation and Chaos Engineering

Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.

Core concept: Confidence in resilience comes from testing under realistic failure conditions. Chaos engineering experiments on a system in a controlled way to build confidence it withstands turbulence.

Why it works: You cannot know how a system handles failure until it actually fails; controlled injection turns unknown-unknowns into known-knowns before they cause real outages.

Key insights:

Define steady state first — you need a measurable baseline to detect deviation
Every experiment has a hypothesis: "We believe that when X fails, the system will Y"
Start small in non-production (kill one process, add latency to one call), then escalate gradually with approvals
Minimize blast radius: canary populations, feature flags, emergency stop; production experiments require explicit authorization and instant rollback
Automate recurring experiments; GameDay exercises test both the system and the team
Build a culture where finding weaknesses is celebrated, not punished

Code applications:

Context	Pattern	Example
Process failure	Controlled termination via chaos tooling	Kill one pod with Gremlin/Litmus; verify recovery within SLO
Network failure	Inject latency/partition via chaos tooling	+500ms on DB calls; verify circuit breaker trips
Dependency failure	Simulate downstream outage via chaos tooling	Return 503 from payment API; verify graceful degradation
GameDay	Scheduled team exercise	"Primary DB goes read-only at 2pm" — practice response

See: references/chaos-engineering.md for experiment design, blast radius management, and building the practice.

Common Mistakes

Mistake	Why It Fails	Fix
No timeouts on outbound calls	One slow dependency freezes the system	Connect and read timeouts on every external call
Unbounded retries	Retry storms amplify failures	Exponential backoff, jitter, fleet-wide retry budgets
Shared thread/connection pools	One failing dependency drains everything	Bulkhead: isolate pools per dependency
Shallow health checks only	Traffic routed to instances with broken dependencies	Deep health checks that verify downstream connectivity
Testing only the happy path	Works perfectly until the first real failure	Load, soak, and chaos test before major releases
Coupling deploy and release	Every deployment is all-or-nothing high risk	Feature flags, canary, blue-green
Alerting on causes, not symptoms	CPU alerts fire while users suffer silently	Alert on user-facing SLIs: errors, latency, availability
No capacity model	System falls over at 2x load	Model bottlenecks; load test to 3x expected peak

Quick Diagnostic

Audit any production system:

Question	If No	Action
Does every outbound call have a timeout?	Calls hang, blocking threads	Add connect and read timeouts everywhere
Are circuit breakers on critical dependencies?	One failure takes down the system	Add breakers with tuned thresholds
Are pools isolated per dependency?	Failures cross-contaminate	Implement bulkheads with dedicated pools
Can you deploy without downtime?	Deployments cause outages	Rolling, blue-green, or canary deployment
Do health checks verify dependencies?	Dead instances receive traffic	Deep health checks testing DB, cache, queue
Are logs, metrics, and traces correlated?	Debugging means manual log searches	Distributed tracing with correlated IDs
Have you load-tested beyond expected peak?	Unknown failure mode under real load	Test to 2-3x peak; document the breaking point
Do you practice failure injection?	Resilience is theoretical	Start chaos engineering with low-risk experiments

Reference Files

anti-patterns.md: Integration point failures, cascading failures, blocked threads, unbounded result sets, self-denial attacks, slow responses
stability-patterns.md: Circuit Breaker, Bulkhead, Timeout, Retry, Fail Fast, Steady State, Let It Crash, Handshaking
capacity-planning.md: Load/stress/soak testing, connection pool sizing, thread pool tuning, Universal Scalability Law
deployment-strategies.md: Blue-green, canary, rolling deploys, feature flags, database migrations, immutable infrastructure
observability.md: Health checks, RED/USE methods, SLIs/SLOs/SLAs, distributed tracing, alerting strategy
chaos-engineering.md: Steady state hypothesis, failure injection, GameDay exercises, blast radius management

About the Author

Michael T. Nygard is a software architect with 30+ years building and operating large-scale production systems handling millions of transactions per day. Release It! (2007; 2nd edition 2018) became a foundational text of the DevOps and site reliability engineering movements, arguing that architects must stay responsible for systems long after the code is written.

release-it

Release It! Framework

Core Principle

Scoring

The Release It! Framework

1. Stability Anti-Patterns

2. Stability Patterns

3. Capacity and Availability

4. Deployment and Release

5. Health Checks and Observability

6. Adaptation and Chaos Engineering

Common Mistakes

Quick Diagnostic

Reference Files

Further Reading

About the Author