incident-postmortem

Category: Coding · Risk: Medium · mohitagw15856/pm-claude-skills · MIT

network_access

name: incident-postmortem
description: "Write a structured incident postmortem or post-incident review. Use when asked to write a postmortem, incident report, P1/P2 review, outage report, or RCA (root cause analysis). Generates a blameless postmortem with timeline, root cause, contributing factors, impact summary, and action items."

Incident Postmortem Skill

This skill produces a complete, blameless incident postmortem document in an industry-standard format. The output keeps blameless framing throughout (system gaps over individual failures) and drives toward specific, closeable action items rather than vague process commitments.

Proposes Actions

The action items don't have to stay on the page: hand them to action-runner, which previews them (dry-run, risk-rated), runs only what you approve via the connected action MCP, and records what was done back to the brain. Typical: file a follow-up issue per action item (🟡), assigned to its owner with a due date. This skill proposes; action-runner gates and runs — never silently.

Required Inputs

Ask the user for these if not provided:

  • Incident title / ID
  • Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
  • Date and duration of the incident
  • What happened (rough notes are fine — the skill will structure them)
  • Services or systems affected
  • Customer impact (how many users, what was degraded)
  • How it was detected
  • How it was resolved
  • Initial thoughts on root cause
  • Action items already identified (optional)
  • Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
  • Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)

Output Format


Incident Postmortem: [Incident Title]

Incident ID: [ID]
Severity: [P1/P2/P3]
Date: [Date]
Duration: [Start time → Resolution time — total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]


Executive Summary

[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]


Impact

| Dimension | Details |
|---|---|
| Users affected | [Number or percentage] |
| Services degraded | [List affected services] |
| Business impact | [Revenue, SLA breach, support tickets, etc. if known] |
| Duration | [Total time from first detection to full resolution] |

Timeline

List events in chronological order. Each entry: [HH:MM UTC] — [What happened. Who did what. What changed.]

Rules for timeline entries:

  • Use passive or system-focused language — avoid "X made a mistake"
  • Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
  • Note time between key events (e.g. "22 minutes between detection and escalation")
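The gaps between key events can be computed rather than estimated by hand. The following is a minimal sketch, assuming timeline entries are already available as (HH:MM UTC, label) pairs in chronological order; the function name `minutes_between` and the sample times are hypothetical and not part of the skill.

```python
from datetime import datetime, timedelta


def minutes_between(entries):
    """Return (from_label, to_label, minutes) for each consecutive pair.

    Assumes entries are in chronological order. A time earlier than its
    predecessor is treated as crossing midnight into the next day.
    """
    gaps = []
    for (t1, label1), (t2, label2) in zip(entries, entries[1:]):
        start = datetime.strptime(t1, "%H:%M")
        end = datetime.strptime(t2, "%H:%M")
        if end < start:
            end += timedelta(days=1)
        gaps.append((label1, label2, int((end - start).total_seconds() // 60)))
    return gaps


# Hypothetical timeline, invented for illustration only.
timeline = [
    ("14:02", "first symptom"),
    ("14:10", "detection"),
    ("14:32", "escalation"),
    ("15:05", "fix applied"),
    ("15:20", "resolution confirmed"),
]

for start_label, end_label, mins in minutes_between(timeline):
    print(f"{mins} minutes between {start_label} and {end_label}")
```

With the sample data, the second line printed is "22 minutes between detection and escalation", which is the kind of note the timeline rules ask for.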

Root Cause

Primary root cause: [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]

Contributing factors:

  • [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
  • [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
  • [Factor 3 — add as many as are relevant]

Why did our existing safeguards not prevent this?
[Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]


Detection

  • How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
  • Time from incident start to detection: [X minutes]
  • Should we have detected this faster? [Yes / No — and why]

Resolution

What fixed it? [Clear description of the actual fix — one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes/No — describe if yes]


Action Items

| # | Action | Owner | Due Date | Priority |
|---|---|---|---|---|
| 1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3 |

Rules for action items:

  • Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
  • Classify each action by purpose: prevent recurrence (fix the root cause), improve detection (catch it faster next time), or improve response (resolve it faster next time)
  • Assign a real owner — not "team" or "TBD" if avoidable
  • Flag P1 actions as items that block the incident from being marked fully closed
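One way to make these rules checkable is to hold action items as structured records. The sketch below is illustrative only: the `ActionItem` class, the `Category` enum, the placeholder-owner list, and the sample items are all assumptions introduced here, not something the skill defines.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Category(Enum):
    PREVENT = "Prevent recurrence"
    DETECT = "Improve detection"
    RESPOND = "Improve response"


# Owner values treated as placeholders (an assumed list; adjust to taste).
PLACEHOLDER_OWNERS = {"", "team", "tbd"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    priority: str  # "P1", "P2", or "P3"
    category: Category
    done: bool = False

    def has_real_owner(self) -> bool:
        return self.owner.strip().lower() not in PLACEHOLDER_OWNERS


def blocking_items(items):
    """Open P1 actions keep the incident from being marked fully closed."""
    return [item for item in items if item.priority == "P1" and not item.done]


# Hypothetical action items, invented for illustration only.
items = [
    ActionItem("Add a canary stage to the checkout-api deploy pipeline",
               "Payments platform lead", date(2025, 7, 1), "P1", Category.PREVENT),
    ActionItem("Lower the error-rate alert threshold from 5% to 1%",
               "TBD", None, "P2", Category.DETECT),
]

for item in items:
    if not item.has_real_owner() or item.due is None:
        print(f"Needs owner or due date: {item.action}")
print(f"Open P1 actions blocking closure: {len(blocking_items(items))}")
```

Keeping the category as an explicit field makes it easy to see at a glance whether an incident produced only detection fixes and nothing that addresses the root cause.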

What Went Well

[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]


Lessons Learned

[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]


Communication Log

[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]


Quality Checks

  • Timeline has no blame-focused language
  • Root cause is specific (not "human error")
  • Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
  • Contributing factors explain the systemic gaps
  • Every action item has an owner and due date
  • "What went well" section is genuine, not token
  • No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
  • Executive summary is readable by non-technical leadership

Anti-Patterns

  • Do not assign blame to individuals — postmortems must focus on system and process failures
  • Do not write action items with vague language like "improve monitoring" — each must name a specific, ownable change
  • Do not skip the contributing factors — root cause alone misses the systemic issues that enable incidents
  • Do not omit the detection timeline — how long it took to detect matters as much as how long it took to resolve
  • Do not treat the postmortem as closed until all action items have named owners and due dates

Usage Examples

  • "Write a postmortem for the [incident name] outage"
  • "Help me write a P1 incident report"
  • "Generate an RCA document for [service] going down on [date]"
  • "Draft a blameless postmortem from these notes: [paste notes]"