How to Write a Postmortem That Prevents Repeat Incidents
A practical guide to writing blameless incident postmortems with a clear structure, quality rubric, KPI benchmarks, and a copy-ready template your team can use today.
Most postmortems fail because teams document the outage, then skip systemic fixes. A useful postmortem changes engineering behavior after the incident ends.
Use this guide if you want postmortems that reduce repeat failures, lower MTTR, and improve on-call quality.
What a strong postmortem includes
- Incident summary: start time, end time, severity, impacted systems.
- Customer impact: who was affected, for how long, and how badly.
- Detection path: monitor, customer report, or internal observation.
- Timeline: UTC-stamped sequence from first symptom to full recovery.
- Root cause: single initiating failure.
- Contributing factors: controls that failed to prevent blast radius.
- Actions: owner, due date, and measurable success condition.
If any of these is missing, the team cannot reliably learn from the incident.
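A quick completeness check can catch a missing section before review. This is a minimal sketch, assuming the postmortem draft is held as a plain dictionary; the key names and the `missing_sections` helper are hypothetical and simply mirror the checklist above.

```python
# Hypothetical section keys that mirror the checklist above.
REQUIRED_SECTIONS = (
    "incident_summary",
    "customer_impact",
    "detection_path",
    "timeline",
    "root_cause",
    "contributing_factors",
    "actions",
)


def missing_sections(postmortem: dict) -> list[str]:
    """Return required sections that are absent or empty, in checklist order."""
    return [name for name in REQUIRED_SECTIONS if not postmortem.get(name)]


draft = {
    "incident_summary": "Checkout API returned 5xx for 27 minutes",
    "customer_impact": "14.2% of checkout attempts failed",
    "timeline": ["(UTC-stamped events go here)"],
    "root_cause": "",  # present but empty, so it still counts as missing
}

print(missing_sections(draft))
# ['detection_path', 'root_cause', 'contributing_factors', 'actions']
```

An empty string or empty list counts as missing, so a placeholder heading with no content does not pass.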
Postmortem quality rubric
| Section | Weak | Strong |
|---|---|---|
| Summary | "Service had issues" | "Checkout API returned 5xx for 27 minutes; 14.2% of checkout attempts failed" |
| Root cause | "Human error" | "Deploy pipeline allowed unsafe schema change without migration guard" |
| Contributors | Generic list | Technical + process factors tied to timeline evidence |
| Actions | "Improve monitoring" | "Add synthetic checkout monitor in 3 regions; owner: SRE lead; due: Jul 15" |
| Follow-up | No review date | Action status reviewed in weekly ops review |
7-step process to write the postmortem
1) Start with impact, not internals
Write customer and business impact first. This aligns engineering, product, and support before debating implementation details.
2) Build timeline from evidence
Use monitor events, logs, deploy history, and incident chat timestamps. Avoid reconstructing from memory.
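One way to assemble the timeline from evidence is to merge exported events from each source and sort them by UTC time. The sketch below assumes each source can be exported as `(ISO 8601 timestamp with UTC offset, source name, event text)` tuples; the `build_timeline` helper and the sample events and dates are hypothetical.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge (iso_timestamp, origin, event) tuples and sort them by UTC time."""
    merged = []
    for source in sources:
        for stamp, origin, event in source:
            # Timestamps must carry an explicit offset such as +00:00.
            when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
            merged.append((when, origin, event))
    merged.sort(key=lambda row: row[0])
    return [f"{when:%H:%M:%S}Z [{origin}] {event}" for when, origin, event in merged]


# Hypothetical evidence; the deploy log is recorded in a +02:00 local zone.
monitor_events = [("2024-01-01T10:02:10+00:00", "monitor", "5xx rate alert fired")]
deploy_history = [("2024-01-01T11:58:30+02:00", "deploy", "schema change deployed")]
incident_chat = [("2024-01-01T10:05:00+00:00", "chat", "on-call acknowledged")]

timeline = build_timeline(monitor_events, deploy_history, incident_chat)
for entry in timeline:
    print(entry)
# 09:58:30Z [deploy] schema change deployed
# 10:02:10Z [monitor] 5xx rate alert fired
# 10:05:00Z [chat] on-call acknowledged
```

Normalizing to UTC before sorting matters here: the deploy looks like the latest event in local time but is actually the first.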
3) Separate root cause from contributors
- Root cause: the initiating failure.
- Contributors: missing safeguards, alerting gaps, or process failures.
4) Quantify detection and recovery
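Include the incident's own detection, acknowledgement, and resolution times. Averaged across incidents, they feed: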
- MTTD (mean time to detect)
- MTTA (mean time to acknowledge)
- MTTR (mean time to resolve)
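These metrics fall out of four timestamps per incident. The sketch below is illustrative, not a standard: it assumes time to detect and time to resolve are measured from incident start, and time to acknowledge from detection. Teams define these differently, so state your own convention in the postmortem. The function names and sample timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def incident_metrics(started, detected, acknowledged, resolved):
    """Per-incident durations; the 'mean' comes from averaging across incidents."""
    return {
        "time_to_detect_min": minutes_between(started, detected),
        "time_to_acknowledge_min": minutes_between(detected, acknowledged),
        "time_to_resolve_min": minutes_between(started, resolved),
    }


# Two hypothetical incidents.
incidents = [
    incident_metrics(
        started="2024-01-01T10:00:00+00:00",
        detected="2024-01-01T10:02:00+00:00",
        acknowledged="2024-01-01T10:05:00+00:00",
        resolved="2024-01-01T10:27:00+00:00",
    ),
    incident_metrics(
        started="2024-01-08T14:00:00+00:00",
        detected="2024-01-08T14:04:00+00:00",
        acknowledged="2024-01-08T14:05:00+00:00",
        resolved="2024-01-08T14:33:00+00:00",
    ),
]

mttd = mean(i["time_to_detect_min"] for i in incidents)
mtta = mean(i["time_to_acknowledge_min"] for i in incidents)
mttr = mean(i["time_to_resolve_min"] for i in incidents)
print(mttd, mtta, mttr)  # 3.0 2.0 30.0
```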
5) Make actions measurable
Each action should include:
- owner
- due date
- success metric
- risk reduced
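A small record type makes incomplete actions easy to spot. This is a sketch under assumptions: the `ActionItem` class and `missing_fields` helper are hypothetical, and the year, success metric, and risk text in the second example are invented for illustration.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    success_metric: Optional[str] = None
    risk_reduced: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Names of fields that are unset or empty on this action item."""
    return [f.name for f in fields(item) if not getattr(item, f.name)]


weak = ActionItem("Improve monitoring")
strong = ActionItem(
    "Add synthetic checkout monitor in 3 regions",
    owner="SRE lead",
    due=date(2025, 7, 15),  # hypothetical year
    success_metric="Monitor alerts within 2 minutes of a checkout failure",  # hypothetical
    risk_reduced="Checkout failures detected only through customer reports",  # hypothetical
)

print(missing_fields(weak))    # ['owner', 'due', 'success_metric', 'risk_reduced']
print(missing_fields(strong))  # []
```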
6) Run a blameless language pass
Replace blame language with mechanism language. Name failing systems and decisions, not individual people.
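Part of this pass can be automated with a simple phrase scan. The sketch below is a rough aid, not a replacement for a human read: the phrase list and the `flag_blame_language` helper are hypothetical, and any real list should come from your own team's drafts.

```python
import re

# Hypothetical starter list of blame-oriented phrases.
BLAME_PATTERNS = [
    r"\bhuman error\b",
    r"\b(?:forgot|neglected) to\b",
    r"\bshould have\b",
    r"\bcareless(?:ly|ness)?\b",
]
_BLAME_RE = re.compile("|".join(BLAME_PATTERNS), re.IGNORECASE)


def flag_blame_language(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched phrase) pairs for each flagged phrase."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _BLAME_RE.finditer(line):
            hits.append((number, match.group(0)))
    return hits


draft_text = (
    "Root cause: human error during deploy.\n"
    "The engineer should have run the migration guard.\n"
    "Deploy pipeline allowed unsafe schema change without migration guard."
)

print(flag_blame_language(draft_text))
# [(1, 'human error'), (2, 'should have')]
```

The third line passes because it names the mechanism (the pipeline allowed the change) rather than a person.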
7) Track closure, not publication
A published postmortem without shipped actions is a status artifact, not an improvement loop.
Benchmarks for SaaS teams
| Metric | Strong | Acceptable | Needs work |
|---|---|---|---|
| MTTD | < 2 min | 2 to 5 min | > 10 min |
| MTTR | < 30 min | 30 to 90 min | > 120 min |
| Postmortem publication | < 2 days | 2 to 5 days | > 7 days |
| 30-day action closure | > 80% | 60 to 80% | < 60% |
These thresholds help teams turn reliability conversations into measurable targets.
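The 30-day action closure row can be computed directly from action close dates. This is a minimal sketch, assuming the 30-day window starts on the postmortem's publication date (the table does not specify the starting point). The function names and sample dates are hypothetical; the band boundaries follow the table.

```python
from datetime import date, timedelta


def closure_rate_30d(published: date, closed_dates) -> float:
    """Percent of actions closed within 30 days of publication.

    closed_dates holds one entry per action; None means still open.
    """
    if not closed_dates:
        raise ValueError("postmortem has no action items")
    deadline = published + timedelta(days=30)
    on_time = sum(1 for closed in closed_dates if closed is not None and closed <= deadline)
    return 100 * on_time / len(closed_dates)


def closure_band(rate: float) -> str:
    """Map a closure percentage to the benchmark bands in the table."""
    if rate > 80:
        return "Strong"
    if rate >= 60:
        return "Acceptable"
    return "Needs work"


# Hypothetical postmortem published on 2024-01-10 with five actions.
rate = closure_rate_30d(
    date(2024, 1, 10),
    [date(2024, 1, 20), date(2024, 2, 9), date(2024, 2, 10), None, date(2024, 1, 15)],
)
print(rate, closure_band(rate))  # 60.0 Acceptable
```

Open actions count against the rate, so a postmortem cannot look healthy by leaving items unresolved.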
Copy-ready incident postmortem template
# Incident Postmortem: [Title]
## 1) Summary
- Start: [UTC]
- End: [UTC]
- Duration: [minutes]
- Severity: [SEV1/SEV2/...]
- Services impacted: [list]
## 2) Customer Impact
- Affected users: [segment / %]
- User symptoms: [what they saw]
- Business impact: [errors, revenue, support volume]
## 3) Detection and Response Metrics
- MTTD: [minutes]
- MTTA: [minutes]
- MTTR: [minutes]
- Detection source: [monitoring/customer/internal]
## 4) Timeline (UTC)
- [time] [event]
- [time] [event]
## 5) Root Cause
[one specific statement]
## 6) Contributing Factors
- [factor]
- [factor]
## 7) What Worked / What Failed
- Worked: [list]
- Failed: [list]
## 8) Action Items
- [action] | Owner: [name] | Due: [date] | Success metric: [metric] | Risk reduced: [risk]
## 9) Follow-up Review Date
- [date]
Common mistakes to avoid
- Publishing without action owners.
- Using generic wording that hides mechanisms.
- Ignoring process contributors and focusing only on code bugs.
- Skipping closure tracking for action items.
Related blog posts
- /blog/incident-postmortem-template
- /blog/mttr-mttd-mtbf-incident-metrics
- /blog/incident-response-checklist-startups
- /blog/website-outage-response-runbook
- /blog/how-to-communicate-during-service-outage