How to Write a Postmortem That Prevents Repeat Incidents

Most postmortems fail because teams document the outage, then skip systemic fixes. A useful postmortem changes engineering behavior after the incident ends.

Use this guide if you want postmortems that reduce repeat failures, lower MTTR, and improve on-call quality.

What a strong postmortem includes

Incident summary: start time, end time, severity, impacted systems.
Customer impact: who was affected, for how long, and how badly.
Detection path: monitor, customer report, or internal observation.
Timeline: UTC-stamped sequence from first symptom to full recovery.
Root cause: the single initiating failure, stated as a mechanism.
Contributing factors: controls that failed to prevent the incident or limit its blast radius.
Actions: owner, due date, and measurable success condition.

If any of these are missing, teams cannot learn reliably from the incident.
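A completeness check like this can run before a draft goes to review. The sketch below is a minimal illustration, assuming the postmortem is held as a dictionary; the key names and the `missing_sections` function are hypothetical and simply mirror the list above.

```python
# Hypothetical section keys mirroring the list above; rename to match your format.
REQUIRED_SECTIONS = (
    "incident_summary",
    "customer_impact",
    "detection_path",
    "timeline",
    "root_cause",
    "contributing_factors",
    "actions",
)


def _is_empty(value) -> bool:
    """Treat None, blank strings, and empty collections as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def missing_sections(postmortem: dict) -> list[str]:
    """Return the required sections that are absent or empty, in list order."""
    return [name for name in REQUIRED_SECTIONS if _is_empty(postmortem.get(name))]
```

A draft that returns a non-empty list is not ready to publish.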

Postmortem quality rubric

Section	Weak	Strong
Summary	"Service had issues"	"Checkout API returned 5xx for 27 minutes; 14.2% of checkout attempts failed"
Root cause	"Human error"	"Deploy pipeline allowed unsafe schema change without migration guard"
Contributors	Generic list	Technical + process factors tied to timeline evidence
Actions	"Improve monitoring"	"Add synthetic checkout monitor in 3 regions; owner: SRE lead; due: Jul 15"
Follow-up	No review date	Action status reviewed in weekly ops review

7-step process to write the postmortem

1) Start with impact, not internals

Write customer and business impact first. This aligns engineering, product, and support before debating implementation details.

2) Build timeline from evidence

Use monitor events, logs, deploy history, and incident chat timestamps. Avoid reconstructing from memory.
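One way to assemble an evidence-based timeline is to merge events from each source and sort them in UTC. This is a rough sketch, not a prescribed tool: the `Event` type, the source labels, and the example date are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime       # must be timezone-aware
    source: str        # e.g. "monitor", "deploy", "chat" (hypothetical labels)
    description: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one list ordered by UTC time."""
    merged = [event for source in sources for event in source]
    for event in merged:
        # Naive timestamps are ambiguous, so reject them instead of guessing.
        if event.at.tzinfo is None:
            raise ValueError(f"naive timestamp in {event.source}: {event.description}")
    return sorted(merged, key=lambda event: event.at.astimezone(timezone.utc))


def format_timeline(events: list[Event]) -> list[str]:
    """Render each event as a UTC-stamped bullet line."""
    return [
        f"- {event.at.astimezone(timezone.utc):%H:%M:%S}Z [{event.source}] {event.description}"
        for event in events
    ]


# Example with made-up events; the second timestamp is recorded at UTC+2.
deploys = [Event(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "deploy", "release started")]
monitors = [
    Event(
        datetime(2024, 1, 1, 14, 3, tzinfo=timezone(timedelta(hours=2))),
        "monitor",
        "5xx alert fired",
    )
]
timeline_lines = format_timeline(build_timeline(deploys, monitors))
```

Rejecting naive timestamps keeps chat logs exported in local time from silently reordering the sequence.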

3) Separate root cause from contributors

Root cause: the initiating failure.
Contributors: missing safeguards, alerting gaps, or process failures.

4) Quantify detection and recovery

Include:

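MTTD (mean time to detect): minutes from first symptom to detection.
MTTA (mean time to acknowledge): minutes from detection to a responder acknowledging the alert.
MTTR (mean time to resolve): minutes from first symptom to full recovery.

For a single incident, record the measured durations; the "mean" applies once you aggregate across incidents. Definitions vary between teams, so state which start point you use.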
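These values can be computed from four timestamps per incident. The sketch below assumes detection and resolution are measured from the first symptom and acknowledgment from detection; some teams choose other start points, so treat these definitions as one option. The class, the `utc` helper, and the example times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean


@dataclass(frozen=True)
class IncidentTimes:
    started: datetime       # first symptom
    detected: datetime      # first alert or report
    acknowledged: datetime  # responder acknowledged
    resolved: datetime      # full recovery

    def minutes_to_detect(self) -> float:
        return (self.detected - self.started).total_seconds() / 60

    def minutes_to_acknowledge(self) -> float:
        return (self.acknowledged - self.detected).total_seconds() / 60

    def minutes_to_resolve(self) -> float:
        return (self.resolved - self.started).total_seconds() / 60


def mean_metrics(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Aggregate per-incident durations into the three means, in minutes."""
    return {
        "MTTD": mean(i.minutes_to_detect() for i in incidents),
        "MTTA": mean(i.minutes_to_acknowledge() for i in incidents),
        "MTTR": mean(i.minutes_to_resolve() for i in incidents),
    }


def utc(hour: int, minute: int) -> datetime:
    """Hypothetical helper: a UTC timestamp on an arbitrary example date."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


incidents = [
    IncidentTimes(utc(12, 0), utc(12, 4), utc(12, 6), utc(12, 27)),
    IncidentTimes(utc(9, 0), utc(9, 2), utc(9, 6), utc(9, 33)),
]
summary = mean_metrics(incidents)
```

In the postmortem itself, report the single incident's durations; the means belong in a quarterly or monthly reliability review.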

5) Make actions measurable

Each action should include:

owner
due date
success metric
risk reduced
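The four fields above are easy to check mechanically. The following sketch is one possible validator; `ActionItem` and `action_gaps` are illustrative names, and the rule that a success metric must contain a number is a crude heuristic assumed here, not a rule from this guide.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    due: Optional[date] = None
    success_metric: str = ""
    risk_reduced: str = ""


def action_gaps(item: ActionItem) -> list[str]:
    """List what an action item still needs before it counts as measurable."""
    gaps = []
    if not item.owner.strip():
        gaps.append("owner")
    if item.due is None:
        gaps.append("due date")
    if not item.success_metric.strip():
        gaps.append("success metric")
    elif not re.search(r"\d", item.success_metric):
        # Heuristic (assumption): a measurable metric usually includes a number.
        gaps.append("success metric has no number")
    if not item.risk_reduced.strip():
        gaps.append("risk reduced")
    return gaps
```

For example, `action_gaps(ActionItem("Improve monitoring"))` reports all four fields as missing, which matches the "Weak" column of the rubric.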

6) Run a blameless language pass

Replace blame language with mechanism language. Name failing systems and decisions, not individual people.
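A simple phrase scan can support the language pass, although it does not replace a human read. The phrase list below is an assumption chosen for illustration; extend it with wording your team tends to use.

```python
import re

# Hypothetical starter list of blame phrases; not exhaustive.
BLAME_PHRASES = ("human error", "should have", "careless", "forgot to", "at fault")

_BLAME_RE = re.compile("|".join(re.escape(p) for p in BLAME_PHRASES), re.IGNORECASE)


def find_blame_language(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched phrase) for each blame phrase in the draft."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _BLAME_RE.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


draft = (
    "Root cause: human error.\n"
    "The engineer should have checked the migration.\n"
    "Deploy pipeline allowed an unsafe schema change without a migration guard."
)
flagged = find_blame_language(draft)
```

Here the first two lines are flagged, while the third, which names the mechanism, passes.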

7) Track closure, not publication

A published postmortem without shipped actions is a status artifact, not an improvement loop.

Benchmarks for SaaS teams

Metric	Strong	Acceptable	Needs work
MTTD	< 2 min	2 to 5 min	> 10 min
MTTR	< 30 min	30 to 90 min	> 2 hours
Postmortem publication	< 48 hours	2 to 5 days	> 7 days
30-day action closure	> 80%	60 to 80%	< 60%

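These thresholds help teams turn reliability conversations into measurable targets. The bands leave gaps (for example, MTTD between 5 and 10 minutes); treat values in a gap as borderline and set the exact cutoff for your own team.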
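The 30-day action closure row can be computed directly from action item dates. This sketch assumes "30-day closure" means the share of action items closed within 30 days of being opened, which is one plausible reading of the table; the function names and input shape are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def closure_rate_30d(items: list[tuple[date, Optional[date]]]) -> float:
    """Percent of (opened, closed) items closed within 30 days of opening."""
    if not items:
        raise ValueError("no action items to evaluate")
    window = timedelta(days=30)
    closed_in_time = sum(
        1
        for opened, closed in items
        if closed is not None and closed - opened <= window
    )
    return 100 * closed_in_time / len(items)


def rate_closure(percent: float) -> str:
    """Map a closure percentage onto the table's bands."""
    if percent > 80:
        return "strong"
    if percent >= 60:
        return "acceptable"
    return "needs work"
```

Open items count against the rate, so an action list that nobody closes shows up as "needs work" rather than disappearing from the metric.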

Copy-ready incident postmortem template

# Incident Postmortem: [Title]

## 1) Summary
- Start: [UTC]
- End: [UTC]
- Duration: [minutes]
- Severity: [SEV1/SEV2/...]
- Services impacted: [list]

## 2) Customer Impact
- Affected users: [segment / %]
- User symptoms: [what they saw]
- Business impact: [errors, revenue, support volume]

## 3) Detection and Response Metrics
- MTTD: [minutes]
- MTTA: [minutes]
- MTTR: [minutes]
- Detection source: [monitoring/customer/internal]

## 4) Timeline (UTC)
- [time] [event]
- [time] [event]

## 5) Root Cause
[one specific statement]

## 6) Contributing Factors
- [factor]
- [factor]

## 7) What Worked / What Failed
- Worked: [list]
- Failed: [list]

## 8) Action Items
- [action] | Owner: [name] | Due: [date] | Success metric: [metric]

## 9) Follow-up Review Date
- [date]

Common mistakes to avoid

Publishing without action owners.
Using generic wording that hides mechanisms.
Ignoring process contributors and focusing only on code bugs.
Skipping closure tracking for action items.

