
Incident Postmortem Template

Fill in the details below. The postmortem document is generated in real time as you type. Copy it to the clipboard or download it as Markdown, HTML, or plain text.

Incident Overview

Incident Title

Date

Duration

Severity

P1 – All users affected

Document Status

Affected Services

Incident Commander

Participants (optional)

Impact

Users Affected

Revenue Impact (optional)

SLA breach occurred

Summary

Executive Summary

Timeline

All times in UTC

Root Cause

Technical Root Cause

Good root cause: "A configuration change deployed at 14:32 UTC set the database connection pool limit to 5 instead of 50, causing connection exhaustion under normal traffic within 3 minutes."

Bad root cause: "An engineer made a mistake."

Contributing Factors

System conditions that allowed the root cause to cause an outage

What Went Well

What worked as intended during the incident

What Needs Improvement

What could have gone better, and what went worse than expected

Action Items

Specific, assigned, dated - not vague intentions

"Improve monitoring" is not an action item. "Add connection pool saturation alert with 90% threshold - @alice - 2026-07-03 - High" is.

Generated Postmortem


# Incident Postmortem: Production API Outage

| | |
|---|---|
| **Date** | 2026-06-26 |
| **Severity** | P1 |
| **Status** | Draft |
| **Duration** | [e.g. 1h 23m] |
| **Affected services** | API |

## Impact

- **Users affected:** [Number or % of users who experienced impact]
- **Duration:** [Total time from first failure to full recovery]
- **Estimated revenue impact:** [Optional]

## Summary

[2-3 sentences: what failed, how many users were affected, how it was resolved, and how long it lasted.]

## Timeline

*All times UTC*

| Time | Event |
|------|-------|
| --:-- | First failure detected |
| --:-- | Alert fired, on-call engineer paged |
| --:-- | Incident declared, team assembled |
| --:-- | Root cause identified |
| --:-- | Fix deployed |
| --:-- | Service restored, incident closed |

## Root Cause

[Describe the specific technical condition that caused the failure. Name the system, the failure mode, and what allowed it to happen. "Human error" is not a root cause - describe the system condition that made the human action possible.]

## Contributing Factors

- [No alerting on the specific failure mode]
- [No runbook for this scenario]
- [Insufficient load testing for this failure condition]

## What Went Well

- [Detection was fast once the monitor fired]
- [Team assembled quickly and communication was clear]

## What Needs Improvement

- [Root cause identification took longer than expected]
- [Customer communication was delayed after detection]

## Action Items

| Action | Owner | Due date | Priority |
|--------|-------|----------|----------|
| Add alerting for [specific failure mode] | @owner | [date] | High |
| Write runbook for this scenario | @owner | [date] | High |
| Add regression test | @owner | [date] | Medium |

---

*Drafted 2026-06-26. Review and publish within 48 hours of resolution.*

What makes a postmortem useful

A postmortem is only valuable if it prevents the next incident of the same type. Most postmortems fail at this because they stop at symptoms instead of reaching systemic causes, or they produce vague action items that never get completed.

The sections in this template follow the blameless postmortem model, popularized by Google's Site Reliability Engineering practice and adopted across many engineering organizations. Blameless means the goal is to understand the system conditions that allowed a failure, not to identify who made a mistake. People make mistakes in systems that permit those mistakes to cascade; fixing the system prevents recurrence. Fixing the person doesn't.

The action items section is where most postmortems collapse. "Improve monitoring" assigned to no one with no due date gets closed as "done" at the next sprint retro without anything changing. Every action item in this template requires a specific description, a named owner, and a due date. If you can't fill those in, the action item isn't ready to be written down yet.

The five sections that matter most

Timeline

Reconstruct the exact sequence of events with timestamps. The timeline is the most labor-intensive section and the most valuable. It reveals detection gaps, decision points, and where the actual sequence of events differs from what people remember. Source it from monitoring alerts, deployment logs, Slack timestamps, and on-call acknowledgment records.
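
Merging several sources into one ordered timeline is mostly a sorting problem. The following is a minimal Python sketch, not part of the template tool itself: the `TimelineEvent` type, the source labels, and the sample events are hypothetical, and it assumes every timestamp is timezone-aware so that converting to UTC is unambiguous.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime       # must be timezone-aware
    source: str        # hypothetical label, e.g. "alerts", "deploys", "slack"
    description: str


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from several sources into one list ordered by UTC time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.at.astimezone(timezone.utc))


def format_duration(start: datetime, end: datetime) -> str:
    """Format elapsed time in the style of the template's duration field."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


def to_markdown_rows(events: list[TimelineEvent]) -> str:
    """Render events as the Time/Event table used in the generated postmortem."""
    rows = ["| Time | Event |", "|------|-------|"]
    for event in events:
        utc_time = event.at.astimezone(timezone.utc)
        rows.append(f"| {utc_time:%H:%M} | {event.description} ({event.source}) |")
    return "\n".join(rows)


if __name__ == "__main__":
    utc = timezone.utc
    local = timezone(timedelta(hours=2))  # a source that logs in UTC+2
    deploys = [TimelineEvent(datetime(2026, 6, 26, 14, 32, tzinfo=utc),
                             "deploys", "Configuration change deployed")]
    alerts = [TimelineEvent(datetime(2026, 6, 26, 14, 35, tzinfo=utc),
                            "alerts", "Alert fired, on-call engineer paged")]
    chat = [TimelineEvent(datetime(2026, 6, 26, 17, 55, tzinfo=local),
                          "slack", "Service restored, incident closed")]
    timeline = build_timeline(alerts, chat, deploys)
    print(to_markdown_rows(timeline))
    print(format_duration(timeline[0].at, timeline[-1].at))
```

Keeping the source label on each row makes it easy to see which system recorded an event first, which is often where detection gaps show up.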

Root cause

Name the specific technical condition that allowed the failure. The test for a good root cause: if you fix only this, can the same incident still recur? If yes, you have a symptom, not a root cause. If your first answer is "the database ran out of connections," ask why the connection limit was set too low and why there was no alerting on connection pool saturation. Keep asking why until you reach a fixable system condition.

Contributing factors

These are the conditions that transformed a minor fault into a significant outage. Examples: no alerting on the specific failure mode, no runbook for this scenario, an incomplete code review that missed the edge case, a staging environment that does not match production. Contributing factors are where the highest-leverage fixes live.

What went well

Document what worked as intended. Fast detection, clear communication, a runbook that was actually followed. This section matters because it tells you what to protect when you make changes. Teams that skip it often accidentally break the things that were working correctly.

Action items

Each action item needs three things: a specific description of what will change, a named owner, and a due date. Track completion at your next incident review. A postmortem with no completed action items after 30 days is a postmortem that didn't work.
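
The three requirements above, plus the 30-day check, are simple enough to enforce mechanically. The sketch below is one possible way to do that in Python; the `ActionItem` type, the four-word specificity heuristic, and the reading of "after 30 days" as "30 or more days since publication" are all assumptions made for illustration, not rules from the template.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: a description shorter than this many words is too vague to act on.
MIN_DESCRIPTION_WORDS = 4


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # e.g. "@alice"
    due: Optional[date] = None
    priority: str = "Medium"
    done: bool = False


def problems(item: ActionItem) -> list[str]:
    """Return the reasons an action item is not ready to be written down."""
    issues = []
    if len(item.description.split()) < MIN_DESCRIPTION_WORDS:
        issues.append("description is not specific")
    if not item.owner or item.owner == "@owner":  # "@owner" is the template placeholder
        issues.append("no named owner")
    if item.due is None:
        issues.append("no due date")
    return issues


def postmortem_stalled(items: list[ActionItem], published: date, today: date) -> bool:
    """True when 30 or more days have passed and no action item is complete."""
    old_enough = (today - published) >= timedelta(days=30)
    return old_enough and not any(item.done for item in items)


if __name__ == "__main__":
    vague = ActionItem("Improve monitoring")
    specific = ActionItem(
        "Add connection pool saturation alert with 90% threshold",
        owner="@alice", due=date(2026, 7, 3), priority="High",
    )
    for item in (vague, specific):
        print(item.description, "->", problems(item) or "ready")
```

A word count is a crude proxy for specificity; a reviewer still has to judge whether the description names a concrete change.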

When to write a postmortem

| Incident type | Postmortem? | Audience |
|---------------|-------------|----------|
| P1: full outage, all users affected | Required | Full team + leadership; consider publishing externally |
| P2: major feature broken, significant user impact | Required | Engineering team; internal only |
| P3: minor feature broken, small subset of users | Recommended | Affected team; internal only |
| P4: internal tools, no user impact | Optional | Team lead; internal only |
| Any incident that recurred within 30 days | Required | Engineering team; review previous postmortem |
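
If you want to encode this policy in tooling, for example to open a postmortem ticket automatically, a lookup table is usually enough. The Python sketch below is illustrative only: the `Policy` type and function name are hypothetical, and it assumes the recurrence rule only upgrades incidents that would not already require a postmortem, so a recurring P1 or P2 keeps its own audience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    requirement: str   # "Required", "Recommended", or "Optional"
    audience: str


# Rows copied from the table above, keyed by severity.
SEVERITY_POLICY = {
    "P1": Policy("Required", "Full team + leadership; consider publishing externally"),
    "P2": Policy("Required", "Engineering team; internal only"),
    "P3": Policy("Recommended", "Affected team; internal only"),
    "P4": Policy("Optional", "Team lead; internal only"),
}

RECURRENCE_POLICY = Policy("Required", "Engineering team; review previous postmortem")


def postmortem_policy(severity: str, recurred_within_30_days: bool = False) -> Policy:
    """Look up whether a postmortem is needed and who should read it."""
    base = SEVERITY_POLICY[severity.upper()]
    # Assumption: recurrence only overrides rows that are not already "Required".
    if recurred_within_30_days and base.requirement != "Required":
        return RECURRENCE_POLICY
    return base


if __name__ == "__main__":
    print(postmortem_policy("P3"))
    print(postmortem_policy("P4", recurred_within_30_days=True))
```

An unknown severity raises `KeyError`, which is deliberate here: a silent default would hide incidents that were never classified.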

Detect incidents faster - write fewer postmortems

Vantaj monitors your services from multiple global regions and alerts your team within seconds of a confirmed failure. Shorter detection time means shorter incident duration and smaller postmortem scope.
