Incident Postmortem Template
Fill in the details below. The postmortem document generates in real time. Copy to clipboard or download as Markdown, HTML, or plain text.
Incident Overview
P1 – All users affected
Impact
Summary
Timeline
All times in UTC
Root Cause
Good root cause: "A configuration change deployed at 14:32 UTC set the database connection pool limit to 5 instead of 50, causing connection exhaustion under normal traffic within 3 minutes."
Bad root cause: "An engineer made a mistake."
Contributing Factors
System conditions that allowed the root cause to turn into an outage
What Went Well
What worked as intended during the incident
What Needs Improvement
What could have gone better, or what went worse than expected
Action Items
Specific, assigned, dated - not vague intentions
"Improve monitoring" is not an action item. "Add connection pool saturation alert with 90% threshold - @alice - 2026-07-03 - High" is.
Generated Postmortem
# Incident Postmortem: Production API Outage

| | |
|---|---|
| **Date** | 2026-06-26 |
| **Severity** | P1 |
| **Status** | Draft |
| **Duration** | [e.g. 1h 23m] |
| **Affected services** | API |

## Impact

- **Users affected:** [Number or % of users who experienced impact]
- **Duration:** [Total time from first failure to full recovery]
- **Estimated revenue impact:** [Optional]

## Summary

[2-3 sentences: what failed, how many users were affected, how it was resolved, and how long it lasted.]

## Timeline

*All times UTC*

| Time | Event |
|------|-------|
| --:-- | First failure detected |
| --:-- | Alert fired, on-call engineer paged |
| --:-- | Incident declared, team assembled |
| --:-- | Root cause identified |
| --:-- | Fix deployed |
| --:-- | Service restored, incident closed |

## Root Cause

[Describe the specific technical condition that caused the failure. Name the system, the failure mode, and what allowed it to happen. "Human error" is not a root cause - describe the system condition that made the human action possible.]

## Contributing Factors

- [No alerting on the specific failure mode]
- [No runbook for this scenario]
- [Insufficient load testing for this failure condition]

## What Went Well

- [Detection was fast once the monitor fired]
- [Team assembled quickly and communication was clear]

## What Needs Improvement

- [Root cause identification took longer than expected]
- [Customer communication was delayed after detection]

## Action Items

| Action | Owner | Due date | Priority |
|--------|-------|----------|----------|
| Add alerting for [specific failure mode] | @owner | [date] | High |
| Write runbook for this scenario | @owner | [date] | High |
| Add regression test | @owner | [date] | Medium |

---

*Drafted 2026-06-26. Review and publish within 48 hours of resolution.*
What makes a postmortem useful
A postmortem is only valuable if it prevents the next incident of the same type. Most postmortems fail at this because they stop at symptoms instead of reaching systemic causes, or they produce vague action items that never get completed.
The sections in this template follow the blameless postmortem model, popularized in software engineering by Etsy and by Google's SRE practice and now widely adopted across engineering organizations. Blameless means the goal is to understand the system conditions that allowed a failure, not to identify who made a mistake. People make mistakes in systems that permit those mistakes to cascade; fixing the system prevents recurrence. Fixing the person doesn't.
The action items section is where most postmortems collapse. "Improve monitoring" assigned to no one with no due date gets closed as "done" at the next sprint retro without anything changing. Every action item in this template requires a specific description, a named owner, and a due date. If you can't fill those in, the action item isn't ready to be written down yet.
The five sections that matter most
Timeline
Reconstruct the exact sequence of events with timestamps. The timeline is the most labor-intensive section and the most valuable. It reveals detection gaps, decision points, and the actual sequence of events versus what people remembered. Source it from monitoring alerts, deployment logs, Slack timestamps, and on-call acknowledgment records.
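As a rough illustration of the merging step, the sketch below combines events from several sources into one UTC-ordered timeline and renders it as Markdown table rows. The `Event` type, the source names, and both function names are hypothetical; it assumes every source can be exported with timezone-aware timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime      # must be timezone-aware
    source: str       # hypothetical labels, e.g. "deploy-log", "slack"
    description: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one UTC-ordered list."""
    merged = [event for events in sources for event in events]
    for event in merged:
        # A naive timestamp cannot be placed reliably on a UTC timeline.
        if event.at.tzinfo is None:
            raise ValueError(f"naive timestamp from {event.source}")
    return sorted(merged, key=lambda e: e.at.astimezone(timezone.utc))


def render_rows(timeline: list[Event]) -> str:
    """Render the timeline as Markdown table rows with HH:MM UTC times."""
    lines = ["| Time | Event |", "|------|-------|"]
    for event in timeline:
        hhmm = event.at.astimezone(timezone.utc).strftime("%H:%M")
        lines.append(f"| {hhmm} | {event.description} ({event.source}) |")
    return "\n".join(lines)
```

Keeping the source next to each event makes it easier to spot where the record disagrees with what people remembered. Subtracting the first event's timestamp from the last gives the incident duration.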
Root cause
Name the specific technical condition that allowed the failure. The test for a good root cause: if you fix only this, can the same incident still recur? If yes, you have a symptom, not a root cause. If your candidate root cause is "the database ran out of connections," ask why the connection limit was set too low and why there was no alerting on connection pool saturation. Keep asking why until you reach a fixable system condition.
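To make the missing-alert question concrete, here is a minimal sketch of a connection pool saturation check. The function names are hypothetical, and the 90% default threshold is borrowed from the template's example action item rather than from any particular monitoring product.

```python
def pool_saturation(in_use: int, limit: int) -> float:
    """Fraction of the connection pool currently in use."""
    if limit <= 0:
        raise ValueError("pool limit must be positive")
    return in_use / limit


def should_alert(in_use: int, limit: int, threshold: float = 0.9) -> bool:
    """True when pool usage is at or above the alert threshold."""
    return pool_saturation(in_use, limit) >= threshold


# Same traffic (5 connections in use), two different configured limits.
print(should_alert(in_use=5, limit=50))  # intended limit
print(should_alert(in_use=5, limit=5))   # misconfigured limit
```

With the intended limit of 50, five connections in use is 10% saturation and stays quiet. With the limit misconfigured to 5, the same load is 100% and would fire the alert.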
Contributing factors
These are the conditions that transformed a minor fault into a significant outage. Examples: no alerting on the specific failure mode, no runbook for this scenario, an incomplete code review that missed the edge case, a staging environment that does not match production. Contributing factors are where the highest-leverage fixes live.
What went well
Document what worked as intended. Fast detection, clear communication, a runbook that was actually followed. This section matters because it tells you what to protect when you make changes. Teams that skip it often accidentally break the things that were working correctly.
Action items
Each action item needs three things: a specific description of what will change, a named owner, and a due date. Track completion at your next incident review. A postmortem with no completed action items after 30 days is a postmortem that didn't work.
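The three requirements and the 30-day check can be sketched as a small validation pass. Everything here is illustrative: the `ActionItem` type and function names are hypothetical, the word-count test is only a crude stand-in for "specific", and "after 30 days" is read as 30 or more days since publication.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def problems(item: ActionItem) -> list[str]:
    """List the reasons an action item is not ready to be written down."""
    issues = []
    # Crude heuristic (an assumption): very short descriptions are vague.
    if len(item.description.split()) < 4:
        issues.append("description too short to be specific")
    # "@owner" is the unfilled placeholder from the template.
    if not item.owner or item.owner == "@owner":
        issues.append("no named owner")
    if item.due is None:
        issues.append("no due date")
    return issues


def postmortem_stalled(items: list[ActionItem], published: date, today: date) -> bool:
    """True if 30+ days have passed and no action item is complete."""
    return (today - published).days >= 30 and not any(i.done for i in items)
```

Run against the template's own examples, "Improve monitoring" with no owner or date fails all three checks, while "Add connection pool saturation alert with 90% threshold" owned by @alice and due 2026-07-03 passes.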
When to write a postmortem
| Incident type | Postmortem? | Audience |
|---|---|---|
| P1: full outage, all users affected | Required | Full team + leadership; consider publishing externally |
| P2: major feature broken, significant user impact | Required | Engineering team; internal only |
| P3: minor feature broken, small subset of users | Recommended | Affected team; internal only |
| P4: internal tools, no user impact | Optional | Team lead; internal only |
| Any incident that recurred within 30 days | Required | Engineering team; review previous postmortem |
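If you want to encode this policy in tooling, one possible sketch is a lookup keyed on severity with a recurrence override. The names are hypothetical, and "within 30 days" is assumed to include day 30.

```python
from enum import Enum
from typing import Optional


class Requirement(Enum):
    REQUIRED = "Required"
    RECOMMENDED = "Recommended"
    OPTIONAL = "Optional"


# Mirrors the severity rows of the table above.
BY_SEVERITY = {
    "P1": Requirement.REQUIRED,
    "P2": Requirement.REQUIRED,
    "P3": Requirement.RECOMMENDED,
    "P4": Requirement.OPTIONAL,
}


def postmortem_requirement(
    severity: str, days_since_previous: Optional[int] = None
) -> Requirement:
    """Return whether a postmortem is required for an incident.

    days_since_previous is the number of days since the same incident
    last occurred, or None if this is not a recurrence.
    """
    if days_since_previous is not None and days_since_previous <= 30:
        # A recurrence within 30 days is always required, at any severity.
        return Requirement.REQUIRED
    return BY_SEVERITY[severity]
```

The recurrence check runs first so that a repeated P4 incident is not dismissed as optional.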
Detect incidents faster - write fewer postmortems
Vantaj monitors your services from multiple global regions and alerts your team within seconds of a confirmed failure. Shorter detection time means shorter incident duration and smaller postmortem scope.