How to Write an Incident Postmortem (With Template)

A postmortem that actually prevents the next outage. Here's a step-by-step guide with a ready-to-use template, real examples, and the mistakes that turn postmortems into wasted meetings.

Vantaj Team · June 21, 2026 · 10 min read

The Outage Is Over. Now What?

The service is back up. The alert is resolved. The Slack channel has gone quiet. Everyone goes back to what they were doing - and three weeks later, the same failure happens for the same reason.

This is what happens without a postmortem. The team fixes the symptom but never addresses the cause. The knowledge about what went wrong lives in one engineer's head, and when they're on vacation during the next incident, the team starts from zero.

A postmortem is the practice of documenting what happened, why it happened, and what you're going to do to prevent it from happening again. It's not a blame exercise. It's not a formality. It's the single most effective way to turn an outage into a lasting improvement.

When to Write a Postmortem

Not every incident needs a full postmortem. Writing one for a 2-minute blip caused by a transient network issue is overhead that doesn't produce value.

Write a postmortem when:

  • The incident lasted longer than 15 minutes
  • Customers were visibly affected (support tickets, status page update, social media mentions)
  • The incident involved a failure mode you haven't seen before
  • The root cause wasn't immediately obvious
  • Multiple teams were involved in the response
  • An SLA was breached or credits were issued

Skip the postmortem when:

  • The incident was under 5 minutes and auto-resolved
  • The cause and fix were immediately obvious and already documented
  • No customers were affected

When in doubt, write the postmortem. A 30-minute write-up that prevents a future 2-hour outage is time well spent.
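If you want the criteria applied consistently, for example in an incident bot, they can be encoded as a small check. The following is a minimal sketch, not a definitive implementation: the `Incident` fields and the `postmortem_reasons` function are hypothetical names chosen for illustration, and the thresholds simply mirror the lists above.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields; adapt them to whatever your incident tracker records.
    duration_minutes: float
    customers_affected: bool = False
    new_failure_mode: bool = False
    root_cause_obvious: bool = True
    teams_involved: int = 1
    sla_breached: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every 'write a postmortem' criterion the incident meets."""
    checks = [
        (incident.duration_minutes > 15, "lasted longer than 15 minutes"),
        (incident.customers_affected, "customers were visibly affected"),
        (incident.new_failure_mode, "failure mode not seen before"),
        (not incident.root_cause_obvious, "root cause was not immediately obvious"),
        (incident.teams_involved > 1, "multiple teams were involved"),
        (incident.sla_breached, "SLA breached or credits issued"),
    ]
    return [reason for triggered, reason in checks if triggered]


if __name__ == "__main__":
    blip = Incident(duration_minutes=2)
    outage = Incident(duration_minutes=21, customers_affected=True)
    print(postmortem_reasons(blip))
    print(postmortem_reasons(outage))
```

An empty list corresponds to the "skip" case. The "when in doubt" rule stays a human judgment call that this sketch doesn't try to capture.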

The Postmortem Template

Here's the template. Copy it, fill it in, and share it with your team.


Incident Summary

Field | Value
Incident title | Clear, descriptive name - e.g., "API 5xx errors due to database connection pool exhaustion"
Date | When it happened
Duration | Total time from first failure to confirmed recovery
Severity | Critical / Major / Minor
Detection method | How was it discovered? Monitoring alert, customer report, internal observation
Time to detect | Minutes from first failure to first alert
Services affected | List all affected services, endpoints, or features
Customer impact | Number of users affected, error rates, revenue impact if known
Incident commander | Who led the response

Timeline

Document every significant event, in chronological order. Include timestamps.

Time (UTC) | Event
14:00 | Deployment of v2.4.1 to production
14:03 | Monitoring detects elevated 5xx error rate on /api/orders
14:04 | Alert fires in #incidents Slack channel
14:06 | On-call engineer acknowledges, begins investigation
14:12 | Root cause identified: new query in v2.4.1 missing an index, causing full table scans
14:14 | Decision to rollback to v2.4.0
14:18 | Rollback deployed
14:21 | Error rate returns to baseline, monitoring confirms recovery
14:22 | Incident resolved, recovery notification sent

Be precise. "Around 2 PM" isn't useful. "14:03 UTC" is. Your monitoring tool's incident timeline is the best source for accurate timestamps - don't rely on memory.
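Precise timestamps also let you derive the response metrics instead of estimating them. Here is a minimal sketch using the example timeline above. The event names and the `minutes_between` helper are hypothetical, and the timestamps are assumed to fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the example timeline (UTC). Event keys are made up here.
TIMELINE = {
    "deployed": "14:00",
    "detected": "14:03",
    "alert_fired": "14:04",
    "acknowledged": "14:06",
    "root_cause_identified": "14:12",
    "rollback_decided": "14:14",
    "rollback_deployed": "14:18",
    "recovered": "14:21",
    "resolved": "14:22",
}


def minutes_between(start_event: str, end_event: str) -> int:
    """Whole minutes between two timeline events (same-day timestamps only)."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_event], fmt)
    end = datetime.strptime(TIMELINE[end_event], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print("deploy -> detection:", minutes_between("deployed", "detected"), "min")
    print("alert -> acknowledgement:", minutes_between("alert_fired", "acknowledged"), "min")
    print("decision -> rollback live:", minutes_between("rollback_decided", "rollback_deployed"), "min")
    print("deploy -> recovery:", minutes_between("deployed", "recovered"), "min")
```

The example timeline doesn't record the exact first failure, so the deployment time is used here as the earliest possible start. If your monitoring tool records the first failed check, use that instead.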

Root Cause

Explain the technical root cause. Be specific enough that another engineer could understand the failure without having been there.

Bad root cause: "The database was slow."

Good root cause: "Deployment v2.4.1 introduced a new query on the orders table that filtered by customer_id without an index. Under production load (~2,000 queries/min), this caused full table scans that exhausted the database connection pool within 3 minutes. Subsequent requests to any endpoint using the primary database connection returned 503 errors."

The root cause should answer: what specifically broke, why it broke, and why existing safeguards didn't prevent it.
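A good root cause is also one that another engineer can reproduce. The sketch below illustrates the failure mode from the example, a filter on customer_id without an index, using an in-memory SQLite database as a stand-in. The production database in the example isn't specified, so the schema, the row counts, and the index name idx_orders_customer_id are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, 9.99) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(connection: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the human-readable part of SQLite's plan for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the plan description.
    return " | ".join(row[3] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_without_index = query_plan(conn, QUERY, (42,))

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_with_index = query_plan(conn, QUERY, (42,))

if __name__ == "__main__":
    print("before:", plan_without_index)
    print("after: ", plan_with_index)
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of orders and the second a search using the new index. Other databases expose the same information through their own EXPLAIN output.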

Contributing Factors

Root cause is the direct trigger. Contributing factors are the conditions that allowed it to become an incident:

  • The query wasn't caught in code review because the orders table is small in staging (500 rows vs. 4 million in production)
  • No automated query performance testing in the CI pipeline
  • The database connection pool was sized for normal load with no headroom for query degradation
  • Deployment happened at 14:00 (peak traffic) instead of during a low-traffic window

Contributing factors are where the most valuable action items come from. Fixing the root cause prevents this exact incident. Fixing contributing factors prevents entire categories of incidents.
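The connection pool factor can be made concrete with a rough capacity estimate. The sketch below applies Little's Law (average concurrency = arrival rate x time per request) to the numbers from the example: roughly 2,000 queries/min and a pool of 20 connections, later raised to 50. It assumes each query holds exactly one connection for its full duration and that all of that load goes through one pool. The 50 ms healthy latency is a made-up figure.

```python
def connections_in_use(queries_per_minute: float, avg_query_seconds: float) -> float:
    """Little's Law: average busy connections = arrival rate x time per query."""
    return (queries_per_minute / 60.0) * avg_query_seconds


def max_sustainable_latency(pool_size: int, queries_per_minute: float) -> float:
    """Average query time (seconds) at which the pool is fully occupied."""
    return pool_size / (queries_per_minute / 60.0)


QUERIES_PER_MINUTE = 2000  # from the root cause example
POOL_SIZE_BEFORE = 20      # from the action items table
POOL_SIZE_AFTER = 50       # from the action items table
HEALTHY_QUERY_SECONDS = 0.05  # hypothetical healthy latency

if __name__ == "__main__":
    busy = connections_in_use(QUERIES_PER_MINUTE, HEALTHY_QUERY_SECONDS)
    print(f"busy connections when healthy: {busy:.1f}")
    for size in (POOL_SIZE_BEFORE, POOL_SIZE_AFTER):
        limit = max_sustainable_latency(size, QUERIES_PER_MINUTE)
        print(f"pool of {size} saturates at about {limit:.2f}s per query")
```

Under these assumptions a pool of 20 saturates once the average query takes about 0.6 seconds, and a pool of 50 at about 1.5 seconds. A larger pool buys headroom, but a full table scan on millions of rows can still exceed it, which is why the index and the CI check matter more than the pool size. This is a steady-state average, so bursts need extra margin.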

What Went Well

Every incident response has things that worked. Documenting them reinforces good practices and helps the team see that incident response isn't all failure.

Examples:

  • Monitoring detected the issue within 3 minutes of deployment
  • The on-call engineer acknowledged the alert within 2 minutes
  • The team made the rollback decision quickly instead of debugging in production
  • The status page updated automatically, reducing customer support tickets by an estimated 60%
  • The rollback procedure was documented and worked on the first attempt

What Didn't Go Well

Be honest. This isn't about blame - it's about identifying weak points.

Examples:

  • The deployment wasn't flagged as high-risk despite touching a high-traffic query path
  • There was no pre-deployment performance check against production-scale data
  • The rollback took 4 minutes because the CI pipeline had to rebuild the previous version
  • Two engineers started investigating independently before coordinating, wasting 5 minutes of duplicate effort

Action Items

This is the most important section. Action items are the commitments that prevent this class of incident from recurring. Every action item needs an owner and a deadline - otherwise it becomes a wish list that nobody follows up on.

Action Item | Owner | Deadline | Status
Add index on orders.customer_id | @backend-lead | Jun 25 | ✅ Done
Add query performance testing to CI pipeline | @platform-eng | Jul 15 | 🔲 Open
Increase database connection pool from 20 to 50 | @infra | Jun 23 | ✅ Done
Move high-risk deployments to low-traffic windows (before 10 AM) | @eng-manager | Ongoing | 🔲 Open
Add slow-query alerting (> 500ms) to database monitoring | @platform-eng | Jul 1 | 🔲 Open
Pre-build rollback artifacts so rollbacks don't require a CI build | @platform-eng | Jul 30 | 🔲 Open

Good action items are:

  • Specific (not "improve database performance")
  • Achievable (not "eliminate all database-related incidents")
  • Measurable (you can verify whether it was done)
  • Assigned to a named person, not a team (the handles in the example table are placeholders; put an individual's name on each item)

Bad action items are:

  • "Be more careful" (not actionable)
  • "Test more" (not specific)
  • "Improve monitoring" (not measurable)
  • Unassigned (nobody owns it, nobody does it)
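These rules are simple enough to check automatically, for example before the postmortem is published or during the follow-up review. The following is a minimal sketch under stated assumptions: the `ActionItem` class, the `review` function, and the list of vague phrases are hypothetical, and the year in the usage example is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Phrases taken from the "bad action items" list above.
VAGUE_PHRASES = ("be more careful", "test more", "improve monitoring")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    deadline: Optional[date]
    done: bool = False


def review(item: ActionItem, today: date) -> list[str]:
    """Return the problems found with one action item (empty list = fine)."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.deadline is None:
        problems.append("no deadline")
    elif not item.done and item.deadline < today:
        problems.append("overdue")
    if item.description.strip().lower() in VAGUE_PHRASES:
        problems.append("too vague")
    return problems


if __name__ == "__main__":
    today = date(2026, 7, 10)  # placeholder date for the example
    items = [
        ActionItem("Add index on orders.customer_id", "@backend-lead", date(2026, 6, 25), done=True),
        ActionItem("Add slow-query alerting (> 500ms)", "@platform-eng", date(2026, 7, 1)),
        ActionItem("Test more", None, None),
    ]
    for item in items:
        print(item.description, "->", review(item, today) or "ok")
```

A check like this can only catch missing fields and known-bad wording. Whether an item is specific and achievable still needs a human reviewer.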

Running the Postmortem Meeting

Schedule It Within 48 Hours

The longer you wait, the less accurate the timeline becomes. Details fade, context is lost, and the urgency to prevent a recurrence goes with them. Aim to hold the postmortem review within 1–2 business days of the incident.

Keep It Blameless

Blameless doesn't mean consequence-free. It means focusing on systems and processes rather than individual mistakes.

Blameless: "The deployment pipeline didn't include a performance regression check, so the slow query reached production."

Blame-ful: "The engineer who wrote the query should have known it would be slow."

The first framing leads to a systemic fix (add performance testing). The second leads to engineers being afraid to deploy - which is worse for reliability than the original incident.

If people fear punishment, they'll hide mistakes. If they feel safe, they'll surface problems early. Blameless postmortems are a reliability investment, not a cultural nicety.

Invite the Right People

  • Everyone directly involved in the incident response
  • The engineer who made the change that triggered the incident (they have the most context)
  • The on-call engineer who responded
  • A representative from any other affected team
  • Optionally: engineering leadership (to observe, not to direct)

Keep the group small enough for productive discussion (4–8 people). Larger incidents can have a broader read-out afterward.

Follow a Structure

  1. Review the timeline (10 min) - Walk through what happened chronologically. Fill in gaps, correct timestamps.
  2. Discuss root cause and contributing factors (15 min) - Agree on why it happened. Challenge assumptions.
  3. Discuss what went well (5 min) - Reinforce good practices.
  4. Discuss what didn't go well (10 min) - Identify gaps without assigning blame.
  5. Define action items (15 min) - Assign owners and deadlines. Prioritize by impact.
  6. Schedule follow-up (5 min) - When will action items be reviewed?

Total: about 60 minutes. If the incident was minor, 30 minutes is enough. If it was a major outage, allow 90 minutes.

Common Postmortem Mistakes

Writing It and Forgetting It

The postmortem document isn't the deliverable - the action items are. If nobody follows up on action items, the postmortem was a waste of time. Schedule a follow-up review 2–4 weeks later to check completion status.

Stopping at the Obvious Root Cause

"The database was overloaded" is a symptom, not a root cause. Keep asking "why" until you reach the systemic issue:

  • Why was the database overloaded? → A slow query was introduced.
  • Why wasn't the slow query caught? → No performance testing in CI.
  • Why is there no performance testing? → Nobody has built it yet.
  • Action item: Build query performance testing into the CI pipeline.

The first "why" gives you a patch. The last "why", whether it's the third or the fifth, gives you a prevention.

Making It a Blame Session

The moment someone says "who deployed this?" in an accusatory tone, the postmortem stops being productive. People get defensive, information stops flowing, and the real systemic issues go unaddressed. The facilitator's job is to redirect from "who" to "why" and "how."

Too Many Action Items

A postmortem with 15 action items will complete 2 of them. Prioritize ruthlessly. Three high-impact action items that get done are worth more than fifteen that sit in a backlog. Focus on the items that prevent the broadest class of incidents, not just this specific one.

No Severity Calibration

Writing a 3-page postmortem for a 5-minute blip wastes time. A 2-sentence summary for a 4-hour customer-facing outage wastes the learning opportunity. Match the depth of the postmortem to the severity of the incident.

Using Incident Data to Write Better Postmortems

The hardest part of a postmortem is reconstructing an accurate timeline. Memory is unreliable during incidents - stress compresses time, and people remember the order of events differently.

Your monitoring tool is the source of truth. The incident record should include:

  • Exact start time - When the first check failed
  • Detection time - How long between first failure and first alert
  • Affected regions - Was it global or regional?
  • Duration - Precise time from start to confirmed recovery
  • Response time trends - Was performance degrading before the outage? This shows whether the incident was sudden or a gradual decline that could have been caught earlier.
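The response time question in the list above can be answered with a simple comparison of the samples just before the outage against an earlier baseline. This is a rough sketch, not how any particular monitoring tool does it: the function name, window sizes, and the 1.5x threshold are all assumptions you would tune to your own data.

```python
from statistics import median


def degraded_before_outage(
    response_times_ms: list[float],
    baseline_window: int = 30,
    recent_window: int = 5,
    threshold: float = 1.5,
) -> bool:
    """Compare the last samples before the outage with the preceding baseline.

    `response_times_ms` holds samples in chronological order, ending at the
    last successful check before the first failure.
    """
    needed = baseline_window + recent_window
    if len(response_times_ms) < needed:
        raise ValueError(f"need at least {needed} samples")
    recent = response_times_ms[-recent_window:]
    baseline = response_times_ms[-needed:-recent_window]
    # Medians keep a single slow outlier from skewing either window.
    return median(recent) > threshold * median(baseline)


if __name__ == "__main__":
    sudden = [100.0] * 35
    gradual = [100.0] * 30 + [150.0, 180.0, 220.0, 300.0, 450.0]
    print("sudden failure degraded first?", degraded_before_outage(sudden))
    print("gradual decline degraded first?", degraded_before_outage(gradual))
```

If the check comes back true, that's a hint that a latency alert could have fired before the outage did, which is worth turning into an action item.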

Vantaj logs every incident with an automatic timeline: when it started, which regions were affected, when the alert fired, and when the service recovered. This gives your postmortem an accurate, timestamped foundation - no guesswork, no conflicting recollections.

The Postmortem Habit

Teams that write postmortems consistently tend to see their mean time to recovery (MTTR) decrease over time. Not because each postmortem is revolutionary, but because the cumulative effect of dozens of small improvements - better runbooks, faster rollbacks, tighter monitoring, improved alerting - compounds into a fundamentally more resilient system.

The best time to write your first postmortem was after your last incident. The second best time is after the next one. Start with the template above, keep it blameless, follow up on the action items, and make it a habit. Your future on-call team will thank you.