[{"data":1,"prerenderedAt":841},["ShallowReactive",2],{"\u002Fblog\u002Fincident-postmortem-template":3},{"id":4,"title":5,"author":6,"body":8,"category":829,"date":830,"description":831,"extension":832,"image":833,"lastUpdated":833,"meta":834,"navigation":835,"path":836,"readingTime":837,"seo":838,"stem":839,"__hash__":840},"blog\u002Fblog\u002Fincident-postmortem-template.md","How to Write an Incident Postmortem (With Template)",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":798},"minimark",[11,16,20,23,26,30,33,39,61,66,77,80,84,87,90,95,224,228,231,321,324,328,331,337,351,354,358,361,378,381,385,388,391,408,412,415,417,431,435,438,539,544,558,563,577,579,583,587,590,594,597,603,609,612,615,619,636,639,643,682,685,689,693,696,700,703,720,723,727,730,734,737,741,744,748,751,754,785,788,792,795],[12,13,15],"h2",{"id":14},"the-outage-is-over-now-what","The Outage Is Over. Now What?",[17,18,19],"p",{},"The service is back up. The alert is resolved. The Slack channel has gone quiet. Everyone goes back to what they were doing - and three weeks later, the same failure happens for the same reason.",[17,21,22],{},"This is what happens without a postmortem. The team fixes the symptom but never addresses the cause. The knowledge about what went wrong lives in one engineer's head, and when they're on vacation during the next incident, the team starts from zero.",[17,24,25],{},"A postmortem is the practice of documenting what happened, why it happened, and what you're going to do to prevent it from happening again. It's not a blame exercise. It's not a formality. It's the single most effective way to turn an outage into a lasting improvement.",[12,27,29],{"id":28},"when-to-write-a-postmortem","When to Write a Postmortem",[17,31,32],{},"Not every incident needs a full postmortem. 
Writing one for a 2-minute blip caused by a transient network issue is overhead that doesn't produce value.",[17,34,35],{},[36,37,38],"strong",{},"Write a postmortem when:",[40,41,42,46,49,52,55,58],"ul",{},[43,44,45],"li",{},"The incident lasted longer than 15 minutes",[43,47,48],{},"Customers were visibly affected (support tickets, status page update, social media mentions)",[43,50,51],{},"The incident involved a failure mode you haven't seen before",[43,53,54],{},"The root cause wasn't immediately obvious",[43,56,57],{},"Multiple teams were involved in the response",[43,59,60],{},"An SLA was breached or credits were issued",[17,62,63],{},[36,64,65],{},"Skip the postmortem when:",[40,67,68,71,74],{},[43,69,70],{},"The incident was under 5 minutes and auto-resolved",[43,72,73],{},"The cause and fix were immediately obvious and already documented",[43,75,76],{},"No customers were affected",[17,78,79],{},"When in doubt, write the postmortem. A 30-minute write-up that prevents a future 2-hour outage is time well spent.",[12,81,83],{"id":82},"the-postmortem-template","The Postmortem Template",[17,85,86],{},"Here's the template. 
Copy it, fill it in, and share it with your team.",[88,89],"hr",{},[91,92,94],"h3",{"id":93},"incident-summary","Incident Summary",[96,97,98,111],"table",{},[99,100,101],"thead",{},[102,103,104,108],"tr",{},[105,106,107],"th",{},"Field",[105,109,110],{},"Value",[112,113,114,128,140,152,164,176,188,200,212],"tbody",{},[102,115,116,122],{},[117,118,119],"td",{},[36,120,121],{},"Incident title",[117,123,124],{},[125,126,127],"span",{},"Clear, descriptive name - e.g., \"API 5xx errors due to database connection pool exhaustion\"",[102,129,130,135],{},[117,131,132],{},[36,133,134],{},"Date",[117,136,137],{},[125,138,139],{},"When it happened",[102,141,142,147],{},[117,143,144],{},[36,145,146],{},"Duration",[117,148,149],{},[125,150,151],{},"Total time from first failure to confirmed recovery",[102,153,154,159],{},[117,155,156],{},[36,157,158],{},"Severity",[117,160,161],{},[125,162,163],{},"Critical \u002F Major \u002F Minor",[102,165,166,171],{},[117,167,168],{},[36,169,170],{},"Detection method",[117,172,173],{},[125,174,175],{},"How was it discovered? Monitoring alert, customer report, internal observation",[102,177,178,183],{},[117,179,180],{},[36,181,182],{},"Time to detect",[117,184,185],{},[125,186,187],{},"Minutes from first failure to first alert",[102,189,190,195],{},[117,191,192],{},[36,193,194],{},"Services affected",[117,196,197],{},[125,198,199],{},"List all affected services, endpoints, or features",[102,201,202,207],{},[117,203,204],{},[36,205,206],{},"Customer impact",[117,208,209],{},[125,210,211],{},"Number of users affected, error rates, revenue impact if known",[102,213,214,219],{},[117,215,216],{},[36,217,218],{},"Incident commander",[117,220,221],{},[125,222,223],{},"Who led the response",[91,225,227],{"id":226},"timeline","Timeline",[17,229,230],{},"Document every significant event, in chronological order. 
Include timestamps.",[96,232,233,243],{},[99,234,235],{},[102,236,237,240],{},[105,238,239],{},"Time (UTC)",[105,241,242],{},"Event",[112,244,245,253,265,273,281,289,297,305,313],{},[102,246,247,250],{},[117,248,249],{},"14:00",[117,251,252],{},"Deployment of v2.4.1 to production",[102,254,255,258],{},[117,256,257],{},"14:03",[117,259,260,261],{},"Monitoring detects elevated 5xx error rate on ",[262,263,264],"code",{},"\u002Fapi\u002Forders",[102,266,267,270],{},[117,268,269],{},"14:04",[117,271,272],{},"Alert fires in #incidents Slack channel",[102,274,275,278],{},[117,276,277],{},"14:06",[117,279,280],{},"On-call engineer acknowledges, begins investigation",[102,282,283,286],{},[117,284,285],{},"14:12",[117,287,288],{},"Root cause identified: new query in v2.4.1 missing an index, causing full table scans",[102,290,291,294],{},[117,292,293],{},"14:14",[117,295,296],{},"Decision to roll back to v2.4.0",[102,298,299,302],{},[117,300,301],{},"14:18",[117,303,304],{},"Rollback deployed",[102,306,307,310],{},[117,308,309],{},"14:21",[117,311,312],{},"Error rate returns to baseline, monitoring confirms recovery",[102,314,315,318],{},[117,316,317],{},"14:22",[117,319,320],{},"Incident resolved, recovery notification sent",[17,322,323],{},"Be precise. \"Around 2 PM\" isn't useful. \"14:03 UTC\" is. Your monitoring tool's incident timeline is the best source for accurate timestamps - don't rely on memory.",[91,325,327],{"id":326},"root-cause","Root Cause",[17,329,330],{},"Explain the technical root cause. Be specific enough that another engineer could understand the failure without having been there.",[17,332,333,336],{},[36,334,335],{},"Bad root cause:"," \"The database was slow.\"",[17,338,339,342,343,346,347,350],{},[36,340,341],{},"Good root cause:"," \"Deployment v2.4.1 introduced a new query on the ",[262,344,345],{},"orders"," table that filtered by ",[262,348,349],{},"customer_id"," without an index. 
Under production load (~2,000 queries\u002Fmin), this caused full table scans that exhausted the database connection pool within 3 minutes. Subsequent requests to any endpoint using the primary database connection returned 503 errors.\"",[17,352,353],{},"The root cause should answer: what specifically broke, why it broke, and why existing safeguards didn't prevent it.",[91,355,357],{"id":356},"contributing-factors","Contributing Factors",[17,359,360],{},"Root cause is the direct trigger. Contributing factors are the conditions that allowed it to become an incident:",[40,362,363,369,372,375],{},[43,364,365,366,368],{},"The query wasn't caught in code review because the ",[262,367,345],{}," table is small in staging (500 rows vs. 4 million in production)",[43,370,371],{},"No automated query performance testing in the CI pipeline",[43,373,374],{},"The database connection pool was sized for normal load with no headroom for query degradation",[43,376,377],{},"Deployment happened at 14:00 (peak traffic) instead of during a low-traffic window",[17,379,380],{},"Contributing factors are where the most valuable action items come from. Fixing the root cause prevents this exact incident. Fixing contributing factors prevents entire categories of incidents.",[91,382,384],{"id":383},"what-went-well","What Went Well",[17,386,387],{},"Every incident response has things that worked. 
Documenting them reinforces good practices and helps the team see that incident response isn't all failure.",[17,389,390],{},"Examples:",[40,392,393,396,399,402,405],{},[43,394,395],{},"Monitoring detected the issue within 3 minutes of deployment",[43,397,398],{},"The on-call engineer acknowledged the alert within 2 minutes",[43,400,401],{},"The team made the rollback decision quickly instead of debugging in production",[43,403,404],{},"The status page updated automatically, reducing customer support tickets by an estimated 60%",[43,406,407],{},"The rollback procedure was documented and worked on the first attempt",[91,409,411],{"id":410},"what-didnt-go-well","What Didn't Go Well",[17,413,414],{},"Be honest. This isn't about blame - it's about identifying weak points.",[17,416,390],{},[40,418,419,422,425,428],{},[43,420,421],{},"The deployment wasn't flagged as high-risk despite touching a high-traffic query path",[43,423,424],{},"There was no pre-deployment performance check against production-scale data",[43,426,427],{},"The rollback took 4 minutes because the CI pipeline had to rebuild the previous version",[43,429,430],{},"Two engineers started investigating independently before coordinating, wasting 5 minutes of duplicate effort",[91,432,434],{"id":433},"action-items","Action Items",[17,436,437],{},"This is the most important section. Action items are the commitments that prevent this class of incident from recurring. 
Every action item needs an owner and a deadline - otherwise it becomes a wish list that nobody follows up on.",[96,439,440,456],{},[99,441,442],{},[102,443,444,447,450,453],{},[105,445,446],{},"Action Item",[105,448,449],{},"Owner",[105,451,452],{},"Deadline",[105,454,455],{},"Status",[112,457,458,475,489,502,515,527],{},[102,459,460,466,469,472],{},[117,461,462,463],{},"Add index on ",[262,464,465],{},"orders.customer_id",[117,467,468],{},"@backend-lead",[117,470,471],{},"Jun 25",[117,473,474],{},"✅ Done",[102,476,477,480,483,486],{},[117,478,479],{},"Add query performance testing to CI pipeline",[117,481,482],{},"@platform-eng",[117,484,485],{},"Jul 15",[117,487,488],{},"🔲 Open",[102,490,491,494,497,500],{},[117,492,493],{},"Increase database connection pool from 20 to 50",[117,495,496],{},"@infra",[117,498,499],{},"Jun 23",[117,501,474],{},[102,503,504,507,510,513],{},[117,505,506],{},"Move high-risk deployments to low-traffic windows (before 10 AM)",[117,508,509],{},"@eng-manager",[117,511,512],{},"Ongoing",[117,514,488],{},[102,516,517,520,522,525],{},[117,518,519],{},"Add slow-query alerting (> 500ms) to database monitoring",[117,521,482],{},[117,523,524],{},"Jul 1",[117,526,488],{},[102,528,529,532,534,537],{},[117,530,531],{},"Pre-build rollback artifacts so rollbacks don't require a CI build",[117,533,482],{},[117,535,536],{},"Jul 30",[117,538,488],{},[17,540,541],{},[36,542,543],{},"Good action items are:",[40,545,546,549,552,555],{},[43,547,548],{},"Specific (not \"improve database performance\")",[43,550,551],{},"Achievable (not \"eliminate all database-related incidents\")",[43,553,554],{},"Measurable (you can verify whether it was done)",[43,556,557],{},"Assigned to a person, not a team",[17,559,560],{},[36,561,562],{},"Bad action items are:",[40,564,565,568,571,574],{},[43,566,567],{},"\"Be more careful\" (not actionable)",[43,569,570],{},"\"Test more\" (not specific)",[43,572,573],{},"\"Improve monitoring\" (not 
measurable)",[43,575,576],{},"Unassigned (nobody owns it, nobody does it)",[88,578],{},[12,580,582],{"id":581},"running-the-postmortem-meeting","Running the Postmortem Meeting",[91,584,586],{"id":585},"schedule-it-within-48-hours","Schedule It Within 48 Hours",[17,588,589],{},"The longer you wait, the less accurate the timeline becomes. Details fade, context is lost, and the urgency to prevent recurrence fades with it. Aim to hold the postmortem review within 1–2 business days of the incident.",[91,591,593],{"id":592},"keep-it-blameless","Keep It Blameless",[17,595,596],{},"Blameless doesn't mean consequence-free. It means focusing on systems and processes rather than individual mistakes.",[17,598,599,602],{},[36,600,601],{},"Blameless:"," \"The deployment pipeline didn't include a performance regression check, so the slow query reached production.\"",[17,604,605,608],{},[36,606,607],{},"Blame-ful:"," \"The engineer who wrote the query should have known it would be slow.\"",[17,610,611],{},"The first framing leads to a systemic fix (add performance testing). The second leads to engineers being afraid to deploy - which is worse for reliability than the original incident.",[17,613,614],{},"If people fear punishment, they'll hide mistakes. If they feel safe, they'll surface problems early. Blameless postmortems are a reliability investment, not a cultural nicety.",[91,616,618],{"id":617},"invite-the-right-people","Invite the Right People",[40,620,621,624,627,630,633],{},[43,622,623],{},"Everyone directly involved in the incident response",[43,625,626],{},"The engineer who made the change that triggered the incident (they have the most context)",[43,628,629],{},"The on-call engineer who responded",[43,631,632],{},"A representative from any other affected team",[43,634,635],{},"Optionally: engineering leadership (to observe, not to direct)",[17,637,638],{},"Keep the group small enough for productive discussion (4–8 people). 
Larger incidents can have a broader read-out afterward.",[91,640,642],{"id":641},"follow-a-structure","Follow a Structure",[644,645,646,652,658,664,670,676],"ol",{},[43,647,648,651],{},[36,649,650],{},"Review the timeline"," (10 min) - Walk through what happened chronologically. Fill in gaps, correct timestamps.",[43,653,654,657],{},[36,655,656],{},"Discuss root cause and contributing factors"," (15 min) - Agree on why it happened. Challenge assumptions.",[43,659,660,663],{},[36,661,662],{},"Discuss what went well"," (5 min) - Reinforce good practices.",[43,665,666,669],{},[36,667,668],{},"Discuss what didn't go well"," (10 min) - Identify gaps without assigning blame.",[43,671,672,675],{},[36,673,674],{},"Define action items"," (15 min) - Assign owners and deadlines. Prioritize by impact.",[43,677,678,681],{},[36,679,680],{},"Schedule follow-up"," (5 min) - When will action items be reviewed?",[17,683,684],{},"Total: about 60 minutes. If the incident was minor, 30 minutes is enough. If it was a major outage, allow 90 minutes.",[12,686,688],{"id":687},"common-postmortem-mistakes","Common Postmortem Mistakes",[91,690,692],{"id":691},"writing-it-and-forgetting-it","Writing It and Forgetting It",[17,694,695],{},"The postmortem document isn't the deliverable - the action items are. If nobody follows up on action items, the postmortem was a waste of time. Schedule a follow-up review 2–4 weeks later to check completion status.",[91,697,699],{"id":698},"stopping-at-the-obvious-root-cause","Stopping at the Obvious Root Cause",[17,701,702],{},"\"The database was overloaded\" is a symptom, not a root cause. Keep asking \"why\" until you reach the systemic issue:",[40,704,705,708,711,714],{},[43,706,707],{},"Why was the database overloaded? → A slow query was introduced.",[43,709,710],{},"Why wasn't the slow query caught? → No performance testing in CI.",[43,712,713],{},"Why is there no performance testing? 
→ Nobody has built it yet.",[43,715,716,719],{},[36,717,718],{},"Action item:"," Build query performance testing into the CI pipeline.",[17,721,722],{},"The first \"why\" gives you a patch. The last \"why\" gives you a prevention.",[91,724,726],{"id":725},"making-it-a-blame-session","Making It a Blame Session",[17,728,729],{},"The moment someone says \"who deployed this?\" in an accusatory tone, the postmortem stops being productive. People get defensive, information stops flowing, and the real systemic issues go unaddressed. The facilitator's job is to redirect from \"who\" to \"why\" and \"how.\"",[91,731,733],{"id":732},"too-many-action-items","Too Many Action Items",[17,735,736],{},"A postmortem with 15 action items will see 2 of them completed. Prioritize ruthlessly. Three high-impact action items that get done are worth more than fifteen that sit in a backlog. Focus on the items that prevent the broadest class of incidents, not just this specific one.",[91,738,740],{"id":739},"no-severity-calibration","No Severity Calibration",[17,742,743],{},"Writing a 3-page postmortem for a 5-minute blip wastes time. A 2-sentence summary for a 4-hour customer-facing outage wastes the learning opportunity. Match the depth of the postmortem to the severity of the incident.",[12,745,747],{"id":746},"using-incident-data-to-write-better-postmortems","Using Incident Data to Write Better Postmortems",[17,749,750],{},"The hardest part of a postmortem is reconstructing an accurate timeline. Memory is unreliable during incidents - stress compresses time, and people remember the order of events differently.",[17,752,753],{},"Your monitoring tool is the source of truth. 
The incident record should include:",[40,755,756,762,768,774,779],{},[43,757,758,761],{},[36,759,760],{},"Exact start time"," - When the first check failed",[43,763,764,767],{},[36,765,766],{},"Detection time"," - How long between first failure and first alert",[43,769,770,773],{},[36,771,772],{},"Affected regions"," - Was it global or regional?",[43,775,776,778],{},[36,777,146],{}," - Precise time from start to confirmed recovery",[43,780,781,784],{},[36,782,783],{},"Response time trends"," - Was performance degrading before the outage? This shows whether the incident was sudden or a gradual decline that could have been caught earlier.",[17,786,787],{},"Vantaj logs every incident with an automatic timeline: when it started, which regions were affected, when the alert fired, and when the service recovered. This gives your postmortem an accurate, timestamped foundation - no guesswork, no conflicting recollections.",[12,789,791],{"id":790},"the-postmortem-habit","The Postmortem Habit",[17,793,794],{},"Teams that write postmortems consistently see their MTTR decrease over time. Not because each postmortem is revolutionary, but because the cumulative effect of dozens of small improvements - better runbooks, faster rollbacks, tighter monitoring, improved alerting - compounds into a fundamentally more resilient system.",[17,796,797],{},"The best time to write your first postmortem was after your last incident. The second best time is after the next one. Start with the template above, keep it blameless, follow up on the action items, and make it a habit. 
Your future on-call team will thank you.",{"title":799,"searchDepth":800,"depth":800,"links":801},"",2,[802,803,804,814,820,827,828],{"id":14,"depth":800,"text":15},{"id":28,"depth":800,"text":29},{"id":82,"depth":800,"text":83,"children":805},[806,808,809,810,811,812,813],{"id":93,"depth":807,"text":94},3,{"id":226,"depth":807,"text":227},{"id":326,"depth":807,"text":327},{"id":356,"depth":807,"text":357},{"id":383,"depth":807,"text":384},{"id":410,"depth":807,"text":411},{"id":433,"depth":807,"text":434},{"id":581,"depth":800,"text":582,"children":815},[816,817,818,819],{"id":585,"depth":807,"text":586},{"id":592,"depth":807,"text":593},{"id":617,"depth":807,"text":618},{"id":641,"depth":807,"text":642},{"id":687,"depth":800,"text":688,"children":821},[822,823,824,825,826],{"id":691,"depth":807,"text":692},{"id":698,"depth":807,"text":699},{"id":725,"depth":807,"text":726},{"id":732,"depth":807,"text":733},{"id":739,"depth":807,"text":740},{"id":746,"depth":800,"text":747},{"id":790,"depth":800,"text":791},"tutorials","2026-06-21","A postmortem that actually prevents the next outage. Here's a step-by-step guide with a ready-to-use template, real examples, and the mistakes that turn postmortems into wasted meetings.","md",null,{},true,"\u002Fblog\u002Fincident-postmortem-template",10,{"title":5,"description":831},"blog\u002Fincident-postmortem-template","ilz6cddk9q9i_nLvfmaPVZoMPaJsoQMOfp3_eSyXNVs",1782222710359]