[{"data":1,"prerenderedAt":636},["ShallowReactive",2],{"\u002Fblog\u002Fon-call-survival-guide":3},{"id":4,"title":5,"author":6,"body":8,"category":625,"date":626,"description":627,"extension":628,"image":629,"lastUpdated":629,"meta":630,"navigation":378,"path":631,"readingTime":632,"seo":633,"stem":634,"__hash__":635},"blog\u002Fblog\u002Fon-call-survival-guide.md","On-Call Survival Guide: From First Alert to Postmortem",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":583},"minimark",[11,15,18,21,24,29,32,37,40,44,47,58,65,69,78,82,85,89,92,94,98,101,105,108,111,115,118,122,125,129,132,136,139,143,146,162,164,168,171,175,178,181,185,268,271,273,277,281,295,298,302,308,310,314,317,321,327,330,334,348,352,355,358,360,364,367,423,426,453,455,459,462,468,482,485,512,515,517,521,524,530,532,537,544,548,552,555,559,562,566,569,573,576,580],[12,13,14],"p",{},"Most on-call guides cover what to set up. This one covers what to do when the alert fires.",[12,16,17],{},"The difference between a 15-minute incident and a 3-hour incident is usually not technical knowledge — it's process. Teams that recover quickly have a repeatable structure they follow under pressure. Teams that spiral have good intentions and no structure.",[12,19,20],{},"This guide covers the full incident arc: alert fires, you respond, you diagnose, you communicate, you fix it, you close it, and you make sure it doesn't happen again.",[22,23],"hr",{},[25,26,28],"h2",{"id":27},"the-first-5-minutes","The First 5 Minutes",[12,30,31],{},"The first five minutes are the most chaotic. Don't try to fix the problem yet. Contain the chaos so fixing becomes possible.",[33,34,36],"h3",{"id":35},"step-1-acknowledge-the-alert-30-seconds","Step 1: Acknowledge the alert (30 seconds)",[12,38,39],{},"Acknowledge in whichever tool fired the alert. This prevents duplicate response and signals to your team that someone is handling it. 
If you're in a rotation, the acknowledgment also records your response time for SLA purposes.",[33,41,43],{"id":42},"step-2-post-to-the-incident-channel-1-minute","Step 2: Post to the incident channel (1 minute)",[12,45,46],{},"Open your designated incident channel and post:",[48,49,54],"pre",{"className":50,"code":52,"language":53},[51],"language-text","🔴 Investigating: [service name] \u002F [alert name]\nIC: @yourname\nStatus page updated: [link]\n","text",[55,56,52],"code",{"__ignoreMap":57},"",[12,59,60,61,64],{},"Don't wait until you know more. Post now. Your team can see the incident is being handled. Create a dedicated channel (",[55,62,63],{},"#inc-2026-06-26-api-errors",") so the technical thread stays separate from the main engineering channel.",[33,66,68],{"id":67},"step-3-update-the-status-page-1-minute","Step 3: Update the status page (1 minute)",[12,70,71,72,77],{},"Post \"Investigating\" before you know the cause. Customers seeing an update within 3 minutes trust you more than customers who see nothing for 20 minutes, even with no new information. See ",[73,74,76],"a",{"href":75},"\u002Fblog\u002Fincident-communication-templates","incident communication templates"," for copy-ready status page text.",[33,79,81],{"id":80},"step-4-note-the-exact-incident-start-time-30-seconds","Step 4: Note the exact incident start time (30 seconds)",[12,83,84],{},"The timestamp when the alert fired (not when you acknowledged it) is the start time for your SLA calculations, your postmortem timeline, and your customer communication. Note it somewhere you won't lose it.",[33,86,88],{"id":87},"step-5-open-the-runbook-if-one-exists","Step 5: Open the runbook if one exists",[12,90,91],{},"If there's a runbook for this alert, follow it before doing anything else. Runbooks exist because someone solved this problem before. 
Trust the documented process before going off-script, even if you have a strong intuition about the cause.",[22,93],{},[25,95,97],{"id":96},"the-diagnosis-framework-dime","The Diagnosis Framework: DIME",[12,99,100],{},"When there's no runbook, work through this checklist in order. Most incidents have a cause in one of these four categories.",[33,102,104],{"id":103},"d-deployments","D — Deployments",[12,106,107],{},"What changed in the last 2 hours? A recent deployment is the most common cause of production incidents. Check your deployment log first, before looking anywhere else. The most expensive diagnostic mistakes happen when teams spend 30 minutes debugging application behavior when the cause is a config value that changed 90 minutes ago.",[12,109,110],{},"If a deployment is the likely cause: roll it back first, then verify, then investigate why the deployment caused the failure. Don't investigate while the incident is ongoing.",[33,112,114],{"id":113},"i-infrastructure","I — Infrastructure",[12,116,117],{},"Did anything change in the underlying infrastructure? Autoscaling events, database migrations, certificate rotations, new firewall rules, DNS changes. Cloud providers have their own status pages; check them in the first 10 minutes of any incident that touches their services.",[33,119,121],{"id":120},"m-metrics","M — Metrics",[12,123,124],{},"What do your metrics show? Look for the spike that correlates with the incident start time. Error rate, CPU, memory, database connections, request queue depth, external API latency. The metric that spiked at the exact time the first failure occurred is usually the signal you need.",[33,126,128],{"id":127},"e-external","E — External",[12,130,131],{},"Is a third-party dependency the root cause? Payment processor, email provider, CDN, authentication service, cloud infrastructure. Check vendor status pages before spending 30 minutes debugging your own code. 
Stripe, AWS, Cloudflare, and Twilio all have status pages; check them in the first 10 minutes of any incident that touches their services.",[33,133,135],{"id":134},"the-5-minute-rule","The 5-minute rule",[12,137,138],{},"Move to the next item on the DIME checklist if you've spent 5 minutes on a hypothesis without finding confirmation. Staring at the same logs longer doesn't produce new information. A systematic switch to the next category usually does.",[33,140,142],{"id":141},"when-to-escalate","When to escalate",[12,144,145],{},"Escalate after 15 minutes without identifying a root cause. The cost of waking someone up is lower than the cost of 45 more minutes of solo diagnosis. When you escalate:",[147,148,149,153,156,159],"ul",{},[150,151,152],"li",{},"State what you've ruled out (not just what you've tried)",[150,154,155],{},"Share exact error messages, not your interpretation of them",[150,157,158],{},"Include the relevant timestamps",[150,160,161],{},"State your current hypothesis, if any",[22,163],{},[25,165,167],{"id":166},"communication-during-the-incident","Communication During the Incident",[12,169,170],{},"Communication during an incident is a separate skill from debugging. The on-call engineer should not be doing both simultaneously. When a second person is available, assign one person to technical diagnosis and one person to communication.",[33,172,174],{"id":173},"the-15-minute-update-rule","The 15-minute update rule",[12,176,177],{},"Post a status update to the status page and to the internal incident channel every 15 minutes, without exception. Even when there's nothing new to report. \"Still investigating, no changes to report\" is a valid update. Silence is not.",[12,179,180],{},"Customers and stakeholders who see no update for 30 minutes assume the team is either not actively working on it or hiding something. 
Neither is a good impression to create during a production outage.",[33,182,184],{"id":183},"severity-levels","Severity levels",[186,187,188,208],"table",{},[189,190,191],"thead",{},[192,193,194,198,201,205],"tr",{},[195,196,197],"th",{},"Level",[195,199,200],{},"Description",[195,202,204],{"align":203},"center","Response time",[195,206,207],{},"External communication",[209,210,211,226,240,254],"tbody",{},[192,212,213,217,220,223],{},[214,215,216],"td",{},"P1",[214,218,219],{},"Full outage, all users affected",[214,221,222],{"align":203},"Immediate",[214,224,225],{},"Status page + customer email",[192,227,228,231,234,237],{},[214,229,230],{},"P2",[214,232,233],{},"Major feature broken, significant user impact",[214,235,236],{"align":203},"Within 5 min",[214,238,239],{},"Status page update",[192,241,242,245,248,251],{},[214,243,244],{},"P3",[214,246,247],{},"Minor feature broken, small user subset",[214,249,250],{"align":203},"Within 30 min",[214,252,253],{},"Status page if customer-visible",[192,255,256,259,262,265],{},[214,257,258],{},"P4",[214,260,261],{},"Internal tools, no customer impact",[214,263,264],{"align":203},"Business hours",[214,266,267],{},"Internal only",[12,269,270],{},"Classify at the start of the incident and adjust if the scope changes. 
Misclassifying a P1 as a P2 delays communication and escalation.",[22,272],{},[25,274,276],{"id":275},"closing-the-incident","Closing the Incident",[33,278,280],{"id":279},"before-declaring-resolved","Before declaring resolved",[147,282,283,286,289,292],{},[150,284,285],{},"Confirm the fix is deployed and has been running stably for at least 5 minutes",[150,287,288],{},"Verify error rate has returned to baseline — not just improved, but returned",[150,290,291],{},"Check from multiple regions if your monitoring supports it",[150,293,294],{},"Confirm no secondary failures have appeared",[12,296,297],{},"A false \"resolved\" declaration followed by a second failure is worse than staying in \"monitoring\" status longer. Users who see an outage end and then resume 10 minutes later lose significantly more trust than users who see an extended monitoring window.",[33,299,301],{"id":300},"after-declaring-resolved","After declaring resolved",[12,303,304,305,307],{},"Post the resolution to your status page and your incident channel. Within 2 hours, send a customer email for P1 and major P2 incidents. Use the templates in ",[73,306,76],{"href":75},".",[22,309],{},[25,311,313],{"id":312},"writing-runbooks-that-get-used","Writing Runbooks That Get Used",[12,315,316],{},"A runbook is only as useful as it is usable under pressure. At 3 AM with adrenaline running, a 5-page document is not useful. 
A short checklist with specific commands is.",[33,318,320],{"id":319},"the-minimal-runbook-structure","The minimal runbook structure",[48,322,325],{"className":323,"code":324,"language":53},[51],"# [Service Name] — [Alert Name]\n\nWhat this alert means: [1 sentence]\nEscalate to @name if not resolved in 15 minutes\n\n## Step 1: Check [X]\nCommand: [exact command or link]\nExpected output: [what normal looks like]\nIf abnormal: [next step or escalate]\n\n## Step 2: Check [Y]\nCommand: [exact command or link]\nExpected output: [what normal looks like]\nIf abnormal: [next step or escalate]\n\n## Step 3: Escalate\nPage: @name (primary), @name (backup)\nInclude: what you've ruled out, error messages, timestamps\n",[55,326,324],{"__ignoreMap":57},[12,328,329],{},"Three to five steps, exact commands, explicit escalation path. Everything else goes in a linked architecture doc, not in the runbook itself.",[33,331,333],{"id":332},"what-to-leave-out","What to leave out",[147,335,336,339,342,345],{},[150,337,338],{},"Background and history (link to a separate doc)",[150,340,341],{},"\"Check the obvious things\" — be specific about which things",[150,343,344],{},"Long prose explanations — use numbered steps and commands",[150,346,347],{},"Why a decision was made — save that for the postmortem",[33,349,351],{"id":350},"which-alerts-need-runbooks","Which alerts need runbooks",[12,353,354],{},"Every P1 and P2 alert. Any alert that has previously taken more than 20 minutes to investigate. Any alert where the on-call engineer needed to ask a colleague what to do.",[12,356,357],{},"If you've had 10 incidents in the past year with no runbooks, you've paid to figure out each one from scratch 10 times. 
Write the runbook once.",[22,359],{},[25,361,363],{"id":362},"the-on-call-setup-checklist","The On-Call Setup Checklist",[12,365,366],{},"Before your rotation starts:",[147,368,371,381,387,393,399,405,411,417],{"className":369},[370],"contains-task-list",[150,372,375,380],{"className":373},[374],"task-list-item",[376,377],"input",{"disabled":378,"type":379},true,"checkbox"," Alert routing reaches your phone, not just email",[150,382,384,386],{"className":383},[374],[376,385],{"disabled":378,"type":379}," Escalation path documented: who is paged if you don't respond in 5 minutes",[150,388,390,392],{"className":389},[374],[376,391],{"disabled":378,"type":379}," Monitoring dashboard bookmarked and accessible on mobile",[150,394,396,398],{"className":395},[374],[376,397],{"disabled":378,"type":379}," Status page access confirmed from your phone",[150,400,402,404],{"className":401},[374],[376,403],{"disabled":378,"type":379}," Runbooks for your top 5 alerts are accessible — not \"somewhere in the wiki\"",[150,406,408,410],{"className":407},[374],[376,409],{"disabled":378,"type":379}," Incident channel designated or ready to create",[150,412,414,416],{"className":413},[374],[376,415],{"disabled":378,"type":379}," Current deployments noted: anything shipped in the last 48 hours",[150,418,420,422],{"className":419},[374],[376,421],{"disabled":378,"type":379}," Production access confirmed: you can run queries, restart services, SSH if needed",[12,424,425],{},"After your rotation ends:",[147,427,429,435,441,447],{"className":428},[370],[150,430,432,434],{"className":431},[374],[376,433],{"disabled":378,"type":379}," Open issues documented and handed to the next on-call",[150,436,438,440],{"className":437},[374],[376,439],{"disabled":378,"type":379}," Postmortems for incidents during your rotation completed or scheduled",[150,442,444,446],{"className":443},[374],[376,445],{"disabled":378,"type":379}," Missing runbooks added to 
backlog",[150,448,450,452],{"className":449},[374],[376,451],{"disabled":378,"type":379}," Alert threshold issues flagged for improvement",[22,454],{},[25,456,458],{"id":457},"building-a-healthier-on-call-culture","Building a Healthier On-Call Culture",[12,460,461],{},"On-call is a tax on engineering teams. High-functioning teams minimize it. Low-functioning teams normalize it.",[12,463,464],{},[465,466,467],"strong",{},"Signs the on-call rotation is unsustainable:",[147,469,470,473,476,479],{},[150,471,472],{},"More than 2–3 pages per week per on-call engineer",[150,474,475],{},"Pages firing between midnight and 6 AM more than once a week",[150,477,478],{},"Engineers muting alert channels",[150,480,481],{},"Turnover correlated with on-call rotation",[12,483,484],{},"The fix is almost always the same four changes:",[486,487,488,494,500,506],"ol",{},[150,489,490,493],{},[465,491,492],{},"Audit alert thresholds."," Most teams have monitors set to alert on single failures from single locations. Requiring 2 consecutive failures from multiple regions eliminates the majority of false positives without meaningfully delaying real incident detection.",[150,495,496,499],{},[465,497,498],{},"Add multi-region consensus."," A single-region monitor that sees a transient routing issue fires an alert that wakes someone up for a problem that self-resolved in 30 seconds. Multi-region consensus means an alert only fires when multiple independent probe locations all confirm the failure.",[150,501,502,505],{},[465,503,504],{},"Write runbooks for the 5 most common alerts."," Teams that have runbooks recover faster and escalate less. The time investment in writing a runbook is paid back in the first incident that uses it.",[150,507,508,511],{},[465,509,510],{},"Hold blameless postmortems."," Teams that run postmortems fix systemic causes instead of just patching symptoms. 
The same incident type stops recurring.",[12,513,514],{},"The end state is an on-call rotation where real incidents are uncommon, detection is fast, runbooks exist for known failure patterns, and the engineer on call can sleep.",[22,516],{},[25,518,520],{"id":519},"quick-reference-card","Quick Reference Card",[12,522,523],{},"Keep this accessible during incidents.",[48,525,528],{"className":526,"code":527,"language":53},[51],"Alert fires\n    │\n    ├─ Acknowledge in alerting tool\n    ├─ Post to #incidents (template: 🔴 Investigating...)\n    ├─ Update status page (\"Investigating\")\n    ├─ Note incident start time\n    │\n    └─ Runbook exists for this alert?\n           │\n           ├─ YES: Follow it\n           │\n           └─ NO: DIME checklist\n                    │\n                    ├─ D: Recent deployment? → Roll back, then investigate\n                    ├─ I: Infrastructure change? → Revert if possible\n                    ├─ M: Metrics spike? → Correlate with incident start time\n                    └─ E: External dependency? 
→ Check vendor status page\n                    │\n                    └─ 15 min, no root cause?\n                           └─ Escalate: what you've ruled out +\n                              exact errors + timestamps + hypothesis\n",[55,529,527],{"__ignoreMap":57},[22,531],{},[12,533,534,535,307],{},"For status page update text, customer email templates, and Slack announcement copy, see ",[73,536,76],{"href":75},[12,538,539,540,307],{},"For writing the postmortem after the incident, see ",[73,541,543],{"href":542},"\u002Fblog\u002Fincident-postmortem-template","how to write an incident postmortem",[25,545,547],{"id":546},"frequently-asked-questions","Frequently Asked Questions",[33,549,551],{"id":550},"whats-the-difference-between-an-incident-commander-and-the-on-call-engineer","What's the difference between an incident commander and the on-call engineer?",[12,553,554],{},"The on-call engineer is paged first and does the initial diagnosis. The incident commander (IC) coordinates the response once more people are involved — tracking progress, managing communication, deciding on escalation. For small teams, one person often fills both roles. For larger incidents, separating them prevents the technical diagnosis from being interrupted by communication tasks.",[33,556,558],{"id":557},"how-long-should-i-try-to-diagnose-before-escalating","How long should I try to diagnose before escalating?",[12,560,561],{},"15 minutes. If you haven't identified the root cause in 15 minutes, the next set of eyes will almost always make a difference. The cost of waking someone up is real but bounded. The cost of a P1 running 45 minutes longer because you didn't want to wake anyone up is higher.",[33,563,565],{"id":564},"should-i-roll-back-a-deployment-before-fully-diagnosing-the-cause","Should I roll back a deployment before fully diagnosing the cause?",[12,567,568],{},"Yes, for P1 incidents. Roll back first, verify recovery, then investigate why the deployment caused the failure. 
The priority during an active incident is restoring service, not understanding the root cause. The postmortem is for understanding.",[33,570,572],{"id":571},"how-do-i-handle-an-incident-ive-never-seen-before","How do I handle an incident I've never seen before?",[12,574,575],{},"Work through DIME systematically. If you've exhausted DIME without a hypothesis, escalate with everything you've found. \"I've ruled out recent deployments, infrastructure changes, and our external dependencies. Metrics show a sharp increase in database connection errors starting at 14:32. I don't have a hypothesis yet\" is a complete escalation message.",[33,577,579],{"id":578},"what-makes-on-call-sustainable-long-term","What makes on-call sustainable long-term?",[12,581,582],{},"Low false positive rate (under 1 per week), runbooks for the most common alerts, fast detection that limits incident duration, and blameless postmortems that prevent recurrence. Teams that address these four things rarely have retention problems from 
on-call.",{"title":57,"searchDepth":584,"depth":584,"links":585},2,[586,594,602,606,610,615,616,617,618],{"id":27,"depth":584,"text":28,"children":587},[588,590,591,592,593],{"id":35,"depth":589,"text":36},3,{"id":42,"depth":589,"text":43},{"id":67,"depth":589,"text":68},{"id":80,"depth":589,"text":81},{"id":87,"depth":589,"text":88},{"id":96,"depth":584,"text":97,"children":595},[596,597,598,599,600,601],{"id":103,"depth":589,"text":104},{"id":113,"depth":589,"text":114},{"id":120,"depth":589,"text":121},{"id":127,"depth":589,"text":128},{"id":134,"depth":589,"text":135},{"id":141,"depth":589,"text":142},{"id":166,"depth":584,"text":167,"children":603},[604,605],{"id":173,"depth":589,"text":174},{"id":183,"depth":589,"text":184},{"id":275,"depth":584,"text":276,"children":607},[608,609],{"id":279,"depth":589,"text":280},{"id":300,"depth":589,"text":301},{"id":312,"depth":584,"text":313,"children":611},[612,613,614],{"id":319,"depth":589,"text":320},{"id":332,"depth":589,"text":333},{"id":350,"depth":589,"text":351},{"id":362,"depth":584,"text":363},{"id":457,"depth":584,"text":458},{"id":519,"depth":584,"text":520},{"id":546,"depth":584,"text":547,"children":619},[620,621,622,623,624],{"id":550,"depth":589,"text":551},{"id":557,"depth":589,"text":558},{"id":564,"depth":589,"text":565},{"id":571,"depth":589,"text":572},{"id":578,"depth":589,"text":579},"infrastructure","2026-06-26","A practical on-call guide for engineering teams: the first 5 minutes of an incident, the DIME diagnosis checklist, communication cadence, runbook structure, and blameless postmortems.","md",null,{},"\u002Fblog\u002Fon-call-survival-guide",11,{"title":5,"description":627},"blog\u002Fon-call-survival-guide","oWYFRHtHESix6ursAMLxIje-rROEBByS2Anild8jJKs",1782464113660]