[{"data":1,"prerenderedAt":842},["ShallowReactive",2],{"\u002Fblog\u002Fwebsite-outage-response-runbook":3},{"id":4,"title":5,"author":6,"body":8,"category":831,"date":832,"description":833,"extension":834,"image":835,"lastUpdated":835,"meta":836,"navigation":87,"path":837,"readingTime":838,"seo":839,"stem":840,"__hash__":841},"blog\u002Fblog\u002Fwebsite-outage-response-runbook.md","Website Outage Response Runbook: What to Do in the First 60 Minutes",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":814},"minimark",[11,15,18,21,26,29,52,55,58,62,67,72,75,116,119,121,125,128,184,187,189,193,216,226,229,231,235,238,262,265,267,271,274,277,282,310,315,330,335,356,370,375,396,401,422,484,489,510,512,516,519,522,549,552,567,570,572,576,579,600,603,605,609,612,627,658,682,684,688,691,699,702,710,719,721,725,728,734,737,739,743,776,780,810],[12,13,14],"p",{},"When your website goes down, the first 10 minutes are chaotic. People flood Slack. Someone starts investigating. Someone else starts a different investigation. Nobody has told customers anything. Nobody has posted to the status page. The CEO is DMing the on-call engineer.",[12,16,17],{},"A runbook stops the chaos before it starts. 
It replaces \"what do we do?\" with a sequence your team follows every time, regardless of who is on call.",[12,19,20],{},"This is a copy-ready runbook template for the first 60 minutes of a production outage.",[22,23,25],"h2",{"id":24},"before-you-need-this-prerequisites","Before You Need This: Prerequisites",[12,27,28],{},"This runbook assumes three things are in place:",[30,31,32,40,46],"ol",{},[33,34,35,39],"li",{},[36,37,38],"strong",{},"Uptime monitoring with alerting"," - you know about the outage from your monitoring tool, not from a customer tweet",[33,41,42,45],{},[36,43,44],{},"A status page"," - a public URL where customers check service status",[33,47,48,51],{},[36,49,50],{},"Defined severity levels"," - at minimum, a distinction between \"all users affected\" and \"some users affected\"",[12,53,54],{},"If any of these are missing, set them up before the next incident. Monitoring takes under 5 minutes to set up in Vantaj. Not having it is the single most expensive preparation gap.",[56,57],"hr",{},[22,59,61],{"id":60},"the-runbook","The Runbook",[63,64,66],"h3",{"id":65},"t0-alert-fires","T+0: Alert Fires",[12,68,69],{},[36,70,71],{},"The alert lands in Slack (or via SMS, email, or phone call).",[12,73,74],{},"Do these three things in the first 2 minutes:",[76,77,80,90,110],"ul",{"className":78},[79],"contains-task-list",[33,81,84,89],{"className":82},[83],"task-list-item",[85,86],"input",{"disabled":87,"type":88},true,"checkbox"," Open the monitoring dashboard. Note: which service, which regions, what error",[33,91,93,95,96,100,101,109],{"className":92},[83],[85,94],{"disabled":87,"type":88}," Claim the incident: post in ",[97,98,99],"code",{},"#incidents",": ",[36,102,103,104,108],{},"\"I'm on this. 
SEV-",[105,106,107],"span",{},"1\u002F2\u002F3",".\""," This single message prevents the situation where three people each assume someone else is handling it",[33,111,113,115],{"className":112},[83],[85,114],{"disabled":87,"type":88}," Check if it is real: go directly to the affected URL. Confirm the error from your browser",[12,117,118],{},"If you cannot reproduce the error manually, check whether your monitoring uses multi-region consensus. If it does and it still fired, the outage is real. If it does not, you may have a false positive.",[56,120],{},[63,122,124],{"id":123},"t2-severity-classification","T+2: Severity Classification",[12,126,127],{},"Classify the incident. This determines the next steps.",[129,130,131,147],"table",{},[132,133,134],"thead",{},[135,136,137,141,144],"tr",{},[138,139,140],"th",{},"Severity",[138,142,143],{},"Criteria",[138,145,146],{},"Response",[148,149,150,162,173],"tbody",{},[135,151,152,156,159],{},[153,154,155],"td",{},"SEV-1",[153,157,158],{},"Full outage, all users affected, or data loss risk",[153,160,161],{},"Wake everyone. Status page immediately.",[135,163,164,167,170],{},[153,165,166],{},"SEV-2",[153,168,169],{},"Partial outage or degraded service, significant user impact",[153,171,172],{},"On-call investigates. Status page within 5 min.",[135,174,175,178,181],{},[153,176,177],{},"SEV-3",[153,179,180],{},"Minor issue, small subset of users affected",[153,182,183],{},"On-call investigates. Internal tracking only.",[12,185,186],{},"When in doubt, classify up. A SEV-1 that turns out to be a SEV-2 is fine. 
A SEV-2 that was actually a SEV-1 costs you 20 minutes of delayed communication with customers.",[56,188],{},[63,190,192],{"id":191},"t3-open-the-incident-channel","T+3: Open the Incident Channel",[76,194,196,210],{"className":195},[79],[33,197,199,201,202,205,206,209],{"className":198},[83],[85,200],{"disabled":87,"type":88}," Create a Slack channel: ",[97,203,204],{},"#inc-YYYYMMDD-[short-description]"," (e.g., ",[97,207,208],{},"#inc-20260628-api-down",")",[33,211,213,215],{"className":212},[83],[85,214],{"disabled":87,"type":88}," Post the incident brief in the channel:",[217,218,223],"pre",{"className":219,"code":221,"language":222},[220],"language-text","🔴 INCIDENT OPEN\n\nService: [which service or endpoint]\nImpact: [what users experience]\nSeverity: SEV-[1\u002F2\u002F3]\nIncident Commander: @you\nStarted: [time] UTC\nMonitoring link: [direct link to the failing monitor]\nStatus page: [link]\n\nAll incident discussion here only.\n","text",[97,224,221],{"__ignoreMap":225},"",[12,227,228],{},"This channel becomes the incident timeline. Everything that happens, every hypothesis tested, every change made - post it here as it happens. You will need this log for the postmortem.",[56,230],{},[63,232,234],{"id":233},"t5-update-the-status-page","T+5: Update the Status Page",[12,236,237],{},"For SEV-1 and SEV-2, update the status page before you know the cause. Post this:",[239,240,241,249,256],"blockquote",{},[12,242,243],{},[36,244,245,246],{},"Investigating - ",[105,247,248],{},"Service Name",[12,250,251,252,255],{},"We are investigating reports of ",[105,253,254],{},"service"," being unavailable. Engineers are actively working on this.",[12,257,258,259],{},"Next update: ",[105,260,261],{},"T+20 from now",[12,263,264],{},"Customers who see this update stop filing support tickets. Support ticket volume during an acknowledged incident drops by 60-80% compared to a silent outage. 
You get your investigation time back.",[56,266],{},[63,268,270],{"id":269},"t5-to-t30-diagnosis","T+5 to T+30: Diagnosis",[12,272,273],{},"Your goal in this window: identify the category of the problem. You do not need the root cause yet. You need enough to either restore service or escalate.",[12,275,276],{},"Work through this checklist in order. Move past any step you can rule out quickly.",[12,278,279],{},[36,280,281],{},"Check 1: Recent changes",[76,283,285,294,300],{"className":284},[79],[33,286,288,290,291],{"className":287},[83],[85,289],{"disabled":87,"type":88}," Was there a deployment in the last 30 minutes? ",[97,292,293],{},"git log --oneline -10",[33,295,297,299],{"className":296},[83],[85,298],{"disabled":87,"type":88}," Was there a config change, environment variable update, or infrastructure change?",[33,301,303,305,306,309],{"className":302},[83],[85,304],{"disabled":87,"type":88}," If yes to either: ",[36,307,308],{},"roll back first, investigate second",". Restoring service takes priority over understanding the cause.",[12,311,312],{},[36,313,314],{},"Check 2: External dependencies",[76,316,318,324],{"className":317},[79],[33,319,321,323],{"className":320},[83],[85,322],{"disabled":87,"type":88}," Check the status pages of your key dependencies: Stripe, Auth0, AWS, Cloudflare, Vercel, etc.",[33,325,327,329],{"className":326},[83],[85,328],{"disabled":87,"type":88}," If a dependency is down and you use it in the affected flow: that is your cause. 
Update the status page with the dependency name.",[12,331,332],{},[36,333,334],{},"Check 3: Server health",[76,336,338,344,350],{"className":337},[79],[33,339,341,343],{"className":340},[83],[85,342],{"disabled":87,"type":88}," CPU: is it pegged?",[33,345,347,349],{"className":346},[83],[85,348],{"disabled":87,"type":88}," Memory: is it near capacity?",[33,351,353,355],{"className":352},[83],[85,354],{"disabled":87,"type":88}," Disk: is it full?",[12,357,358,359,362,363,362,366,369],{},"For each: ",[97,360,361],{},"top",", ",[97,364,365],{},"free -h",[97,367,368],{},"df -h",". On cloud providers, check your dashboard for resource graphs with the spike visible.",[12,371,372],{},[36,373,374],{},"Check 4: Application errors",[76,376,378,384,390],{"className":377},[79],[33,379,381,383],{"className":380},[83],[85,382],{"disabled":87,"type":88}," Check application logs for the time the outage started",[33,385,387,389],{"className":386},[83],[85,388],{"disabled":87,"type":88}," Look for: stack traces, connection errors, timeout messages, OOM kills",[33,391,393,395],{"className":392},[83],[85,394],{"disabled":87,"type":88}," Check your error tracker (Sentry, Datadog, etc.) 
for new error types that appeared at the outage start time",[12,397,398],{},[36,399,400],{},"Check 5: Database",[76,402,404,410,416],{"className":403},[79],[33,405,407,409],{"className":406},[83],[85,408],{"disabled":87,"type":88}," Can the application connect to the database?",[33,411,413,415],{"className":412},[83],[85,414],{"disabled":87,"type":88}," Are there long-running queries blocking normal operations?",[33,417,419,421],{"className":418},[83],[85,420],{"disabled":87,"type":88}," Is the connection pool exhausted?",[217,423,427],{"className":424,"code":425,"language":426,"meta":225,"style":225},"language-bash shiki shiki-themes material-theme-lighter material-theme material-theme-palenight","# PostgreSQL connection count\npsql -c \"SELECT count(*) FROM pg_stat_activity;\"\n\n# Long-running queries\npsql -c \"SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;\"\n","bash",[97,428,429,437,458,464,470],{"__ignoreMap":225},[105,430,433],{"class":431,"line":432},"line",1,[105,434,436],{"class":435},"sHwdD","# PostgreSQL connection count\n",[105,438,440,444,448,452,455],{"class":431,"line":439},2,[105,441,443],{"class":442},"sBMFI","psql",[105,445,447],{"class":446},"sfazB"," -c",[105,449,451],{"class":450},"sMK4o"," \"",[105,453,454],{"class":446},"SELECT count(*) FROM pg_stat_activity;",[105,456,457],{"class":450},"\"\n",[105,459,461],{"class":431,"line":460},3,[105,462,463],{"emptyLinePlaceholder":87},"\n",[105,465,467],{"class":431,"line":466},4,[105,468,469],{"class":435},"# Long-running queries\n",[105,471,473,475,477,479,482],{"class":431,"line":472},5,[105,474,443],{"class":442},[105,476,447],{"class":446},[105,478,451],{"class":450},[105,480,481],{"class":446},"SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 
10;",[105,483,457],{"class":450},[12,485,486],{},[36,487,488],{},"Check 6: DNS and SSL",[76,490,492,501],{"className":491},[79],[33,493,495,497,498],{"className":494},[83],[85,496],{"disabled":87,"type":88}," Does the domain resolve? ",[97,499,500],{},"dig yourdomain.com",[33,502,504,506,507],{"className":503},[83],[85,505],{"disabled":87,"type":88}," Is the SSL certificate valid? ",[97,508,509],{},"echo | openssl s_client -connect yourdomain.com:443 2>\u002Fdev\u002Fnull | openssl x509 -noout -dates",[56,511],{},[63,513,515],{"id":514},"t20-first-status-update-if-unresolved","T+20: First Status Update (if unresolved)",[12,517,518],{},"Post an update to the status page even if you have not resolved the issue. Keep your stated update commitment.",[12,520,521],{},"If you have found the cause:",[239,523,524,531,538,545],{},[12,525,526],{},[36,527,528,529],{},"Issue Identified - ",[105,530,248],{},[12,532,533,534,537],{},"We have identified the cause: ",[105,535,536],{},"one plain-language sentence about what went wrong",".",[12,539,540,541,544],{},"We are working on a fix. ",[105,542,543],{},"Specific features or flows"," remain affected.",[12,546,258,547],{},[105,548,261],{},[12,550,551],{},"If you have not found the cause:",[239,553,554,560,563],{},[12,555,556],{},[36,557,245,558],{},[105,559,248],{},[12,561,562],{},"We continue to investigate. Engineers are actively working to identify and resolve the issue.",[12,564,258,565],{},[105,566,261],{},[12,568,569],{},"Post the same update to the incident channel.",[56,571],{},[63,573,575],{"id":574},"t30-escalation-decision-point","T+30: Escalation Decision Point",[12,577,578],{},"If the incident is not resolved or clearly on the path to resolution within 30 minutes, escalate.",[76,580,582,588,594],{"className":581},[79],[33,583,585,587],{"className":584},[83],[85,586],{"disabled":87,"type":88}," Is a second engineer needed? 
Page them.",[33,589,591,593],{"className":590},[83],[85,592],{"disabled":87,"type":88}," Does the CEO or a customer-facing team member need to know? Brief them in 2 sentences in a separate channel. Do not add them to the incident channel unless they can help resolve the issue.",[33,595,597,599],{"className":596},[83],[85,598],{"disabled":87,"type":88}," Does a vendor need to be contacted? Open a support ticket with them now, not after the incident.",[12,601,602],{},"Escalation is not failure. Sitting on an unresolved SEV-1 for 45 minutes without escalating because you feel like you should be able to solve it alone is failure.",[56,604],{},[63,606,608],{"id":607},"resolution-service-restored","Resolution: Service Restored",[12,610,611],{},"When monitoring confirms recovery:",[76,613,615,621],{"className":614},[79],[33,616,618,620],{"className":617},[83],[85,619],{"disabled":87,"type":88}," Wait 5 minutes after the monitoring reports green before declaring resolved. Premature resolution declarations followed by a second failure are worse than staying in \"monitoring\" state.",[33,622,624,626],{"className":623},[83],[85,625],{"disabled":87,"type":88}," Update the status page to Resolved:",[239,628,629,636,647],{},[12,630,631],{},[36,632,633,634],{},"Resolved - ",[105,635,248],{},[12,637,638,639,642,643,646],{},"This incident is resolved. ",[105,640,641],{},"Service"," is fully operational as of ",[105,644,645],{},"time"," UTC.",[12,648,649,650,653,654,657],{},"Duration: ",[105,651,652],{},"X hours Y minutes",". Cause: ",[105,655,656],{},"one honest sentence",". 
We will publish a post-incident review within 48 hours.",[76,659,661,670,676],{"className":660},[79],[33,662,664,666,667,669],{"className":663},[83],[85,665],{"disabled":87,"type":88}," Post to ",[97,668,99],{}," with resolution time and who worked it",[33,671,673,675],{"className":672},[83],[85,674],{"disabled":87,"type":88}," Post to the incident channel with the resolution summary and a note that the channel will be archived",[33,677,679,681],{"className":678},[83],[85,680],{"disabled":87,"type":88}," Schedule the postmortem within 48 hours",[56,683],{},[63,685,687],{"id":686},"t2-hours-customer-email-decision","T+2 Hours: Customer Email Decision",[12,689,690],{},"Send a customer email for:",[76,692,693,696],{},[33,694,695],{},"Any SEV-1 incident",[33,697,698],{},"Any SEV-2 incident lasting over 30 minutes",[12,700,701],{},"Do not send a customer email for:",[76,703,704,707],{},[33,705,706],{},"Outages under 10 minutes with no confirmed user impact",[33,708,709],{},"SEV-3 incidents affecting a small subset of users",[12,711,712,713,718],{},"See ",[714,715,717],"a",{"href":716},"\u002Fblog\u002Fincident-communication-templates","Incident Communication Templates"," for email copy.",[56,720],{},[22,722,724],{"id":723},"incident-timeline-log-template","Incident Timeline Log Template",[12,726,727],{},"Copy this into the incident channel at the start of every significant incident:",[217,729,732],{"className":730,"code":731,"language":222},[220],"INCIDENT TIMELINE\n-----------------\n[time] UTC: Alert fired. [Monitor name] confirmed down from [regions].\n[time] UTC: Incident opened. IC: @name. Severity: SEV-[level].\n[time] UTC: Status page updated: Investigating.\n[time] UTC: [Action taken \u002F hypothesis tested \u002F finding]\n[time] UTC: [Action taken \u002F hypothesis tested \u002F finding]\n[time] UTC: Root cause identified: [description]\n[time] UTC: Fix deployed: [description]\n[time] UTC: Monitoring: watching for recovery.\n[time] UTC: Resolved. 
Recovery confirmed.\n",[97,733,731],{"__ignoreMap":225},[12,735,736],{},"Filling this in as the incident progresses takes 10 seconds per entry. Having it saves 2-3 hours of postmortem reconstruction.",[56,738],{},[22,740,742],{"id":741},"after-the-incident-48-hour-checklist","After the Incident: 48-Hour Checklist",[76,744,746,752,758,764,770],{"className":745},[79],[33,747,749,751],{"className":748},[83],[85,750],{"disabled":87,"type":88}," Postmortem written while details are fresh",[33,753,755,757],{"className":754},[83],[85,756],{"disabled":87,"type":88}," Action items assigned with owners and due dates",[33,759,761,763],{"className":760},[83],[85,762],{"disabled":87,"type":88}," Customer email sent (if applicable)",[33,765,767,769],{"className":766},[83],[85,768],{"disabled":87,"type":88}," Monitoring alert thresholds reviewed - could detection have been faster?",[33,771,773,775],{"className":772},[83],[85,774],{"disabled":87,"type":88}," Runbook updated - did anything in this runbook not fit the actual incident?",[22,777,779],{"id":778},"related-guides","Related Guides",[76,781,782,788,794,798,804],{},[33,783,784],{},[714,785,787],{"href":786},"\u002Fblog\u002Fincident-response-checklist-startups","Incident Response Checklist for Startups",[33,789,790],{},[714,791,793],{"href":792},"\u002Fblog\u002Fhow-to-communicate-during-service-outage","How to Communicate During a Service Outage",[33,795,796],{},[714,797,717],{"href":716},[33,799,800],{},[714,801,803],{"href":802},"\u002Fblog\u002Fincident-postmortem-template","How to Write an Incident Postmortem",[33,805,806],{},[714,807,809],{"href":808},"\u002Fblog\u002Finstant-website-downtime-alerts","How to Get Instant Alerts When Your Website Goes Down",[811,812,813],"style",{},"html pre.shiki code .sHwdD, html code.shiki .sHwdD{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#546E7A;--shiki-default-font-style:italic;--shiki-dark:#676E95;--shiki-dark-font-style:italic}html pre.shiki code 
.sBMFI, html code.shiki .sBMFI{--shiki-light:#E2931D;--shiki-default:#FFCB6B;--shiki-dark:#FFCB6B}html pre.shiki code .sfazB, html code.shiki .sfazB{--shiki-light:#91B859;--shiki-default:#C3E88D;--shiki-dark:#C3E88D}html pre.shiki code .sMK4o, html code.shiki .sMK4o{--shiki-light:#39ADB5;--shiki-default:#89DDFF;--shiki-dark:#89DDFF}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: 
var(--shiki-dark-text-decoration);}",{"title":225,"searchDepth":439,"depth":439,"links":815},[816,817,828,829,830],{"id":24,"depth":439,"text":25},{"id":60,"depth":439,"text":61,"children":818},[819,820,821,822,823,824,825,826,827],{"id":65,"depth":460,"text":66},{"id":123,"depth":460,"text":124},{"id":191,"depth":460,"text":192},{"id":233,"depth":460,"text":234},{"id":269,"depth":460,"text":270},{"id":514,"depth":460,"text":515},{"id":574,"depth":460,"text":575},{"id":607,"depth":460,"text":608},{"id":686,"depth":460,"text":687},{"id":723,"depth":439,"text":724},{"id":741,"depth":439,"text":742},{"id":778,"depth":439,"text":779},"tutorials","2026-06-17","A copy-ready incident response runbook for when your website goes down. Covers the first 60 minutes minute by minute: acknowledgment, diagnosis, communication, recovery, and handoff.","md",null,{},"\u002Fblog\u002Fwebsite-outage-response-runbook",11,{"title":5,"description":833},"blog\u002Fwebsite-outage-response-runbook","thZ-bgjAriV8Jr0IFb0IVzLfX22rXesSY-5EaGcnm7k",1782668047315]