Website Outage Response Runbook: What to Do in the First 60 Minutes

When your website goes down, the first 10 minutes are chaotic. People flood Slack. Someone starts investigating. Someone else starts a different investigation. Nobody has told customers anything. Nobody has posted to the status page. The CEO is DMing the on-call engineer.

A runbook stops the chaos before it starts. It replaces "what do we do?" with a sequence your team follows every time, regardless of who is on call.

This is a copy-ready runbook template for the first 60 minutes of a production outage.

Before You Need This: Prerequisites

This runbook assumes three things are in place:

Uptime monitoring with alerting - you know about the outage from your monitoring tool, not from a customer tweet
A status page - a public URL where customers check service status
Defined severity levels - at minimum, a distinction between "all users affected" and "some users affected"

If any of these are missing, set them up before the next incident. Monitoring takes under 5 minutes to set up in Vantaj. Not having it is the single most expensive preparation gap.

The Runbook

T+0: Alert Fires

The alert lands in Slack (or via SMS, email, or phone call).

Do these three things in the first 2 minutes:

Open the monitoring dashboard. Note which service, which regions, and what error.
Claim the incident. Post in #incidents: "I'm on this. SEV-1/2/3." This single message prevents the situation where three people each assume someone else is handling it.
Check if it is real. Go directly to the affected URL and confirm the error from your browser.

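If you cannot reproduce the error manually, check whether your monitoring uses multi-region consensus, meaning it alerts only when several probe locations agree that the check failed. If it does and it still fired, treat the outage as real: it is probably affecting regions or networks other than yours. If it does not, you may have a false positive.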
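To make the consensus idea concrete, here is a minimal sketch of the rule a monitoring tool might apply. The function name, the region names, and the "more than half" quorum are assumptions for illustration, not how any particular product works.

```python
from typing import Mapping


def consensus_down(region_results: Mapping[str, bool], quorum: float = 0.5) -> bool:
    """Return True when more than `quorum` of the regions report a failed check.

    region_results maps a region name to True (check passed) or False (check failed).
    """
    if not region_results:
        # No probe data at all: do not treat that as a confirmed outage.
        return False
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures / len(region_results) > quorum
```

A single failing probe out of three does not trip this rule, which is why an alert that fires under consensus deserves to be taken seriously even when the site loads for you.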

T+2: Severity Classification

Classify the incident. This determines the next steps.

Severity	Criteria	Response
SEV-1	Full outage, all users affected, or data loss risk	Wake everyone. Status page immediately.
SEV-2	Partial outage or degraded service, significant user impact	On-call investigates. Status page within 5 min.
SEV-3	Minor issue, small subset of users affected	On-call investigates. Internal tracking only.

When in doubt, classify up. A SEV-1 that turns out to be a SEV-2 is fine. A SEV-2 that was actually a SEV-1 costs you 20 minutes of delayed communication with customers.
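The table and the "classify up" rule can be sketched as a small function. The parameter names and the `unsure` flag are assumptions made for this example; adjust the criteria to your own severity definitions.

```python
def classify_severity(
    all_users_affected: bool,
    data_loss_risk: bool,
    significant_impact: bool,
    unsure: bool = False,
) -> int:
    """Return 1, 2, or 3 following the severity table in this runbook."""
    if all_users_affected or data_loss_risk:
        level = 1
    elif significant_impact:
        level = 2
    else:
        level = 3
    # "When in doubt, classify up": move one level more severe, never past SEV-1.
    if unsure and level > 1:
        level -= 1
    return level
```

For example, a partial outage you are not sure about comes out as SEV-1 (`classify_severity(False, False, True, unsure=True)`), which matches the advice that over-classifying is the cheaper mistake.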

T+3: Open the Incident Channel

Create a Slack channel: #inc-YYYYMMDD-[short-description] (e.g., #inc-20260628-api-down)
Post the incident brief in the channel:

🔴 INCIDENT OPEN

Service: [which service or endpoint]
Impact: [what users experience]
Severity: SEV-[1/2/3]
Incident Commander: @you
Started: [time] UTC
Monitoring link: [direct link to the failing monitor]
Status page: [link]

All incident discussion here only.

This channel becomes the incident timeline. Everything that happens, every hypothesis tested, every change made - post it here as it happens. You will need this log for the postmortem.
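If you want the channel name and brief generated rather than typed under pressure, a sketch like the following could work. The function names are hypothetical, and the code only builds strings; creating the channel and posting the message is left to whatever Slack tooling you already use. Slack restricts channel names to lowercase letters, numbers, hyphens, and underscores, with a length limit of 80 characters at the time of writing, so the description is reduced to a slug.

```python
import re
from datetime import datetime, timezone


def incident_channel_name(description: str, started: datetime) -> str:
    """Build inc-YYYYMMDD-short-description from a timezone-aware start time."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{started.astimezone(timezone.utc):%Y%m%d}-{slug}"
    return name[:80].rstrip("-")


def incident_brief(service, impact, severity, commander, started, monitor_url, status_url):
    """Fill in the incident brief template from this runbook."""
    started_utc = started.astimezone(timezone.utc)
    return "\n".join([
        "🔴 INCIDENT OPEN",
        "",
        f"Service: {service}",
        f"Impact: {impact}",
        f"Severity: SEV-{severity}",
        f"Incident Commander: {commander}",
        f"Started: {started_utc:%H:%M} UTC",
        f"Monitoring link: {monitor_url}",
        f"Status page: {status_url}",
        "",
        "All incident discussion here only.",
    ])
```

Pass timezone-aware datetimes. A naive datetime would be interpreted as local time, which is exactly the kind of ambiguity the UTC convention in the brief is meant to avoid.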

T+5: Update the Status Page

For SEV-1 and SEV-2, update the status page before you know the cause. Post this:

Investigating - [Service Name]
We are investigating reports of [service] being unavailable. Engineers are actively working on this.
Next update: by [clock time of T+20] UTC

Customers who see this update stop filing support tickets. Support ticket volume during an acknowledged incident drops by 60-80% compared to a silent outage. You get your investigation time back.

T+5 to T+30: Diagnosis

Your goal in this window: identify the category of the problem. You do not need the root cause yet. You need enough to either restore service or escalate.

Work through this checklist in order. When a quick look rules a check out, move straight to the next one.

Check 1: Recent changes

Was there a deployment in the last 30 minutes? Check your deploy history. git log --oneline -10 lists the ten most recent commits on the current branch.
Was there a config change, environment variable update, or infrastructure change?
If yes to either: rollback first, investigate second. Restoring service takes priority over understanding the cause.

Check 2: External dependencies

Check the status pages of your key dependencies: Stripe, Auth0, AWS, Cloudflare, Vercel, etc.
If a dependency is down and you use it in the affected flow: that is your cause. Update the status page with the dependency name.
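Opening five vendor status pages by hand is slow. Many vendors host their status page on Atlassian Statuspage, which exposes a JSON summary at /api/v2/status.json. The sketch below assumes that format. Which of your dependencies actually use it is something to verify in advance, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def parse_statuspage(payload: dict) -> tuple[bool, str]:
    """Return (is_operational, description) from a Statuspage status.json payload."""
    status = payload.get("status", {})
    indicator = status.get("indicator", "unknown")
    description = status.get("description", "")
    # In this format, "none" means no active incident; anything else needs a look.
    return indicator == "none", description or indicator


def fetch_status(base_url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Fetch and parse <base_url>/api/v2/status.json (assumes a Statuspage-hosted page)."""
    url = f"{base_url.rstrip('/')}/api/v2/status.json"
    with urlopen(url, timeout=timeout) as resp:
        return parse_statuspage(json.load(resp))
```

Treat a green result as a hint, not proof: vendor status pages often lag the real outage by several minutes.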

Check 3: Server health

CPU: is it pegged?
Memory: is it near capacity?
Disk: is it full?

For each, run top, free -h, and df -h respectively. On cloud providers, check the resource graphs in your dashboard for a spike around the time the outage started.
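If you would rather get a yes/no answer than read raw output, the same checks can be sketched with the standard library. The thresholds below are assumptions, not recommendations. Memory is left out because the standard library has no portable call for it, and os.getloadavg() is only available on Unix-like systems.

```python
import os
import shutil


def health_snapshot(path: str = "/") -> dict:
    """Collect 1-minute load per CPU and disk usage for `path` (Unix only)."""
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    disk = shutil.disk_usage(path)
    return {
        "load_per_cpu": load1 / cpus,
        "disk_used_pct": 100 * disk.used / disk.total,
    }


def health_warnings(snapshot: dict, max_load_per_cpu: float = 1.0,
                    max_disk_pct: float = 90.0) -> list[str]:
    """Return a human-readable warning for each threshold the snapshot exceeds."""
    found = []
    if snapshot["load_per_cpu"] > max_load_per_cpu:
        found.append(f"CPU load is {snapshot['load_per_cpu']:.2f} per core")
    if snapshot["disk_used_pct"] > max_disk_pct:
        found.append(f"Disk is {snapshot['disk_used_pct']:.0f}% full")
    return found
```

Calling `health_warnings(health_snapshot())` on the affected host returns an empty list when both values are under their thresholds.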

Check 4: Application errors

Check application logs for the time the outage started
Look for: stack traces, connection errors, timeout messages, OOM kills
Check your error tracker (Sentry, Datadog, etc.) for new error types that appeared at the outage start time

Check 5: Database

Can the application connect to the database?
Are there long-running queries blocking normal operations?
Is the connection pool exhausted?

# PostgreSQL connection count
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

Check 6: DNS and SSL

Does the domain resolve? dig yourdomain.com
Is the SSL certificate valid? echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates
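The same two checks can be scripted so they are ready before an incident. This is a sketch using only the Python standard library; the function names are hypothetical. Note that cert_not_after uses the default verification settings, so an expired or mismatched certificate raises ssl.SSLCertVerificationError instead of returning a date, which is itself the diagnosis.

```python
import socket
import ssl


def resolve(host: str) -> list[str]:
    """Return the sorted set of IP addresses the host resolves to."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now_ts: float) -> float:
    """Days between now_ts (seconds since the epoch) and the certificate's expiry."""
    return (ssl.cert_time_to_seconds(not_after) - now_ts) / 86400
```

During an incident you would call `resolve("yourdomain.com")` first, because a socket.gaierror there means the problem is DNS and the certificate check cannot tell you anything yet.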

T+20: First Status Update (if unresolved)

Post an update to the status page even if you have not resolved the issue. Keep your stated update commitment.

If you have found the cause:

Issue Identified - [Service Name]
We have identified the cause: [one plain-language sentence about what went wrong].
We are working on a fix. [Specific features or flows] remain affected.
Next update: by [clock time 20 minutes from now] UTC

If you have not found the cause:

Investigating - [Service Name]
We continue to investigate. Engineers are actively working to identify and resolve the issue.
Next update: by [clock time 20 minutes from now] UTC

Post the same update to the incident channel.
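Status page readers do not know when the incident started, so a concrete clock time is easier for them to act on than a relative offset. A minimal sketch for producing that line, assuming a 20-minute cadence and a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone


def next_update_line(posted_at: datetime, interval_minutes: int = 20) -> str:
    """Return the 'Next update' line for an update posted at `posted_at` (timezone-aware)."""
    due = posted_at.astimezone(timezone.utc) + timedelta(minutes=interval_minutes)
    return f"Next update: by {due:%H:%M} UTC"
```

An update posted at 14:20 UTC yields "Next update: by 14:40 UTC". Whatever time you publish, that is the commitment you are keeping at the next checkpoint.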

T+30: Escalation Decision Point

If the incident is not resolved or clearly on the path to resolution within 30 minutes, escalate.

Is a second engineer needed? Page them.
Does the CEO or a customer-facing team member need to know? Brief them in 2 sentences in a separate channel. Do not add them to the incident channel unless they can help resolve the issue.
Does a vendor need to be contacted? Open a support ticket with them now, not after the incident.

Escalation is not failure. Sitting on an unresolved SEV-1 for 45 minutes without escalating because you feel like you should be able to solve it alone is failure.

Resolution: Service Restored

When monitoring confirms recovery:

Wait 5 minutes after monitoring reports green before declaring the incident resolved. Declaring it resolved and then failing again is worse than staying in the "monitoring" state a little longer.
Update the status page to Resolved:

Resolved - [Service Name]
This incident is resolved. Service is fully operational as of [time] UTC.
Duration: [X hours Y minutes]. Cause: [one honest sentence]. We will publish a post-incident review within 48 hours.

Post to #incidents with resolution time and who worked it
Post to the incident channel with the resolution summary and a note that the channel will be archived
Schedule the postmortem within 48 hours
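The 5-minute wait before declaring resolved can be sketched as a check over recent monitoring results. The data shape (a list of timestamp and pass/fail pairs) and the function name are assumptions for this example. The rule it implements: the first passing check since the most recent failure must be at least 5 minutes old.

```python
from datetime import datetime, timedelta, timezone


def stable_for(checks, now, window=timedelta(minutes=5)) -> bool:
    """checks: list of (timestamp, ok) tuples. True once recovery has held for `window`."""
    last_failure = max((ts for ts, ok in checks if not ok), default=None)
    greens_since = [
        ts for ts, ok in checks
        if ok and (last_failure is None or ts > last_failure)
    ]
    if not greens_since:
        # Still failing, or no data yet.
        return False
    return now - min(greens_since) >= window
```

A single failed check inside the window resets the clock, which is the behavior you want: a flapping service is not a recovered service. The sketch does not detect stale data, so confirm the monitor is still reporting before you trust a True result.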

T+2 Hours: Customer Email Decision

Send a customer email for:

Any SEV-1 incident
Any SEV-2 incident lasting over 30 minutes

Do not send a customer email for:

Outages under 10 minutes with no confirmed user impact
SEV-3 incidents affecting a small subset of users

See Incident Communication Templates for email copy.
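The email rules above reduce to a short decision function. This sketch assumes the "send" rules take precedence when they overlap with the "do not send" rules, so a SEV-1 always triggers an email regardless of duration. The function name is hypothetical.

```python
def should_email_customers(severity: int, duration_minutes: float) -> bool:
    """Apply the customer email rules from this runbook."""
    if severity == 1:
        return True  # any SEV-1 incident
    if severity == 2 and duration_minutes > 30:
        return True  # SEV-2 lasting over 30 minutes
    # SEV-3 incidents and short SEV-2 incidents: no customer email.
    return False
```

The rules leave no explicit answer for a SEV-2 between 10 and 30 minutes. This sketch treats it as "do not send", so decide on your own policy for that gap before you need it.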

Incident Timeline Log Template

Copy this into the incident channel at the start of every significant incident:

INCIDENT TIMELINE
-----------------
[time] UTC: Alert fired. [Monitor name] confirmed down from [regions].
[time] UTC: Incident opened. IC: @name. Severity: SEV-[level].
[time] UTC: Status page updated: Investigating.
[time] UTC: [Action taken / hypothesis tested / finding]
[time] UTC: [Action taken / hypothesis tested / finding]
[time] UTC: Root cause identified: [description]
[time] UTC: Fix deployed: [description]
[time] UTC: Monitoring: watching for recovery.
[time] UTC: Resolved. Recovery confirmed.

Filling this in as the incident progresses takes 10 seconds per entry. Having it saves 2-3 hours of postmortem reconstruction.
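To keep entries consistent when several people are posting, a tiny helper can stamp each line in the template's format. This is a sketch with a hypothetical function name. It includes seconds, which the template does not require but which can help when ordering events later.

```python
from datetime import datetime, timezone
from typing import Optional


def timeline_entry(message: str, when: Optional[datetime] = None) -> str:
    """Format one timeline line; defaults to the current time in UTC."""
    stamp = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{stamp:%H:%M:%S} UTC: {message}"
```

For example, passing a 14:05:09 UTC timestamp with the message "Alert fired." produces "14:05:09 UTC: Alert fired.", ready to paste into the incident channel.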

After the Incident: 48-Hour Checklist

Postmortem written while details are fresh
Action items assigned with owners and due dates
Customer email sent (if applicable)
Monitoring alert thresholds reviewed - could detection have been faster?
Runbook updated - did anything in this runbook not fit the actual incident?