On-Call Survival Guide: From First Alert to Postmortem

Most on-call guides cover what to set up. This one covers what to do when the alert fires.

The difference between a 15-minute incident and a 3-hour incident is usually not technical knowledge — it's process. Teams that recover quickly have a repeatable structure they follow under pressure. Teams that spiral have good intentions and no structure.

This guide covers the full incident arc: alert fires, you respond, you diagnose, you communicate, you fix it, you close it, and you make sure it doesn't happen again.

The First 5 Minutes

The first five minutes are the most chaotic. Don't try to fix the problem yet. Contain the chaos so fixing becomes possible.

Step 1: Acknowledge the alert (30 seconds)

Acknowledge in whichever tool fired the alert. This prevents duplicate responses and signals to your team that someone is handling it. If your rotation has a response-time SLA, the acknowledgment is the event that SLA is measured against.

Step 2: Post to the incident channel (1 minute)

Open your designated incident channel and post:

🔴 Investigating: [service name] / [alert name]
IC: @yourname
Status page updated: [link]

Don't wait until you know more. Post now, so your team can see the incident is being handled. Then create a dedicated channel (for example, #inc-2026-06-26-api-errors) so the technical thread stays separate from the main engineering channel.
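If you would rather not type this under pressure, the opening post and the channel name can be generated. This is a minimal Python sketch: the function names and the status page URL are hypothetical, and it only builds strings rather than calling any chat API.

```python
from datetime import date


def incident_channel_name(day: date, slug: str) -> str:
    """Build a channel name such as #inc-2026-06-26-api-errors."""
    return f"#inc-{day.isoformat()}-{slug}"


def initial_post(service: str, alert: str, commander: str, status_url: str) -> str:
    """Fill in the three-line opening post from the template above."""
    return (
        f"🔴 Investigating: {service} / {alert}\n"
        f"IC: @{commander}\n"
        f"Status page updated: {status_url}"
    )


# Hypothetical values, for illustration only.
print(incident_channel_name(date(2026, 6, 26), "api-errors"))
print(initial_post("api", "5xx error rate", "yourname", "https://status.example.com"))
```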

Step 3: Update the status page (1 minute)

Post "Investigating" before you know the cause. Customers who see an update within 3 minutes, even one with no new information, trust you more than customers who see nothing for 20 minutes. See incident communication templates for copy-ready status page text.

Step 4: Note the exact incident start time (30 seconds)

The timestamp when the alert fired (not when you acknowledged it) is the start time for your SLA calculations, your postmortem timeline, and your customer communication. Note it somewhere you won't lose it.
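To make the distinction concrete, here is a small Python sketch that measures everything from the moment the alert fired, not from the acknowledgment. The function name and the example timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def incident_timings(fired_at: datetime, acknowledged_at: datetime,
                     resolved_at: datetime) -> dict:
    """Return durations in minutes, both measured from the alert time."""
    return {
        "time_to_acknowledge_min": (acknowledged_at - fired_at).total_seconds() / 60,
        "incident_duration_min": (resolved_at - fired_at).total_seconds() / 60,
    }


# Hypothetical incident: alert at 14:32 UTC, acknowledged 14:36, resolved 15:17.
fired = datetime(2026, 6, 26, 14, 32, tzinfo=timezone.utc)
acked = datetime(2026, 6, 26, 14, 36, tzinfo=timezone.utc)
resolved = datetime(2026, 6, 26, 15, 17, tzinfo=timezone.utc)
timings = incident_timings(fired, acked, resolved)
print(timings)
```

Storing timezone-aware UTC timestamps avoids confusion when the postmortem timeline is assembled by people in different time zones.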

Step 5: Open the runbook if one exists

If there's a runbook for this alert, follow it before doing anything else. Runbooks exist because someone solved this problem before. Trust the documented process before going off-script, even if you have a strong intuition about the cause.

The Diagnosis Framework: DIME

When there's no runbook, work through this checklist in order. Most incidents have a cause in one of these four categories.

D — Deployments

What changed in the last 2 hours? A recent deployment is the most common cause of production incidents. Check your deployment log first, before looking anywhere else. The most expensive diagnostic mistakes happen when teams spend 30 minutes debugging application behavior while the actual cause is a config value that changed 90 minutes ago.

If a deployment is the likely cause, roll it back first, verify recovery, and only then investigate why the deployment caused the failure. Root-cause analysis of the deployment can wait until service is restored.
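The "last 2 hours" check is easy to script if your deployment log can be exported. The sketch below is in Python and assumes a hypothetical `Deployment` record; real deployment tools expose this data in their own formats.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to whatever your deploy log exports.
    service: str
    version: str
    deployed_at: datetime


def recent_deployments(log, incident_start, window=timedelta(hours=2)):
    """Deployments that landed in the window before the incident, newest first."""
    cutoff = incident_start - window
    hits = [d for d in log if cutoff <= d.deployed_at <= incident_start]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Example data, invented for illustration.
start = datetime(2026, 6, 26, 14, 32)
log = [
    Deployment("api", "v41", datetime(2026, 6, 26, 9, 5)),
    Deployment("config", "v7", datetime(2026, 6, 26, 13, 2)),
    Deployment("api", "v42", datetime(2026, 6, 26, 14, 10)),
]
suspects = recent_deployments(log, start)
for d in suspects:
    print(d.service, d.version, d.deployed_at.isoformat())
```

The newest deployment comes first because it is usually the first rollback candidate, but the 90-minute-old config change still shows up in the list.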

I — Infrastructure

Did anything change in the underlying infrastructure? Autoscaling events, database migrations, certificate rotations, new firewall rules, DNS changes. Cloud providers have their own status pages; check them in the first 10 minutes of any incident that touches their services.

M — Metrics

What do your metrics show? Look for the spike that correlates with the incident start time. Error rate, CPU, memory, database connections, request queue depth, external API latency. The metric that spiked at the exact time the first failure occurred is usually the signal you need.
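One way to make "the spike that correlates with the incident start time" mechanical is to find each metric's first spike and rank the metrics by how close that spike is to the start time. This Python sketch assumes a deliberately crude spike definition (a sample above twice a known baseline); the metric names, the baselines, and the factor of 2 are illustrative assumptions, not a recommendation.

```python
from datetime import datetime


def first_spike(samples, baseline, factor=2.0):
    """Return the timestamp of the first sample above factor * baseline, or None."""
    for ts, value in samples:
        if value > baseline * factor:
            return ts
    return None


def rank_by_proximity(metrics, incident_start):
    """metrics maps name -> (samples, baseline); samples are (timestamp, value) pairs.

    Returns (name, spike_time) pairs, closest to the incident start first.
    Metrics that never spiked are left out.
    """
    spikes = []
    for name, (samples, baseline) in metrics.items():
        ts = first_spike(samples, baseline)
        if ts is not None:
            spikes.append((name, ts))
    return sorted(spikes, key=lambda item: abs(item[1] - incident_start))


def t(hour, minute):
    """Shorthand for a timestamp on the example day."""
    return datetime(2026, 6, 26, hour, minute)


# Invented example data.
metrics = {
    "cpu_percent": ([(t(14, 0), 40), (t(14, 32), 45)], 40),
    "db_connection_errors": ([(t(14, 30), 1), (t(14, 32), 9)], 1),
    "queue_depth": ([(t(14, 20), 10), (t(14, 45), 50)], 10),
}
ranked = rank_by_proximity(metrics, t(14, 32))
print(ranked[0][0])
```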

E — External

Is a third-party dependency the root cause? Payment processor, email provider, CDN, authentication service, cloud infrastructure. Check vendor status pages before you spend 30 minutes debugging your own code. Stripe, AWS, Cloudflare, and Twilio all publish one, and a glance in the first 10 minutes of any incident that touches their services can rule the vendor in or out.

The 5-minute rule

Move to the next item on the DIME checklist if you've spent 5 minutes on a hypothesis without finding confirmation. Staring at the same logs longer doesn't produce new information. A systematic switch to the next category usually does.

When to escalate

Escalate after 15 minutes without identifying a root cause. The cost of waking someone up is lower than the cost of 45 more minutes of solo diagnosis. When you escalate:

State what you've ruled out (not just what you've tried)
Share exact error messages, not your interpretation of them
Include the relevant timestamps
State your current hypothesis, if any
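The escalation rule and the message contents above can be captured in a few lines. This is a Python sketch under stated assumptions: the function names and the message layout are hypothetical, and the 15-minute threshold is the one from this guide.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=15)


def should_escalate(diagnosis_started: datetime, now: datetime,
                    root_cause_found: bool) -> bool:
    """True once 15 minutes have passed without a root cause."""
    return not root_cause_found and now - diagnosis_started >= ESCALATE_AFTER


def escalation_message(ruled_out, errors, timestamps, hypothesis=None) -> str:
    """Assemble the four items an escalation should carry."""
    lines = ["Ruled out: " + ", ".join(ruled_out)]
    lines += [f"Error (verbatim): {e}" for e in errors]
    lines.append("Timestamps: " + ", ".join(timestamps))
    lines.append(f"Hypothesis: {hypothesis}" if hypothesis else "Hypothesis: none yet")
    return "\n".join(lines)


# Invented example values.
started = datetime(2026, 6, 26, 14, 36)
if should_escalate(started, started + timedelta(minutes=16), root_cause_found=False):
    print(escalation_message(
        ruled_out=["recent deployments", "infrastructure changes"],
        errors=["FATAL: too many connections"],
        timestamps=["first failure 14:32"],
    ))
```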

Communication During the Incident

Communication during an incident is a separate skill from debugging. The on-call engineer should not be doing both simultaneously. When a second person is available, assign one person to technical diagnosis and one person to communication.

The 15-minute update rule

Post a status update to the status page and to the internal incident channel every 15 minutes, without exception. Even when there's nothing new to report. "Still investigating, no changes to report" is a valid update. Silence is not.

Customers and stakeholders who see no update for 30 minutes assume the team is either not actively working on it or hiding something. Neither is a good impression to create during a production outage.
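If the person handling communication wants a reminder rather than relying on memory, the 15-minute rule reduces to simple timestamp arithmetic. A minimal Python sketch with hypothetical function names; wiring it to a real reminder or bot is left out.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime) -> datetime:
    """When the next status update must be posted."""
    return last_update + UPDATE_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once the 15-minute window since the last update has elapsed."""
    return now >= next_update_due(last_update)


# Invented example: last update posted at 14:35.
last = datetime(2026, 6, 26, 14, 35)
print(next_update_due(last).strftime("%H:%M"))
print(update_overdue(last, datetime(2026, 6, 26, 14, 49)))
```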

Severity levels

Level	Description	Response time	External communication
P1	Full outage, all users affected	Immediate	Status page + customer email
P2	Major feature broken, significant user impact	Within 5 min	Status page update
P3	Minor feature broken, small user subset	Within 30 min	Status page if customer-visible
P4	Internal tools, no customer impact	Business hours	Internal only

Classify at the start of the incident and adjust if the scope changes. Misclassifying a P1 as a P2 delays communication and escalation.
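The table can be encoded as data so that tooling and people agree on what each level requires. The Python sketch below copies the table's wording; the `classify` function is a deliberate simplification with three hypothetical yes/no inputs, and real classification usually needs judgment about how many users count as "significant".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    response_time: str
    external_communication: str


# Values taken from the severity table above.
POLICIES = {
    "P1": SeverityPolicy("Full outage, all users affected", "Immediate",
                         "Status page + customer email"),
    "P2": SeverityPolicy("Major feature broken, significant user impact",
                         "Within 5 min", "Status page update"),
    "P3": SeverityPolicy("Minor feature broken, small user subset",
                         "Within 30 min", "Status page if customer-visible"),
    "P4": SeverityPolicy("Internal tools, no customer impact",
                         "Business hours", "Internal only"),
}


def classify(customer_impact: bool, full_outage: bool,
             major_feature_broken: bool) -> str:
    """Simplified decision order: no customer impact, then outage, then feature size."""
    if not customer_impact:
        return "P4"
    if full_outage:
        return "P1"
    if major_feature_broken:
        return "P2"
    return "P3"


level = classify(customer_impact=True, full_outage=False, major_feature_broken=True)
print(level, "->", POLICIES[level].external_communication)
```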

Closing the Incident

Before declaring resolved

Confirm the fix is deployed and has been running stably for at least 5 minutes
Verify error rate has returned to baseline — not just improved, but returned
Check from multiple regions if your monitoring supports it
Confirm no secondary failures have appeared

A false "resolved" declaration followed by a second failure is worse than staying in "monitoring" status longer. Users who see an outage end and then resume 10 minutes later lose significantly more trust than users who see an extended monitoring window.
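The first two checks lend themselves to automation. The Python sketch below assumes you can pull recent error-rate samples as (timestamp, rate) pairs and that you know the baseline; the function name and parameters are hypothetical, and it does not cover the multi-region or secondary-failure checks.

```python
from datetime import datetime, timedelta


def safe_to_resolve(error_rates, fix_deployed_at, now, baseline,
                    stable_for=timedelta(minutes=5)):
    """True only if the fix has run for the full window and every error-rate
    sample inside that window is at or below baseline."""
    if now - fix_deployed_at < stable_for:
        return False
    window = [rate for ts, rate in error_rates if now - stable_for <= ts <= now]
    # An empty window means no evidence, so it does not count as recovered.
    return bool(window) and all(rate <= baseline for rate in window)


# Invented example: one sample per minute from 15:15 to 15:20.
now = datetime(2026, 6, 26, 15, 20)
deployed = datetime(2026, 6, 26, 15, 10)
recovered = [(datetime(2026, 6, 26, 15, m), 0.2) for m in range(15, 21)]
improved_only = recovered[:-1] + [(now, 0.5)]

print(safe_to_resolve(recovered, deployed, now, baseline=0.2))
print(safe_to_resolve(improved_only, deployed, now, baseline=0.2))
```

Requiring every sample to be at or below baseline encodes "returned, not just improved": a single elevated sample keeps the incident in monitoring.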

After declaring resolved

Post the resolution to your status page and your incident channel. Within 2 hours, send a customer email for P1 and major P2 incidents. The incident communication templates provide copy for both.

Writing Runbooks That Get Used

A runbook is only as useful as it is usable under pressure. At 3 AM with adrenaline running, a 5-page document is not useful. A short checklist with specific commands is.

The minimal runbook structure

# [Service Name] — [Alert Name]

What this alert means: [1 sentence]
Escalate to @name if not resolved in 15 minutes

## Step 1: Check [X]
Command: [exact command or link]
Expected output: [what normal looks like]
If abnormal: [next step or escalate]

## Step 2: Check [Y]
Command: [exact command or link]
Expected output: [what normal looks like]
If abnormal: [next step or escalate]

## Step 3: Escalate
Page: @name (primary), @name (backup)
Include: what you've ruled out, error messages, timestamps

Three to five steps, exact commands, explicit escalation path. Everything else goes in a linked architecture doc, not in the runbook itself.

What to leave out

Background and history (link to a separate doc)
"Check the obvious things" — be specific about which things
Long prose explanations — use numbered steps and commands
Why a decision was made — save that for the postmortem

Which alerts need runbooks

Every P1 and P2 alert. Any alert that has previously taken more than 20 minutes to investigate. Any alert where the on-call engineer had to ask a colleague what to do.

If you've had 10 incidents in the past year with no runbooks, you've paid to figure out each one from scratch 10 times. Write the runbook once.

The On-Call Setup Checklist

Before your rotation starts:

Alert routing reaches your phone, not just email
Escalation path documented: who is paged if you don't respond in 5 minutes
Monitoring dashboard bookmarked and accessible on mobile
Status page access confirmed from your phone
Runbooks for your top 5 alerts are accessible — not "somewhere in the wiki"
Incident channel designated or ready to create
Current deployments noted: anything shipped in the last 48 hours
Production access confirmed: you can run queries, restart services, SSH if needed

After your rotation ends:

Open issues documented and handed to the next on-call
Postmortems for incidents during your rotation completed or scheduled
Missing runbooks added to backlog
Alert threshold issues flagged for improvement

Building a Healthier On-Call Culture

On-call is a tax on engineering teams. High-functioning teams minimize it. Low-functioning teams normalize it.

Signs the on-call rotation is unsustainable:

More than 2–3 pages per week per on-call engineer
Pages firing between midnight and 6 AM more than once a week
Engineers muting alert channels
Turnover correlated with on-call rotation
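The first two signs are measurable from your paging history. This Python sketch assumes a list of page timestamps for one engineer, in that engineer's local time; it treats 3 pages per week as the upper bound of the "2–3" range, which is an interpretation rather than a fixed rule.

```python
from datetime import datetime


def rotation_warning_signs(pages, weeks, max_pages_per_week=3,
                           max_night_pages_per_week=1):
    """pages: local-time datetimes of pages received by one engineer."""
    night_pages = [p for p in pages if 0 <= p.hour < 6]  # midnight to 6 AM
    signs = []
    if len(pages) / weeks > max_pages_per_week:
        signs.append("too many pages per week")
    if len(night_pages) / weeks > max_night_pages_per_week:
        signs.append("too many pages between midnight and 6 AM")
    return signs


# Invented history: 8 pages over 2 weeks, 3 of them at night.
history = [
    datetime(2026, 6, 15, 2, 10), datetime(2026, 6, 16, 11, 0),
    datetime(2026, 6, 17, 3, 45), datetime(2026, 6, 18, 16, 20),
    datetime(2026, 6, 22, 9, 30), datetime(2026, 6, 23, 4, 5),
    datetime(2026, 6, 24, 13, 15), datetime(2026, 6, 25, 19, 40),
]
print(rotation_warning_signs(history, weeks=2))
```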

The fix is almost always the same four changes:

Audit alert thresholds. Most teams have monitors set to alert on single failures from single locations. Requiring 2 consecutive failures from multiple regions eliminates the majority of false positives without meaningfully delaying real incident detection.
Add multi-region consensus. A single-region monitor that sees a transient routing issue fires an alert that wakes someone up for a problem that self-resolved in 30 seconds. Multi-region consensus means an alert only fires when multiple independent probe locations all confirm the failure.
Write runbooks for the 5 most common alerts. Teams that have runbooks recover faster and escalate less. The time investment in writing a runbook is paid back in the first incident that uses it.
Hold blameless postmortems. Teams that run postmortems fix systemic causes instead of just patching symptoms. The same incident type stops recurring.
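The first two changes combine into one alerting rule: a region counts as failing only after consecutive failed checks, and the alert fires only when enough regions agree. The Python sketch below shows the idea; the data shape, the function name, and the default of 2 regions are assumptions, and monitoring products implement this in their own ways.

```python
def should_alert(history, consecutive=2, min_regions=2):
    """history maps region -> list of check results, oldest first.

    True in a result list means that check failed. A region is failing only
    if its most recent `consecutive` checks all failed.
    """
    failing_regions = [
        region for region, results in history.items()
        if len(results) >= consecutive and all(results[-consecutive:])
    ]
    return len(failing_regions) >= min_regions


# Invented example data.
real_outage = {
    "us-east": [False, True, True],
    "eu-west": [True, True],
    "ap-south": [False, False],
}
transient_blip = {
    "us-east": [False, True],   # one failed check from one region
    "eu-west": [False, False],
}
print(should_alert(real_outage))
print(should_alert(transient_blip))
```

With a 30-second check interval, requiring 2 consecutive failures delays detection by roughly one interval, which is the trade-off the first item describes.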

The end state is an on-call rotation where real incidents are uncommon, detection is fast, runbooks exist for known failure patterns, and the engineer on call can sleep.

Quick Reference Card

Keep this accessible during incidents.

Alert fires
    │
    ├─ Acknowledge in alerting tool
    ├─ Post to #incidents (template: 🔴 Investigating...)
    ├─ Update status page ("Investigating")
    ├─ Note incident start time
    │
    └─ Runbook exists for this alert?
           │
           ├─ YES: Follow it
           │
           └─ NO: DIME checklist
                    │
                    ├─ D: Recent deployment? → Roll back, then investigate
                    ├─ I: Infrastructure change? → Revert if possible
                    ├─ M: Metrics spike? → Correlate with incident start time
                    ├─ E: External dependency? → Check vendor status page
                    │
                    └─ 15 min, no root cause?
                           └─ Escalate: what you've ruled out +
                              exact errors + timestamps + hypothesis

For status page update text, customer email templates, and Slack announcement copy, see incident communication templates.

For writing the postmortem after the incident, see how to write an incident postmortem.

Frequently Asked Questions

What's the difference between an incident commander and the on-call engineer?

The on-call engineer is paged first and does the initial diagnosis. The incident commander (IC) coordinates the response once more people are involved — tracking progress, managing communication, deciding on escalation. For small teams, one person often fills both roles. For larger incidents, separating them prevents the technical diagnosis from being interrupted by communication tasks.

How long should I try to diagnose before escalating?

15 minutes. If you haven't identified the root cause in 15 minutes, the next set of eyes will almost always make a difference. The cost of waking someone up is real but bounded. The cost of a P1 running 45 minutes longer because you didn't want to wake anyone up is higher.

Should I roll back a deployment before fully diagnosing the cause?

Yes, for P1 incidents. Roll back first, verify recovery, then investigate why the deployment caused the failure. The priority during an active incident is restoring service, not understanding the root cause. The postmortem is for understanding.

How do I handle an incident I've never seen before?

Work through DIME systematically. If you've exhausted DIME without a hypothesis, escalate with everything you've found. "I've ruled out recent deployments, infrastructure changes, and our external dependencies. Metrics show a sharp increase in database connection errors starting at 14:32. I don't have a hypothesis yet" is a complete escalation message.

What makes on-call sustainable long-term?

Four things: a low false-positive rate (under 1 false page per week), runbooks for the most common alerts, fast detection that limits incident duration, and blameless postmortems that prevent recurrence. Teams that address all four rarely have retention problems from on-call.