Incident Response Checklist for Startups: From Zero to Production-Ready

Most startups have no incident process until the first major outage. This checklist covers everything you need to build one: monitoring, on-call, status pages, severity levels, and postmortems.

Vantaj Team · February 22, 2026 · 12 min read

The first major production outage at a startup hits differently when there is no process. The alert fires at 1 AM. Nobody knows who is supposed to respond. The person who does respond spends 20 minutes figuring out what to look at. Customers are tweeting. The status page is empty. The CEO is awake.

That first bad incident is expensive in ways that compound: customer churn, trust damage, and now the anxiety of knowing your next incident will look the same.

This checklist builds the minimum viable incident response program for a startup. Not enterprise ITSM. Not 40-page runbooks. The smallest set of things that makes a serious incident manageable rather than chaotic.


Phase 1: Detection Foundation

You cannot respond to an incident you do not know about. Detection is the highest-leverage investment in incident response.

Monitoring setup checklist

  • HTTP uptime monitoring on every production endpoint
    • Homepage
    • Login / auth endpoint
    • API health check (/health or /api/health)
    • Checkout or payment flow (if applicable)
    • Any endpoint in your SLA or customer contract
  • Check interval: 1 minute for production services, 5 minutes for non-critical
  • Multi-region verification: checks should confirm failures from multiple locations before alerting. Single-region checks fire false positives on network path issues that have nothing to do with your service.
  • SSL certificate monitoring with alerts at 30 and 14 days before expiry
    • A surprising number of startup outages are expired certificates. Let's Encrypt auto-renewal silently fails more often than you'd expect.
  • Domain expiry monitoring at 60 and 30 days before expiry
    • Domain expiry takes your entire service offline with no warning from your infrastructure.
  • Heartbeat monitoring for every cron job or background worker that customers depend on
    • If your billing job fails silently for a week, you find out from customers, not your monitoring.
  • Third-party vendor monitoring for the external services your product relies on
    • Stripe, Auth0, Twilio, SendGrid, Cloudflare - their outages become your incidents.
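The multi-region rule in the checklist above can be sketched in a few lines. This is a minimal illustration, not how any particular monitoring tool implements it: the `ProbeResult` type, the region names, and the two-region threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str  # hypothetical region label, e.g. "us-east"
    ok: bool     # True if the check succeeded from this region


def confirmed_down(results: list[ProbeResult], min_failing_regions: int = 2) -> bool:
    """Return True only when enough distinct regions report a failure.

    The default of 2 is an assumption; the checklist only says "multiple".
    """
    failing = {r.region for r in results if not r.ok}
    return len(failing) >= min_failing_regions


results = [
    ProbeResult("us-east", ok=False),
    ProbeResult("eu-west", ok=True),
    ProbeResult("ap-south", ok=True),
]
print(confirmed_down(results))  # False: only one region sees a failure
```

A single failing region is treated as a likely network path issue and does not page anyone; the alert fires once a second region agrees.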

Alert channel setup

  • Slack (or Teams) channel dedicated to incidents: #incidents or #alerts
  • Email alert to a shared group address, not an individual
  • SMS or phone call for P1 incidents on critical services (stops overnight outages from going undetected until morning)

Detection quality check

Run this monthly:

| Question | Target |
| --- | --- |
| How long would it take us to know the homepage is down right now? | Under 2 minutes |
| Would we know if the checkout API broke at 3 AM? | Yes |
| When does our SSL certificate expire? | Know the date |
| When does our domain registration expire? | Know the date |
| Would we know if our daily billing job failed silently? | Yes |

If you miss the target on any of these, you have a monitoring gap that will eventually become an incident you find out about from customers.
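For the two expiry questions, the alert windows from the monitoring checklist (30 and 14 days for SSL, 60 and 30 days for domains) reduce to simple date arithmetic. The sketch below assumes you already know the expiry date; the function name and the sample dates are hypothetical.

```python
from datetime import date


def expiry_alerts_due(expires_on: date, today: date,
                      thresholds_days: tuple[int, ...]) -> list[int]:
    """Return the alert thresholds (in days) that have already been crossed."""
    days_left = (expires_on - today).days
    return [t for t in sorted(thresholds_days, reverse=True) if days_left <= t]


SSL_THRESHOLDS = (30, 14)     # from the SSL checklist item
DOMAIN_THRESHOLDS = (60, 30)  # from the domain checklist item

today = date(2026, 3, 1)  # sample date for illustration
print(expiry_alerts_due(date(2026, 3, 21), today, SSL_THRESHOLDS))    # [30]  (20 days left)
print(expiry_alerts_due(date(2026, 6, 1), today, DOMAIN_THRESHOLDS))  # []    (92 days left)
```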


Phase 2: On-Call Structure

Define who is on call

The minimum viable on-call structure:

  • Primary on-call: one engineer responsible for responding to production alerts this week
  • Escalation path: who to call if the primary does not respond within 10 minutes
  • Written rotation schedule: not a mental model, an actual doc or calendar entry

For teams under 5 engineers: a simple rotating weekly schedule is enough. One person per week. The same person handles both day and night alerts for that week.

For teams of 5 or more engineers: two-person primary/secondary rotation. Primary responds first; secondary gets paged if primary does not acknowledge within 10 minutes.

Escalation policy setup

Configure in your monitoring tool:

  • Primary contact notified immediately on confirmed failure
  • Secondary contact paged after 10 minutes of no acknowledgment
  • Team lead or founder paged after 20 minutes of no acknowledgment on a P1
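Expressed as logic, the escalation policy above looks roughly like this. It is a sketch of the rule only; real monitoring tools configure this through their own settings, and the contact labels here are placeholders. The top severity is written as "P1" to match the severity table later in this post.

```python
def contacts_to_page(minutes_unacknowledged: int, severity: str) -> list[str]:
    """Return everyone who should have been paged by now for an unacknowledged alert."""
    paged = ["primary"]  # notified immediately on confirmed failure
    if minutes_unacknowledged >= 10:
        paged.append("secondary")
    if minutes_unacknowledged >= 20 and severity == "P1":
        paged.append("team_lead_or_founder")  # top-severity incidents only
    return paged


print(contacts_to_page(0, "P2"))   # ['primary']
print(contacts_to_page(12, "P2"))  # ['primary', 'secondary']
print(contacts_to_page(25, "P1"))  # ['primary', 'secondary', 'team_lead_or_founder']
```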

On-call sustainability check

On-call that burns people out stops working. Track:

  • Alerts per week per on-call engineer (healthy: under 5; review alert tuning if consistently over 15)
  • After-hours pages per week (more than 2-3 per week signals a reliability problem, not an on-call problem)
  • False positive rate (every false positive erodes trust in the alert system)
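If your alerting tool can export a week of alerts, these three numbers are easy to compute. The sketch below assumes a hypothetical `Alert` record with two flags that someone fills in during triage. The cutoffs follow the guidance above, with "more than 2-3" after-hours pages read as "more than 3", which is an interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    after_hours: bool     # fired outside working hours
    false_positive: bool  # nothing was actually wrong


def weekly_oncall_report(alerts: list[Alert]) -> dict:
    """Summarize one on-call engineer's week of alerts."""
    total = len(alerts)
    after_hours = sum(a.after_hours for a in alerts)
    false_positives = sum(a.false_positive for a in alerts)
    return {
        "alerts": total,
        "after_hours_pages": after_hours,
        "false_positive_rate": false_positives / total if total else 0.0,
        "over_15_this_week": total > 15,         # review tuning if this repeats
        "reliability_concern": after_hours > 3,  # assumed cutoff, see lead-in
    }


week = [Alert(after_hours=False, false_positive=False)] * 3 + [
    Alert(after_hours=True, false_positive=True)
]
report = weekly_oncall_report(week)
print(report["alerts"], report["after_hours_pages"], report["false_positive_rate"])  # 4 1 0.25
```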

Phase 3: Severity Classification

Write this down before the first major incident. Teams that debate severity during an active incident waste the first 5-10 minutes on classification instead of response.

Minimum viable severity definitions

| Severity | Criteria | Response | Communication |
| --- | --- | --- | --- |
| P1 | Full outage. All users affected. Data loss risk. | Immediate, wake up if needed | Status page within 5 min. Customer email post-resolution. |
| P2 | Partial outage or significant degradation. Major feature unavailable. | Within 15 minutes | Status page within 15 min. |
| P3 | Minor issue. Small subset of users or edge case. | Within 2 hours | Internal tracking only. |

  • Severity definitions written and accessible to every engineer (not in someone's head)
  • Link to the severity doc in your #incidents channel description
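One way to take the debate out of classification is to turn the definitions into a few yes/no questions. The sketch below is one possible reading: it treats any P1 criterion (full outage or data loss risk) as enough for P1, which is an assumption, and the parameter names are made up for the example.

```python
def classify_severity(full_outage: bool, data_loss_risk: bool,
                      major_feature_down: bool) -> str:
    """Map three yes/no answers to a severity level."""
    if full_outage or data_loss_risk:
        return "P1"
    if major_feature_down:
        return "P2"
    return "P3"


# Response targets copied from the severity definitions above
RESPONSE_TARGET = {
    "P1": "immediate, wake up if needed",
    "P2": "within 15 minutes",
    "P3": "within 2 hours",
}

severity = classify_severity(full_outage=False, data_loss_risk=False,
                             major_feature_down=True)
print(severity, "->", RESPONSE_TARGET[severity])  # P2 -> within 15 minutes
```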

Phase 4: Status Page

A status page is not optional for a product with paying customers. It is the single highest-ROI communication investment in incident response.

Without a status page, every outage generates support tickets from customers who have no other way to check status. With a status page, ticket volume during an incident can drop by 60-80%.

Status page checklist

  • Public status page URL exists and is accessible without login
  • URL listed in your app's footer, docs, and support articles
  • Every production service or component listed as a page component
  • Alert contacts configured so the status page can be updated from your phone during an incident
  • Subscriber notifications enabled (customers can subscribe for email/SMS alerts)

Status page update habit

The status page only works if you use it. Build the habit during the first P1:

  • "Investigating" post within 5 minutes of incident declaration
  • Updates every 20 minutes until resolved
  • "Resolved" post with duration and one-sentence cause
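The cadence above is easy to lose track of mid-incident, so it can help to have a bot or a timer compute the next deadline. A minimal sketch, with hypothetical function and variable names and sample timestamps:

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_POST_DEADLINE = timedelta(minutes=5)  # "Investigating" post
UPDATE_INTERVAL = timedelta(minutes=20)     # follow-up updates


def next_update_due(declared_at: datetime, last_post_at: Optional[datetime]) -> datetime:
    """Return when the next status page post is due."""
    if last_post_at is None:
        return declared_at + FIRST_POST_DEADLINE
    return last_post_at + UPDATE_INTERVAL


declared = datetime(2026, 2, 22, 1, 0)  # sample incident declared at 01:00
print(next_update_due(declared, None))                          # 2026-02-22 01:05:00
print(next_update_due(declared, datetime(2026, 2, 22, 1, 4)))   # 2026-02-22 01:24:00
```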

Phase 5: Runbook for Common Failures

A runbook is a documented response procedure for a specific type of incident. It answers: when this alert fires, what do I check first?

You do not need runbooks for everything. Write them for:

  • Runbook for "service down / 5xx errors" (check deployment history, server health, DB connections)
  • Runbook for "high error rate but service up" (check DB, check third-party dependencies, check recent deploy)
  • Runbook for "SSL certificate expiry alert" (renewal steps, who has cert access)
  • Runbook for "cron job missed heartbeat" (check job logs, check server it runs on, manual trigger steps)

Runbook minimum structure:

## Alert: [Alert name]

### Immediate checks (do these first)
1. [Check 1 with command or URL]
2. [Check 2 with command or URL]
3. [Check 3 with command or URL]

### Common causes and fixes
- Cause A → Fix A
- Cause B → Fix B

### Escalate to: [name/role] if not resolved in 30 minutes

A runbook does not need to be long. Three checks and two common causes is enough to start.


Phase 6: Postmortem Habit

The postmortem is what separates teams that have the same incidents on repeat from teams that improve.

Write a postmortem after every P1 and every P2 that lasted over 30 minutes. Write it within 48 hours of resolution.
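Written as a rule, the trigger and the deadline look like this. The function names and the sample timestamp are illustrative only.

```python
from datetime import datetime, timedelta


def needs_postmortem(severity: str, duration: timedelta) -> bool:
    """Every P1, plus any P2 that lasted over 30 minutes."""
    if severity == "P1":
        return True
    return severity == "P2" and duration > timedelta(minutes=30)


def postmortem_due_by(resolved_at: datetime) -> datetime:
    """Postmortems are due within 48 hours of resolution."""
    return resolved_at + timedelta(hours=48)


print(needs_postmortem("P2", timedelta(minutes=45)))    # True
print(needs_postmortem("P2", timedelta(minutes=20)))    # False
print(postmortem_due_by(datetime(2026, 2, 22, 3, 30)))  # 2026-02-24 03:30:00
```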

Postmortem minimum structure

  • What happened: 3-5 sentences, plain language
  • Timeline: chronological log from first symptom to resolution
  • Root cause: why it happened (use 5 Whys to find the actual cause, not the symptom)
  • Contributing factors: what made it worse or harder to detect
  • Action items: specific, assigned, with due dates - not "improve monitoring" but "add SSL expiry alert to Vantaj by July 5, @alice"

Action item tracking

Postmortems that do not produce completed action items are just documentation.

  • Action items in a shared doc or project management tool
  • Owner and due date assigned to each item
  • Review action item completion at the next postmortem
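If action items live in a spreadsheet or tracker export, the review at the next postmortem can start from a list of what is open and past due. A small sketch under that assumption; the `ActionItem` fields, the second item, and the year on the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]


items = [
    ActionItem("Add SSL expiry alert", "@alice", date(2026, 7, 5)),
    ActionItem("Write 'service down' runbook", "@bob", date(2026, 7, 1), done=True),
]
for item in overdue(items, today=date(2026, 7, 10)):
    print(item.owner, item.description)  # @alice Add SSL expiry alert
```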

The Full Checklist (Summary)

Detection

  • HTTP uptime monitoring on all production endpoints at 1-minute intervals
  • SSL expiry monitoring with 30-day and 14-day alerts
  • Domain expiry monitoring with 60-day and 30-day alerts
  • Heartbeat monitoring for all cron jobs and background workers
  • Third-party vendor monitoring for key dependencies
  • Alert channels: Slack + email (group address) + SMS for P1

On-Call

  • Written rotation schedule
  • Escalation path configured in monitoring tool
  • False positive rate below 10% (if higher, tune alert thresholds)

Classification

  • P1/P2/P3 definitions written and accessible

Status Page

  • Public status page exists
  • URL in app footer and docs
  • All production services listed as components

Runbooks

  • "Service down" runbook
  • "High error rate" runbook
  • SSL expiry runbook
  • Cron job failure runbook

Postmortems

  • Written after every P1 and extended P2
  • Timeline, root cause, and action items in every postmortem
  • Action items tracked with owners and due dates

What to Build First

If you are starting from zero, build in this order:

  1. Uptime monitoring with alerting (takes 5 minutes with Vantaj; everything else depends on this)
  2. Status page (takes 3 minutes; saves hours of support ticket triage per incident)
  3. On-call rotation (a simple weekly schedule is enough to start)
  4. Severity definitions (one doc, one page)
  5. "Service down" runbook (covers most first-response scenarios)
  6. Postmortem after your first P1 (builds the habit)

Add runbooks, automation, and process depth after the first few incidents surface what your team actually needs. Do not design a process for incidents you have not had yet.