The Anatomy of a Real Incident Detection System

The Gap Between "Check Failed" and "Engineer Paged"

When people think about monitoring, they imagine two states: up and down. The check passes or it doesn't. If it doesn't, someone gets a notification.

In reality, there's an entire pipeline between a failed check and a human being alerted, and most of the engineering quality lives in that pipeline. Every step is a decision point where false positives can leak through or real incidents can be missed.

This post walks through that pipeline end to end, as a sequence of concrete engineering decisions that determine whether your team gets paged at 3 AM for nothing or paged only when something is genuinely broken.

Step 1: The Check Runs

Everything starts with a probe sending a request. This seems trivial, but the details matter.

What gets sent

The probe constructs an HTTP request (or TCP connection, or DNS query, or ICMP ping, depending on the monitor type) and sends it to the target. For HTTP monitors, the request includes method, headers, body, and follow-redirect behavior, all configurable per monitor.

The probe starts a timer. Multiple timers, actually:

DNS resolution start/end
TCP connection start/end
TLS handshake start/end
Time to first byte
Full response received

Each timer produces a metric. The sum is "response time," but the components are more useful for diagnosis.
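To make the "sum versus components" point concrete, here is a minimal Python sketch of a per-check timing record. The class name, the field names, and the choice to store each phase as a duration in milliseconds are assumptions made for illustration, not a description of any particular probe's schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PhaseTimings:
    """Durations of each phase of one HTTP check, in milliseconds (hypothetical schema)."""
    dns_ms: float
    tcp_connect_ms: float
    tls_handshake_ms: float
    first_byte_ms: float  # request sent until the first response byte arrives
    download_ms: float    # first byte until the full response is received

    @property
    def total_ms(self) -> float:
        # "Response time" is the sum of the individual phases.
        return sum(asdict(self).values())

    def slowest_phase(self) -> str:
        # The largest component is the natural first place to look.
        phases = asdict(self)
        return max(phases, key=phases.get)


timings = PhaseTimings(dns_ms=12.0, tcp_connect_ms=20.0, tls_handshake_ms=45.0,
                       first_byte_ms=300.0, download_ms=8.0)
print(timings.total_ms, timings.slowest_phase())
```

In this made-up sample the total is 385 ms, and most of it is spent waiting for the first byte, which points at the server rather than at DNS or the network.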

What gets recorded

The probe captures the full result:

HTTP status code
Response headers (selected set, not all: storing every header for every check at 30-second intervals would be expensive and mostly useless)
Response body (first N bytes, for keyword validation)
TLS certificate details (expiry date, issuer, chain validity)
Each timing phase
Probe ID, region, and timestamp

This result object is the raw material for everything that follows. If this data is lossy or incomplete, every downstream decision is compromised.

Step 2: Result Classification

The raw result gets classified into one of several states. This is the first decision point, and it's where most monitoring tools are too simplistic.

The simple model (what most tools do)

HTTP status 2xx → UP
Everything else → DOWN

This is wrong for a surprising number of cases. A 301 redirect might be intentional. A 503 from a load balancer during a deploy might last 2 seconds. A 200 response with an error message in the body is not "up."

A better model

HTTP 2xx + body validation passes  → UP
HTTP 2xx + body validation fails   → CONTENT_MISMATCH (degraded)
HTTP 3xx                           → REDIRECT (depends on config)
HTTP 4xx                           → CLIENT_ERROR (may or may not be a problem)
HTTP 5xx                           → SERVER_ERROR (likely a problem)
Timeout                            → TIMEOUT (may be network, may be server)
Connection refused                 → CONNECTION_REFUSED (server or firewall)
DNS resolution failed              → DNS_FAILURE (may be resolver, may be config)
TLS error                          → TLS_FAILURE (certificate or handshake issue)

Each classification carries different severity and requires different handling downstream. A DNS failure is a fundamentally different problem than a 500 error, even though both mean "the user can't reach the service."
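The mapping above is small enough to sketch directly. The Python below is one possible shape, under a few assumptions: the probe reports transport failures as a short label (`"timeout"`, `"dns"`, and so on, which are hypothetical names), and body validation is a simple keyword check.

```python
from enum import Enum
from typing import Optional


class CheckState(Enum):
    UP = "UP"
    CONTENT_MISMATCH = "CONTENT_MISMATCH"
    REDIRECT = "REDIRECT"
    CLIENT_ERROR = "CLIENT_ERROR"
    SERVER_ERROR = "SERVER_ERROR"
    TIMEOUT = "TIMEOUT"
    CONNECTION_REFUSED = "CONNECTION_REFUSED"
    DNS_FAILURE = "DNS_FAILURE"
    TLS_FAILURE = "TLS_FAILURE"


# Hypothetical labels a probe might attach when no HTTP response was received.
_TRANSPORT_ERRORS = {
    "timeout": CheckState.TIMEOUT,
    "connection_refused": CheckState.CONNECTION_REFUSED,
    "dns": CheckState.DNS_FAILURE,
    "tls": CheckState.TLS_FAILURE,
}


def classify(status: Optional[int], body: str = "", keyword: Optional[str] = None,
             transport_error: Optional[str] = None) -> CheckState:
    """Map one raw check result to a classification."""
    if transport_error is not None:
        return _TRANSPORT_ERRORS[transport_error]
    if status is None:
        raise ValueError("need either a status code or a transport error")
    if 200 <= status < 300:
        # A 2xx only counts as UP if the body also passes validation.
        if keyword is None or keyword in body:
            return CheckState.UP
        return CheckState.CONTENT_MISMATCH
    if 300 <= status < 400:
        return CheckState.REDIRECT
    if 400 <= status < 500:
        return CheckState.CLIENT_ERROR
    if 500 <= status < 600:
        return CheckState.SERVER_ERROR
    raise ValueError(f"unexpected HTTP status: {status}")
```

Whether REDIRECT or CLIENT_ERROR counts as a failure is deliberately left out of the classifier. That decision belongs to per-monitor configuration further down the pipeline.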

Why this matters

If your monitoring tool only knows "up" and "down," it can't make nuanced decisions later in the pipeline. A timeout caused by a slow probe (not a slow server) gets the same treatment as a connection refused caused by a crashed service. The alerting system can't distinguish them because the classification system didn't.

Rich classification at the check level enables better decisions at the alerting level. Strip it to up/down and your alerting will be just as blunt.

Step 3: Consensus Verification

This is the step that most monitoring tools skip entirely. In those tools, a single failed check triggers a single alert. No verification. No second opinion.

In a consensus-based system, a failed check triggers a verification round:

The verification protocol

1. Primary probe reports a failure. Probe A in Frankfurt checks api.example.com and gets a timeout.
2. Coordination layer receives the failure. Instead of immediately creating an alert, the coordination layer dispatches verification requests to probes in other regions.
3. Verification probes check independently. Probe B in Virginia and Probe C in Singapore each send their own request to api.example.com, using their own DNS resolution, their own TLS handshake, their own network path.
4. Results are collected within a time window. The coordination layer waits for all verification results (with a timeout: if a verification probe doesn't respond in time, its vote is discarded, not counted as a failure).
5. Consensus is computed. If the majority of probes report failure, the check is recorded as a consensus-confirmed failure and handed to the state machine in Step 4. If only the original probe reports failure, the event is classified as a PATH_ISSUE and no alert is generated.

Edge cases in consensus

Two out of three probes fail. This is ambiguous. Is it a regional outage affecting two probes, or is the target genuinely down and one probe got lucky? We treat 2/3 as a confirmed failure, because the probability of two independent network paths failing simultaneously while the target is healthy is very low. But we log the dissenting probe's result for investigation.

All probes fail, but with different error types. Probe A gets a timeout, Probe B gets a connection refused, Probe C gets a DNS failure. The target is down, but the inconsistent error types suggest multiple failure modes, possibly a DNS change propagating at different speeds across resolver caches. The alert includes the per-probe error details so the engineer can diagnose the root cause.

Probe verification itself fails. The coordination layer sends a verification request to Probe B, but Probe B doesn't respond. This could mean Probe B is down, Probe B's network is degraded, or the coordination message was lost. The system falls back to the remaining probes. If the remaining probes can't form a majority, the check is retried after one interval rather than generating an alert with insufficient data.
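The voting rule, including the discarded-vote and no-majority cases, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production logic: the `INCONCLUSIVE` label, the `min_votes` quorum of two, and treating a tie as "no majority" are all choices made for this example.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    CONFIRMED_FAILURE = "CONFIRMED_FAILURE"
    PATH_ISSUE = "PATH_ISSUE"
    INCONCLUSIVE = "INCONCLUSIVE"  # hypothetical label: retry after one interval


def compute_consensus(verification_votes: Sequence[Optional[bool]],
                      min_votes: int = 2) -> Verdict:
    """Combine the primary probe's failure with the verification results.

    Each vote is True (probe also saw a failure), False (probe saw success),
    or None (probe did not answer in time, so its vote is discarded).
    """
    # The primary probe's failure is always the first counted vote.
    counted = [True] + [v for v in verification_votes if v is not None]
    failures = sum(counted)
    successes = len(counted) - failures
    if len(counted) < min_votes:
        return Verdict.INCONCLUSIVE  # not enough data to decide
    if failures > successes:
        return Verdict.CONFIRMED_FAILURE
    if successes > failures:
        return Verdict.PATH_ISSUE
    return Verdict.INCONCLUSIVE  # tie: no majority either way


print(compute_consensus([True, False]))   # 2 of 3 probes fail
print(compute_consensus([False, False]))  # only the primary fails
print(compute_consensus([None, None]))    # no verification probe answered
```

Note that a silent probe shrinks the electorate rather than counting as a failure, so one failing and one passing probe end in a retry, not an alert.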

Step 4: State Machine Transition

The consensus result feeds into a per-monitor state machine. This is where transient failures are filtered from sustained outages.

The states

HEALTHY     → The target has been consistently reachable
SUSPICIOUS  → One consensus-confirmed failure, not yet confirmed as sustained
DEGRADED    → Target is reachable but slow or returning unexpected content
DOWN        → Target is confirmed unreachable from multiple regions
RECOVERING  → Target was down and has started responding again

Transition rules

HEALTHY → SUSPICIOUS: A single consensus-confirmed failure moves the monitor to SUSPICIOUS. No alert is generated yet. The next check will determine if this is a blip or the start of an outage.

SUSPICIOUS → DOWN: A second consecutive consensus-confirmed failure moves the monitor to DOWN. An alert is generated. This two-failure requirement adds one check interval of delay (30 seconds to 1 minute) but filters out the vast majority of transient failures.

SUSPICIOUS → HEALTHY: If the check following a SUSPICIOUS result passes, the monitor returns to HEALTHY. No alert was generated. No human was disturbed. The system absorbed a transient failure silently.

DOWN → RECOVERING: A successful check while in DOWN state moves the monitor to RECOVERING. No recovery alert yet; we want to confirm the recovery is sustained.

RECOVERING → HEALTHY: A second consecutive successful check confirms recovery. A recovery notification is sent. The incident is closed.

RECOVERING → DOWN: If the check fails again during RECOVERING, the monitor returns to DOWN. No new alert is generated (the existing incident is still open). This prevents the notification storm that happens with flapping services: down, up, down, up, each transition generating a notification.

Why a state machine, not just thresholds

Threshold-based alerting ("alert after 3 failures") doesn't capture the temporal dynamics of real outages. A state machine models the actual lifecycle of an incident: detection, confirmation, duration, recovery, and recovery confirmation. Each transition has specific behavior attached to it, and the model handles edge cases (flapping, partial recovery, intermittent failures) without special-case code.
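The transition rules above fit in a lookup table. The Python sketch below encodes them as (state, check passed) pairs. It leaves out DEGRADED because no transitions for it are listed here, and the notification labels `"alert"` and `"recovery"` are placeholders chosen for the example.

```python
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class State(Enum):
    HEALTHY = "HEALTHY"
    SUSPICIOUS = "SUSPICIOUS"
    DOWN = "DOWN"
    RECOVERING = "RECOVERING"


# (current state, check passed?) -> (next state, notification to send or None)
TRANSITIONS = {
    (State.HEALTHY, True): (State.HEALTHY, None),
    (State.HEALTHY, False): (State.SUSPICIOUS, None),
    (State.SUSPICIOUS, True): (State.HEALTHY, None),
    (State.SUSPICIOUS, False): (State.DOWN, "alert"),
    (State.DOWN, True): (State.RECOVERING, None),
    (State.DOWN, False): (State.DOWN, None),
    (State.RECOVERING, True): (State.HEALTHY, "recovery"),
    (State.RECOVERING, False): (State.DOWN, None),  # incident still open, no new alert
}


def step(state: State, passed: bool) -> Tuple[State, Optional[str]]:
    """Apply one consensus result to the monitor's current state."""
    return TRANSITIONS[(state, passed)]


def run(results: Iterable[bool], state: State = State.HEALTHY) -> Tuple[State, List[str]]:
    """Feed a sequence of check results through the machine, collecting notifications."""
    notifications = []
    for passed in results:
        state, note = step(state, passed)
        if note is not None:
            notifications.append(note)
    return state, notifications


# A flapping service: down, up, down, up, down, then a sustained recovery.
print(run([False, False, True, False, True, False, True, True]))
```

The flapping sequence produces exactly one alert and one recovery notification, because the intermediate DOWN to RECOVERING to DOWN hops carry no notification.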

Step 5: Incident Creation

When the state machine transitions to DOWN, an incident is created. An incident is not the same as an alert. It's a container for the entire lifecycle of an outage.

What an incident captures

Start time: The timestamp of the first failed check that eventually led to confirmed downtime (the actual start of the failure, not the moment it was confirmed)
Affected monitor(s): Which endpoints are impacted
Error details: Per-probe error types, response codes, timing data
Region breakdown: Which probes see the failure and which don't
Timeline: Every state transition, every check result, every action taken

Deduplication

If multiple monitors for the same infrastructure go down simultaneously, they likely share a root cause. A payment API and a user API on the same server cluster going down at the same moment should be one incident, not two.

Incident deduplication is imperfect: you can't always know that two monitors share infrastructure. But temporal correlation (multiple monitors entering DOWN state within the same 60-second window) is a strong signal that they're related. We group them into a single incident with multiple affected monitors, which means one alert to the team instead of five.
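Temporal correlation is straightforward to sketch. The Python below groups monitors whose DOWN transitions fall within 60 seconds of the first one in the group. Anchoring the window to the first event (rather than sliding it with each new event) is an assumption made for this example, as are the monitor names.

```python
from typing import List, Sequence, Tuple


def group_by_time_window(down_events: Sequence[Tuple[str, float]],
                         window_s: float = 60.0) -> List[List[str]]:
    """Group (monitor name, DOWN timestamp in seconds) pairs into incidents.

    A monitor joins the current incident if it went DOWN within window_s
    of that incident's first monitor; otherwise it opens a new incident.
    """
    incidents: List[List[str]] = []
    window_start = None
    for name, ts in sorted(down_events, key=lambda event: event[1]):
        if window_start is None or ts - window_start > window_s:
            incidents.append([name])  # open a new incident
            window_start = ts
        else:
            incidents[-1].append(name)  # same window: likely the same root cause
    return incidents


events = [("payments-api", 100.0), ("users-api", 112.0),
          ("search-api", 130.0), ("marketing-site", 400.0)]
print(group_by_time_window(events))
```

The three API monitors land in one incident and the marketing site, five minutes later, gets its own.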

Step 6: Alert Routing

An incident has been created. Now: who gets told, how, and when?

Alert policies

Different monitors have different criticality. A production payment endpoint going down should trigger a phone call at 3 AM. A staging environment going down should send a Slack message during business hours.

Alert policies map monitors to notification channels and escalation rules:

Production API    → Immediate: SMS + Phone + Slack
                  → 5 min no ack: Escalate to engineering lead
                  → 15 min no ack: Escalate to CTO

Staging           → Slack only, no escalation

Marketing site    → Email + Slack, escalate after 30 min
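As data, the three policies above might look like the Python sketch below. The class names, the monitor group keys, and the lookup helper are hypothetical. The marketing-site escalation target is left as `None` because the example does not name one.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Escalation:
    after_minutes: int     # minutes without acknowledgment
    target: Optional[str]  # None where the example above names no target


@dataclass(frozen=True)
class AlertPolicy:
    channels: Tuple[str, ...]
    escalations: Tuple[Escalation, ...] = ()


# Hypothetical monitor group names mapped to the policies listed above.
POLICIES: Dict[str, AlertPolicy] = {
    "production-api": AlertPolicy(
        channels=("sms", "phone", "slack"),
        escalations=(Escalation(5, "engineering lead"), Escalation(15, "CTO")),
    ),
    "staging": AlertPolicy(channels=("slack",)),
    "marketing-site": AlertPolicy(
        channels=("email", "slack"),
        escalations=(Escalation(30, None),),
    ),
}


def policy_for(monitor_group: str) -> AlertPolicy:
    """Look up the policy for a monitor group."""
    # Fail loudly on an unmapped monitor: a silent default could drop a page.
    try:
        return POLICIES[monitor_group]
    except KeyError:
        raise LookupError(f"no alert policy configured for {monitor_group!r}") from None
```

Raising on an unknown group is a design choice for the sketch. The alternative, a quiet default policy, risks routing a production outage to a low-priority channel.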

Channel selection

Each notification channel has different characteristics:

Channel          Latency      Intrusiveness               Reliability
Phone call       5–15 sec     Very high                   High (carrier network)
SMS              5–30 sec     High                        High
Slack/Discord    1–5 sec      Medium                      Dependent on Slack's uptime
Email            10–60 sec    Low                         High but slow
Webhook          <1 sec       N/A (machine-to-machine)    Dependent on receiver

For critical incidents, you want high-intrusiveness channels (phone, SMS) because the goal is to interrupt whatever the engineer is doing. For informational alerts, you want lower-intrusiveness channels (email, Slack) because the goal is to inform without disrupting.

The meta-problem: notification service reliability

Your monitoring tool needs to send an SMS. The SMS provider is having an outage. Your engineer doesn't get paged. The incident goes unnoticed.

This is a real failure mode. We mitigate it by using multiple providers per channel (primary + fallback for SMS, primary + fallback for email) and by sending to multiple channels simultaneously for critical alerts. If SMS fails, the phone call still goes through. If Slack is down, the email still arrives.

The worst possible outcome for a monitoring system is: it detected the problem correctly, created the incident correctly, decided to alert correctly, and then failed to deliver the alert. Every notification must be treated as a critical path operation with its own redundancy.
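The two layers of redundancy (fallback providers within a channel, multiple channels per alert) can be sketched as follows. This is an illustrative Python outline only: a "provider" here is just any callable that raises on delivery failure, standing in for a real SMS or email client.

```python
from typing import Callable, Dict, Sequence

# Hypothetical provider interface: takes the message, raises if delivery fails.
Provider = Callable[[str], None]


def send_with_fallback(message: str, providers: Sequence[Provider]) -> bool:
    """Try each provider for one channel in order; stop at the first success."""
    for provider in providers:
        try:
            provider(message)
            return True
        except Exception:
            # Any provider error means "try the next one"; a real system
            # would also log the failure for later investigation.
            continue
    return False


def notify_all_channels(message: str,
                        channels: Dict[str, Sequence[Provider]]) -> Dict[str, bool]:
    """Send to every channel, each with its own fallback chain."""
    # One channel failing completely must not stop the others.
    return {name: send_with_fallback(message, providers)
            for name, providers in channels.items()}
```

The returned per-channel map matters as much as the sending: if every channel reports `False`, that is itself an event worth escalating.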

Step 7: Escalation

An alert has been sent. The clock starts. If nobody acknowledges the alert within the escalation window, the system escalates.

Acknowledgment

Acknowledgment means: a human has seen this alert and is investigating. It doesn't mean the problem is fixed. It means someone is on it. The purpose is to stop the escalation chain.

Without acknowledgment, the system assumes the alert wasn't received (phone was silenced, engineer is asleep, SMS didn't deliver). It escalates to the next person in the rotation.

Escalation logic

T+0:   Alert to on-call primary
T+5m:  No ack → Alert to on-call secondary
T+15m: No ack → Alert to engineering manager
T+30m: No ack → Alert to all engineers (broadcast)

The timeframes and escalation targets are configurable per alert policy. The principle: if nobody responds within a reasonable window, widen the net until someone does.
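A small Python sketch of the schedule above, answering "who has been alerted so far?" for a given elapsed time. The function name and the rule that an acknowledgment cancels any step scheduled at or after the ack time are assumptions for this example.

```python
from typing import List, Optional, Sequence, Tuple

# (minutes after the first alert, who gets alerted), taken from the schedule above.
ESCALATION_CHAIN: Tuple[Tuple[int, str], ...] = (
    (0, "on-call primary"),
    (5, "on-call secondary"),
    (15, "engineering manager"),
    (30, "all engineers"),
)


def alerted_targets(elapsed_min: float, acked_at_min: Optional[float] = None,
                    chain: Sequence[Tuple[int, str]] = ESCALATION_CHAIN) -> List[str]:
    """Return everyone alerted so far, given minutes elapsed and an optional ack time.

    Assumes the chain is sorted by delay.
    """
    targets = []
    for delay, target in chain:
        if delay > elapsed_min:
            break  # this step and all later ones are not due yet
        if acked_at_min is not None and delay >= acked_at_min:
            break  # an acknowledgment stops the chain
        targets.append(target)
    return targets


print(alerted_targets(20))                   # nobody has acknowledged
print(alerted_targets(40, acked_at_min=7))   # secondary acknowledged at T+7m
```

With an acknowledgment at T+7m, the engineering manager and the broadcast are never reached, no matter how long the incident lasts.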

What makes escalation hard

The hard part isn't the escalation logic; it's the on-call rotation data. Who is on-call right now? What's their phone number? Did they swap shifts with someone last Tuesday? Are they in a timezone where it's 3 AM or 3 PM?

On-call data is some of the most operationally critical data in your entire system, and it changes constantly. A stale on-call rotation means alerts go to the wrong person. The wrong person either doesn't respond (because they think someone else is on-call) or does respond but doesn't have the context to act effectively.

Step 8: Recovery and Postmortem

The engineer fixes the problem. The next check passes. The check after that passes too. The state machine transitions from RECOVERING to HEALTHY. A recovery notification is sent. The incident is closed.

Recovery notification

Recovery messages should include:

Total downtime duration
Root cause category (if determinable from check data)
Link to the full incident timeline

What recovery messages should not do: generate the same level of intrusiveness as the initial alert. If the initial alert was a phone call, the recovery should be a Slack message or email. The engineer already knows they fixed it. They don't need a phone call confirming what they just did.

Incident data for postmortems

A well-instrumented detection pipeline provides the raw material for postmortem analysis:

Exact start time: Down to the second, not "approximately 2 PM"
Detection time: How long between the outage starting and the first alert
Acknowledgment time: How long until a human saw the alert
Resolution time: How long from acknowledgment to recovery
Affected scope: Which monitors, which regions, which users
Check-by-check timeline: Every result, from every probe, during the incident window

This data transforms postmortems from "we think it was down for about 20 minutes" to "the outage started at 14:03:17, was detected at 14:03:47, acknowledged at 14:05:12, and resolved at 14:18:33. Here's every check result during that window."
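Given those four timestamps, the postmortem metrics listed above are simple subtractions. A minimal Python sketch, where the function and key names are illustrative and the `at` helper only handles times within a single day:

```python
from datetime import datetime, timedelta
from typing import Dict


def at(clock: str) -> datetime:
    # Parse HH:MM:SS. The date defaults to 1900-01-01 and cancels out in subtraction,
    # so this helper is only valid for incidents that do not cross midnight.
    return datetime.strptime(clock, "%H:%M:%S")


def incident_metrics(started: datetime, detected: datetime,
                     acknowledged: datetime, resolved: datetime) -> Dict[str, timedelta]:
    """Derive the postmortem durations from the four incident timestamps."""
    return {
        "detection_time": detected - started,
        "acknowledgment_time": acknowledged - detected,
        "resolution_time": resolved - acknowledged,
        "total_downtime": resolved - started,
    }


metrics = incident_metrics(at("14:03:17"), at("14:03:47"), at("14:05:12"), at("14:18:33"))
for name, duration in metrics.items():
    print(f"{name}: {duration}")
```

For the example timeline, detection took 30 seconds, acknowledgment 1 minute 25 seconds, and resolution 13 minutes 21 seconds, for a total downtime of 15 minutes 16 seconds.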

Precision in postmortem data drives precision in improvement. Vague data produces vague action items. Specific data produces specific fixes.

The Full Pipeline

Putting it all together:

Check runs (30s interval, from probe region)
    ↓
Result classified (UP / TIMEOUT / SERVER_ERROR / DNS_FAILURE / TLS_FAILURE / ...)
    ↓
Consensus verification (re-check from 2+ additional regions)
    ↓
State machine transition (HEALTHY → SUSPICIOUS → DOWN)
    ↓
Incident created (with deduplication)
    ↓
Alert routed (per policy: channel + escalation)
    ↓
Escalation (if no acknowledgment within window)
    ↓
Recovery confirmed (2 consecutive passes)
    ↓
Incident closed (timeline preserved for postmortem)

Nine stages: the eight steps above, with Step 8 split into recovery confirmation and incident closure. Each one is where monitoring tools either get it right or get it wrong.

Most monitoring tools implement stages 1, 2, and 6: run a check, classify as up or down, send an alert. Stages 3 through 5 and 7 through 9 are where the engineering investment determines whether you get woken up for nothing or woken up for something that matters.

Why This Matters

The pipeline described above isn't theoretical. It's what runs in production at Vantaj, processing millions of checks per day.

We didn't build all of it on day one. The early versions were simpler, and noisier. Each step was added because a specific class of bad alerts kept leaking through, or a specific class of real incidents kept being missed.

Consensus was added because single-region false positives were eroding trust. The state machine was added because transient blips were generating alerts that resolved before anyone could investigate. Incident deduplication was added because correlated failures were generating five Slack messages instead of one. Recovery confirmation was added because flapping services were sending alternating down/up notifications every minute.

Every step in the pipeline exists because the previous version of the pipeline had a failure mode that the step eliminates.

A real incident detection system is a pipeline of decisions, each one filtering noise from signal. The only question any of it answers: is this worth waking someone up for?