What is Uptime Monitoring? A Complete Guide (2026)

Uptime monitoring checks whether your websites, APIs, and services are available and responding correctly. This guide covers how it works, what to monitor, and what metrics actually matter.

Vantaj Team · June 24, 2026 · 12 min read

Uptime monitoring is the practice of continuously checking whether a website, API, or service is available and responding as expected. When a monitor detects a failure - the service is unreachable, responding with an error, or taking too long - it sends an alert to the responsible team.

The goal is simple: know about problems before your users do, and fix them faster.


How Uptime Monitoring Works

At its core, uptime monitoring is a loop:

  1. A probe (a server in a monitoring datacenter) sends a request to your endpoint
  2. It checks whether the response matches expectations (correct status code, expected content, acceptable response time)
  3. If the check passes: the result is logged and no alert is sent
  4. If the check fails: the system triggers an alert and opens an incident

This loop repeats on a check interval - typically every 30 seconds to 5 minutes depending on your plan and requirements.

What a check actually does

For an HTTP monitor, the probe:

1. Resolves the DNS name of your endpoint
2. Establishes a TCP connection
3. Performs a TLS handshake (for HTTPS)
4. Sends an HTTP GET (or configured method)
5. Receives the response
6. Validates: status code, response time, optional body content
7. Records the result with a timestamp

Each step can fail independently. A DNS resolution failure looks different from a TLS handshake timeout, which looks different from a 500 response from your application - and good monitoring systems log enough detail to distinguish between them.
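The steps above can be sketched with Python's standard library. This is a minimal illustration, not how any particular monitoring product implements its probes: the `CheckResult` fields, the stage names, and the `run_http_check` function are all hypothetical, and only the status code is validated here.

```python
import http.client
import socket
import ssl
import time
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class CheckResult:
    ok: bool
    stage: str         # last stage reached: "dns", "tcp", "tls", "http", "validate"
    detail: str
    elapsed_s: float
    checked_at: float  # Unix timestamp taken when the check started


def run_http_check(url: str, expected_status: int = 200,
                   timeout_s: float = 10.0) -> CheckResult:
    """Run one HTTP(S) check and report the first stage that failed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    use_tls = parts.scheme == "https"
    port = parts.port or (443 if use_tls else 80)
    path = (parts.path or "/") + (f"?{parts.query}" if parts.query else "")
    checked_at = time.time()
    t0 = time.monotonic()

    def result(ok: bool, stage: str, detail: str) -> CheckResult:
        return CheckResult(ok, stage, detail, time.monotonic() - t0, checked_at)

    # Step 1: resolve the DNS name
    try:
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except (socket.gaierror, UnicodeError) as exc:
        return result(False, "dns", str(exc))

    # Step 2: open a TCP connection to the first resolved address
    family, socktype, proto, _, sockaddr = addrinfo[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout_s)
    try:
        sock.connect(sockaddr)
    except OSError as exc:
        sock.close()
        return result(False, "tcp", str(exc))

    # Step 3: TLS handshake (HTTPS only); certificate errors surface here
    if use_tls:
        try:
            sock = ssl.create_default_context().wrap_socket(
                sock, server_hostname=host)
        except (ssl.SSLError, OSError) as exc:
            sock.close()
            return result(False, "tls", str(exc))

    # Steps 4-5: send the request and read the response on the open socket
    conn_cls = (http.client.HTTPSConnection if use_tls
                else http.client.HTTPConnection)
    conn = conn_cls(host, port, timeout=timeout_s)
    conn.sock = sock  # reuse the socket we already connected
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
    except (http.client.HTTPException, OSError) as exc:
        return result(False, "http", str(exc))
    finally:
        conn.close()

    # Step 6: validate the response (status code only in this sketch)
    if response.status != expected_status:
        return result(False, "validate", f"status {response.status}")
    # Step 7: the returned record carries the timestamp and elapsed time
    return result(True, "validate", "ok")
```

Because each stage returns its own label, a refused connection shows up as `stage="tcp"` while an application error shows up as `stage="validate"` with the status code in `detail`.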


Types of Uptime Monitoring

HTTP/HTTPS Monitoring

The most common type. Sends a request to a URL and validates the response.

What you can check:

  • HTTP status code (200, 404, 500, etc.)
  • Response time / latency
  • Response body content (does it contain expected text?)
  • HTTP response headers
  • Redirect chain (does the URL redirect to the right place?)

Best for: Websites, API endpoints, SaaS applications, landing pages
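As a rough sketch of how those checks combine, the function below compares one response against a set of expectations and returns a list of problems. The `Expectation` fields, their default values, and the `validate_response` name are assumptions made for illustration, not a real tool's configuration schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Expectation:
    # All defaults are illustrative assumptions
    status_codes: frozenset = frozenset({200})
    max_response_ms: float = 2000.0
    body_contains: Optional[str] = None
    required_headers: Dict[str, str] = field(default_factory=dict)
    final_url: Optional[str] = None  # where the redirect chain should end


def validate_response(status: int, elapsed_ms: float, body: str,
                      headers: Dict[str, str], final_url: str,
                      exp: Expectation) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if status not in exp.status_codes:
        problems.append(f"unexpected status {status}")
    if elapsed_ms > exp.max_response_ms:
        problems.append(f"slow response: {elapsed_ms:.0f} ms")
    if exp.body_contains is not None and exp.body_contains not in body:
        problems.append(f"body does not contain {exp.body_contains!r}")
    # Header names are case-insensitive, so compare them lowercased
    lowered = {name.lower(): value for name, value in headers.items()}
    for name, value in exp.required_headers.items():
        if lowered.get(name.lower()) != value:
            problems.append(f"header {name} is not {value!r}")
    if exp.final_url is not None and final_url != exp.final_url:
        problems.append(f"redirected to {final_url}")
    return problems
```

Returning every problem rather than stopping at the first makes the resulting alert more useful: a 500 that is also slow and also redirected somewhere unexpected points to a different cause than a 500 alone.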

SSL Certificate Monitoring

Checks whether your SSL/TLS certificate is valid and alerts before it expires.

What you can check:

  • Days until expiry
  • Certificate validity (is it properly signed?)
  • Certificate chain completeness
  • Whether the certificate matches the domain

Why it matters: An expired SSL certificate triggers a full-page security warning in all major browsers, effectively taking the site offline for most users. Even with auto-renewal configured, renewal can fail silently.
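A minimal sketch of the expiry check using Python's `ssl` module. The function names and the 30-day warning threshold are assumptions; `ssl.cert_time_to_seconds` and `getpeercert()` are standard-library calls.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 10.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    The default context verifies the chain and hostname, so an invalid,
    mismatched, or already-expired certificate raises during the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (Unix time, defaults to the current time) and expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def needs_renewal_alert(days_left: float, warn_days: float = 30.0) -> bool:
    """True once the certificate is inside the warning window (assumed 30 days)."""
    return days_left <= warn_days
```

A scheduled job could call `needs_renewal_alert(days_until_expiry(fetch_not_after("example.com")))` once a day and treat a handshake exception as its own, more urgent alert.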

Domain Expiry Monitoring

Checks when your domain registration expires and alerts before the renewal deadline.

Why it matters: Domain registrations expire on a set date. If auto-renewal fails (expired payment method, registrar issue), your domain lapses: it typically stops resolving soon after expiry, and once any grace and redemption periods run out, anyone can register it. The result is total downtime for your entire business.

DNS Record Monitoring

Checks whether your DNS records are correct - particularly A records, CNAME records, MX records, and NS records.

Why it matters: Incorrect or missing DNS records cause service failures that look like server outages but are actually routing problems. DNS monitoring catches misconfigurations from accidental changes or provider issues.
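One way to picture the comparison step: given the records you expect and the records a resolver actually returned, report the drift. How the observed records are fetched is left out, and the `diff_dns_records` name and the `(name, type)` dictionary layout are assumptions for this sketch.

```python
from typing import Dict, List, Tuple

RecordKey = Tuple[str, str]  # (name, record type), e.g. ("example.com", "MX")


def diff_dns_records(expected: Dict[RecordKey, List[str]],
                     observed: Dict[RecordKey, List[str]]) -> Dict[RecordKey, dict]:
    """Return only the monitored records whose observed values differ."""
    drift = {}
    for key, wanted in expected.items():
        want = set(wanted)
        got = set(observed.get(key, []))
        if want != got:
            drift[key] = {
                "missing": sorted(want - got),      # expected but not returned
                "unexpected": sorted(got - want),   # returned but not expected
            }
    return drift
```

Comparing sets rather than lists keeps the check from firing when a resolver simply returns the same records in a different order.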

Heartbeat Monitoring (Cron Job Monitoring)

The inverse of HTTP monitoring. Instead of your monitor reaching out to your service, your service is expected to ping the monitor on a schedule.

If the ping doesn't arrive within the expected window, an alert fires.

What it monitors: Cron jobs, background workers, scheduled tasks, data pipelines, backup scripts

Why it matters: Cron jobs fail silently by default. A backup job that stopped running three weeks ago won't tell you it stopped - until you need the backup. Heartbeat monitoring catches this.
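The monitor side of a heartbeat can be sketched as a timestamp plus a deadline. The `HeartbeatMonitor` class, its parameter names, and the grace period are hypothetical; real services expose this as a ping URL that the job calls when it finishes.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Tracks the last ping from a scheduled job and flags missed runs."""

    def __init__(self, expected_every_s: float, grace_s: float = 0.0,
                 now: Optional[float] = None):
        self.expected_every_s = expected_every_s
        self.grace_s = grace_s  # slack for jobs that start or finish late
        # Until the first ping arrives, measure from when the monitor was created
        self.last_seen_at = time.time() if now is None else now

    def ping(self, now: Optional[float] = None) -> None:
        """Called (directly or via an HTTP endpoint) each time the job runs."""
        self.last_seen_at = time.time() if now is None else now

    def is_overdue(self, now: Optional[float] = None) -> bool:
        """True if no ping arrived within the expected window plus grace."""
        now = time.time() if now is None else now
        return now - self.last_seen_at > self.expected_every_s + self.grace_s
```

For an hourly backup job, `HeartbeatMonitor(expected_every_s=3600, grace_s=300)` would start alerting about 65 minutes after the last successful run.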


Key Uptime Monitoring Metrics

Uptime Percentage

The proportion of time a service was available during a measurement period.

Uptime % = (Total time - Downtime) / Total time × 100

| Uptime % | Downtime per year | Common label |
|---|---|---|
| 99% | 87 hours, 36 minutes | "Two nines" |
| 99.9% | 8 hours, 45 minutes | "Three nines" |
| 99.95% | 4 hours, 22 minutes | - |
| 99.99% | 52 minutes, 33 seconds | "Four nines" |
| 99.999% | 5 minutes, 15 seconds | "Five nines" |

Most commercial SaaS products target 99.9% ("three nines"). Enterprise infrastructure and financial systems often target 99.99% or higher.
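The uptime formula and the yearly downtime figures can be reproduced with a few lines of arithmetic. The function names are illustrative, and a 365-day year is assumed.

```python
def uptime_percent(total_s: float, downtime_s: float) -> float:
    """Uptime % = (total time - downtime) / total time x 100."""
    return (total_s - downtime_s) / total_s * 100


def allowed_downtime_seconds(target_percent: float,
                             period_days: float = 365.0) -> float:
    """Downtime budget, in seconds, for an uptime target over a period."""
    return period_days * 86400 * (1 - target_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Passing `period_days=30` gives the monthly budget instead, which is often the more practical number: 99.9% leaves about 43 minutes per 30-day month.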

Mean Time to Detect (MTTD)

The average time between when a failure begins and when monitoring detects it.

MTTD is directly related to your check interval. If you check every 5 minutes and a failure begins just after a check completes, it can go undetected for up to 5 minutes. With 1-minute checks, the worst case is 1 minute.

Average MTTD ≈ Check interval / 2

For a 1-minute check interval, your average MTTD is about 30 seconds.
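The interval / 2 rule can be checked numerically. The sketch below assumes failures are equally likely to begin at any point between two checks, that the first check after the failure detects it, and that the check itself takes no time; the function name is hypothetical.

```python
def average_detection_delay(check_interval_s: float, samples: int = 1000) -> float:
    """Average wait until the next check, for evenly spaced failure start times."""
    total = 0.0
    for i in range(samples):
        # How long after the previous check the failure begins
        failure_offset = (i + 0.5) / samples * check_interval_s
        # The failure is only seen when the next check runs
        total += check_interval_s - failure_offset
    return total / samples
```

If your tool requires two or three consecutive failed checks before alerting, add those extra intervals on top of this figure.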

Mean Time to Recovery (MTTR)

The average time from when a failure begins to when the service is restored. MTTR includes:

  1. Detection time (covered by MTTD above)
  2. Alert delivery time
  3. Time for an engineer to acknowledge and begin investigating
  4. Time to diagnose the root cause
  5. Time to apply a fix
  6. Time to verify recovery

MTTR is the single most important reliability metric for teams with customers. It's what determines how long customers experience an outage after it begins.
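Both averages fall out of three timestamps per incident. This sketch measures MTTR from the start of the failure so that detection time is included, as in the breakdown above; the `Incident` record and function names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when monitoring first flagged it
    resolved_at: datetime  # when service was verified as restored


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between failure start and detection."""
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average gap between failure start and restored service."""
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)
```

In practice `started_at` is usually estimated after the fact from logs or the last successful check, so treat both numbers as approximations.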

False Positive Rate

The percentage of alerts that turn out to be non-incidents - monitoring artifacts caused by network path issues, probe failures, or transient errors rather than actual service failures.

A high false positive rate erodes trust in monitoring. Teams start ignoring alerts, which means real incidents get missed.

False positives are primarily caused by single-region monitoring: when a check runs from only one location, a routing issue between that probe and your server looks identical to your server being down. Multi-region consensus eliminates most false positives.


Multi-Region Consensus: Why It Matters

Most monitoring tools check your endpoint from a single probe location. If that probe can't reach your server, it fires an alert.

The problem: the internet is not a single reliable network. A request from a monitoring probe in Frankfurt to a server in Virginia passes through multiple networks, transit providers, internet exchange points, and submarine cables - any of which can have transient failures unrelated to your infrastructure.

Single-region monitoring false positive rate (simplified):

  • If the network path between probe and server is 99.95% reliable
  • And you check once per minute (1,440 checks/day)
  • You expect ~0.72 path-related failures per day from that path alone
  • That's approximately 5 false alerts per week per monitor

Multi-region consensus requires a failure to be confirmed from multiple independent probe locations before firing an alert. If Frankfurt says "down" but Virginia and Singapore say "up," the system treats it as a network path issue, not a real outage.
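The decision rule itself is small. In this sketch the `quorum` parameter (how many regions must report a failure before alerting) and the function name are assumptions; tools differ in how many confirmations they require.

```python
from typing import Dict


def should_alert(region_results: Dict[str, bool], quorum: int) -> bool:
    """region_results maps a region name to True if its check passed.

    Alert only when at least `quorum` regions report a failure.
    """
    failures = sum(1 for passed in region_results.values() if not passed)
    return failures >= quorum


if __name__ == "__main__":
    results = {"frankfurt": False, "virginia": True, "singapore": True}
    # One region failing out of three is treated as a network path issue
    print(should_alert(results, quorum=2))
```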

With three-region consensus, all three paths must fail simultaneously. If each path has 99.95% reliability and failures are independent, the probability of all three failing at once is:

0.0005 × 0.0005 × 0.0005 = 0.000000000125 = 0.0000000125%

At one check per minute (525,600 checks per year), that's roughly one path-related false positive every 15,000 years.
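The single-region and three-region figures come from the same calculation, shown here as a short sketch. It assumes one check per minute and fully independent path failures, which real networks only approximate; the function names are illustrative.

```python
CHECKS_PER_DAY = 24 * 60  # one check per minute


def expected_false_alerts_per_day(path_reliability: float, regions: int = 1,
                                  checks_per_day: int = CHECKS_PER_DAY) -> float:
    """Expected checks per day on which every region's path fails at once."""
    p_all_paths_fail = (1 - path_reliability) ** regions
    return p_all_paths_fail * checks_per_day


def years_between_false_alerts(path_reliability: float, regions: int) -> float:
    """Average years between path-related false alerts (365-day year)."""
    per_year = expected_false_alerts_per_day(path_reliability, regions) * 365
    return 1 / per_year


if __name__ == "__main__":
    single = expected_false_alerts_per_day(0.9995, regions=1)
    print(f"single region: {single:.2f} per day, {single * 7:.2f} per week")
    print(f"three regions: one every "
          f"{years_between_false_alerts(0.9995, regions=3):,.0f} years")
```

Correlated failures, such as a shared transit provider or an outage near your own server, make the real-world number less dramatic, but the gap between one region and three remains large.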

Multi-region consensus is the most important architectural difference between monitoring tools. It's the difference between a tool that wakes you up at 3 AM for nothing and one that only alerts when something is genuinely wrong.


What to Monitor (and What Not to)

Monitor these

Health check endpoint - a dedicated route that tests your app's critical dependencies (database connection, cache, etc.), not just the homepage

Critical API endpoints - auth routes, payment endpoints, your highest-traffic API paths

SSL certificates - with 30+ day advance warning

Domain expiry - with 60+ day advance warning

Cron jobs and background workers - via heartbeat monitoring

DNS records - for critical A, MX, and CNAME records
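For the health check endpoint at the top of this list, the core logic is to run each dependency check and return a non-200 status if any of them fails. The sketch below is framework-agnostic; the `health_response` name, the JSON shape, and the choice of 503 for an unhealthy result are assumptions.

```python
import json
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check and build (status_code, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency
            results[name] = False
    healthy = all(results.values())
    body = json.dumps({"status": "ok" if healthy else "degraded",
                       "checks": results})
    return (200 if healthy else 503), body
```

Your web framework's route handler would call this with real checks, for example a function that runs `SELECT 1` against the database, and return the status code and body unchanged. Keep the checks fast so the endpoint does not time out under a 1-minute check interval.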

Don't make these mistakes

Only monitoring the homepage - the homepage can return 200 while your API is completely broken

Monitoring too many low-importance endpoints - alert fatigue from non-critical monitors drowns out real incidents

Using 5-minute check intervals on critical endpoints - that's up to 5 minutes of undetected downtime per incident

No escalation policy - if the primary contact doesn't respond, someone else should be paged automatically

Not testing your alerting - most teams discover their Slack integration is broken during an actual incident


Check Interval: How Often Should You Check?

| Check interval | Best for | Average MTTD |
|---|---|---|
| 15–30 seconds | Production APIs, payment systems, critical infrastructure | 8–15 seconds |
| 1 minute | Most SaaS production services | ~30 seconds |
| 5 minutes | Non-critical services, dev/staging environments | ~2.5 minutes |
| 10–15 minutes | Basic availability checks, low-traffic services | ~5–7 minutes |

For most SaaS applications, a 1-minute check interval on critical endpoints is the right default. The cost difference between 1-minute and 5-minute checks is small; the difference in detection time is significant.


Alert Channels

When a monitor detects a failure, it needs to reach the right person through the right channel. Common alert delivery mechanisms:

| Channel | Best for | Typical delivery time |
|---|---|---|
| Email | Non-urgent alerts, audit trail | 30 sec – 2 min |
| Slack/Discord | Teams that live in Slack, fast group visibility | 5–30 seconds |
| SMS | Urgent on-call pages, off-hours alerts | 10–60 seconds |
| Phone call | Critical systems, highest-urgency escalation | 15–60 seconds |
| Webhook | Custom integrations, PagerDuty, incident management tools | Near-instant |
| OpsGenie/PagerDuty | Enterprise on-call routing | Near-instant (then SMS/call) |

Most teams use Slack for primary alerts and email or SMS for escalation if no one acknowledges.
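Escalation on no acknowledgement can be expressed as a list of delays and channels. The specific delays, the channel names, and the `channels_to_notify` function are assumptions chosen for illustration, not a recommended policy.

```python
from typing import List, Sequence, Tuple

# (seconds after the alert fired, channel) - illustrative values only
ESCALATION_STEPS: Sequence[Tuple[int, str]] = (
    (0, "slack"),
    (300, "sms"),
    (900, "phone"),
)


def channels_to_notify(seconds_since_alert: float, acknowledged: bool,
                       steps: Sequence[Tuple[int, str]] = ESCALATION_STEPS
                       ) -> List[str]:
    """Channels that should have been notified by now for an open alert."""
    if acknowledged:
        return []  # someone has picked it up, so stop escalating
    return [channel for delay, channel in steps if seconds_since_alert >= delay]
```

A scheduler would call this periodically for each open incident and send to any returned channel that has not been notified yet.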


Status Pages

A status page is a public-facing page that shows the current operational status of your services. It's typically hosted at status.yourdomain.com.

What it communicates:

  • Current status of each component (operational, degraded, outage)
  • Active incidents with status updates
  • Historical uptime and past incidents

Status pages serve two purposes:

  1. During an incident - customers can see you're aware of the problem, reducing support ticket volume and panic
  2. During sales - enterprise customers check your historical uptime record before signing contracts

Most uptime monitoring tools can automatically update your status page when a monitor detects a failure, eliminating the lag between "monitoring detected the issue" and "status page shows the issue."


Uptime Monitoring vs. APM vs. Observability

These terms overlap but serve different purposes:

| Tool type | What it answers | Examples |
|---|---|---|
| Uptime monitoring | "Is it up or down?" | Vantaj, UptimeRobot, Better Stack |
| APM (Application Performance Monitoring) | "How is it performing? Where are the bottlenecks?" | Datadog APM, New Relic, Sentry |
| Observability | "What is happening inside my system?" (metrics, logs, traces) | Datadog, Grafana, Honeycomb |
| Synthetic monitoring | "Does this user flow work?" | Checkly, Datadog Synthetics |

Uptime monitoring is not a replacement for APM or observability - it's the first line of detection. It answers "are users affected right now?" in seconds. APM and observability answer "why?" once you know something is wrong.

For most teams, the practical order of adoption is:

  1. Uptime monitoring (day one - free, 5-minute setup)
  2. Error tracking (Sentry, Rollbar - early days)
  3. APM (once you have performance problems to diagnose)
  4. Full observability stack (once you have the engineering bandwidth to operate it)

Getting Started

Setting up basic uptime monitoring takes under 5 minutes:

  1. Pick a tool - free tiers from Vantaj (20 monitors), UptimeRobot (50 monitors), or Better Stack (10 monitors) are sufficient to start
  2. Add your most critical URLs - at minimum: your homepage, your API health endpoint, and your main login/auth route
  3. Configure alert channels - add your Slack workspace or email
  4. Add SSL monitoring for each domain
  5. Set up a status page - link it from your site footer

The most expensive mistake in uptime monitoring isn't choosing the wrong tool - it's not setting it up at all.

A service that goes down for 4 hours before anyone notices is usually not a monitoring tool problem. It's a "we didn't have monitoring" problem.