What is Uptime Monitoring? A Complete Guide (2026)
Uptime monitoring checks whether your websites, APIs, and services are available and responding correctly. This guide covers how it works, what to monitor, and what metrics actually matter.
Uptime monitoring is the practice of continuously checking whether a website, API, or service is available and responding as expected. When a monitor detects a failure - the service is unreachable, responding with an error, or taking too long - it sends an alert to the responsible team.
The goal is simple: know about problems before your users do, and fix them faster.
How Uptime Monitoring Works
At its core, uptime monitoring is a loop:
- A probe (a server in a monitoring datacenter) sends a request to your endpoint
- It checks whether the response matches expectations (correct status code, expected content, acceptable response time)
- If the check passes: the result is logged and no alert fires
- If the check fails: the system triggers an alert and opens an incident
This loop repeats on a check interval - typically every 30 seconds to 5 minutes depending on your plan and requirements.
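As a rough illustration of the "matches expectations" step, the sketch below compares one response against a set of expectations and returns the reasons it failed, if any. The `Expectation` structure, the `evaluate` function, and the default thresholds are hypothetical names and values chosen for this example, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Expectation:
    """What a passing response looks like (hypothetical structure)."""
    status_code: int = 200
    max_response_ms: float = 2000.0
    body_contains: Optional[str] = None


def evaluate(status_code: int, elapsed_ms: float, body: str,
             expected: Expectation) -> List[str]:
    """Return a list of failure reasons; an empty list means the check passed."""
    failures = []
    if status_code != expected.status_code:
        failures.append(f"status {status_code}, expected {expected.status_code}")
    if elapsed_ms > expected.max_response_ms:
        failures.append(
            f"took {elapsed_ms:.0f} ms, limit {expected.max_response_ms:.0f} ms"
        )
    if expected.body_contains is not None and expected.body_contains not in body:
        failures.append(f"body missing {expected.body_contains!r}")
    return failures
```

An empty list maps to "log the result and move on"; a non-empty list is what would trigger the alert and open an incident.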
What a check actually does
For an HTTP monitor, the probe:
1. Resolves the DNS name of your endpoint
2. Establishes a TCP connection
3. Performs a TLS handshake (for HTTPS)
4. Sends an HTTP GET (or configured method)
5. Receives the response
6. Validates: status code, response time, optional body content
7. Records the result with a timestamp
Each step can fail independently. A DNS resolution failure looks different from a TLS handshake timeout, which looks different from a 500 response from your application - and good monitoring systems log enough detail to distinguish between them.
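The steps above can be sketched with Python's standard library so that each stage is attributed separately when it fails. This is a simplified illustration, not how any specific monitoring product is built: `run_check` and `CheckResult` are hypothetical names, only the status line of the response is read, and redirects and body validation are left out.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CheckResult:
    ok: bool
    failed_step: Optional[str]  # "dns", "tcp", "tls", "http", "validate", or None
    detail: str
    elapsed_ms: float           # time until the status line was read
    timestamp: float


def run_check(url: str, expected_status: int = 200, timeout: float = 10.0) -> CheckResult:
    parts = urlsplit(url)
    use_tls = parts.scheme == "https"
    host = parts.hostname
    port = parts.port or (443 if use_tls else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    started = time.monotonic()
    step = "dns"
    sock = None
    try:
        # 1. Resolve the DNS name
        family, socktype, proto, _, addr = socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM)[0]
        # 2. Establish a TCP connection
        step = "tcp"
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        sock.connect(addr)
        # 3. TLS handshake (HTTPS only); verifies the chain and hostname
        if use_tls:
            step = "tls"
            sock = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
        # 4-5. Send a GET and read the status line of the response
        step = "http"
        request = f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        with sock.makefile("rb") as reader:
            status_line = reader.readline().decode("latin-1")
        status = int(status_line.split()[1])
        # 6. Validate the status code
        step = "validate"
        if status != expected_status:
            raise ValueError(f"status {status}, expected {expected_status}")
        step, detail = None, "ok"
    except (OSError, ValueError, IndexError) as exc:
        # DNS, socket, timeout, and TLS errors are all OSError subclasses
        detail = f"{type(exc).__name__}: {exc}"
    finally:
        if sock is not None:
            sock.close()
    # 7. Record the result with a timestamp
    elapsed_ms = (time.monotonic() - started) * 1000
    return CheckResult(step is None, step, detail, elapsed_ms, time.time())
```

Because `failed_step` records which stage was in progress when the error occurred, a refused connection shows up as `"tcp"` while an unexpected status code shows up as `"validate"`.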
Types of Uptime Monitoring
HTTP/HTTPS Monitoring
The most common type. Sends a request to a URL and validates the response.
What you can check:
- HTTP status code (200, 404, 500, etc.)
- Response time / latency
- Response body content (does it contain expected text?)
- HTTP response headers
- Redirect chain (does the URL redirect to the right place?)
Best for: Websites, API endpoints, SaaS applications, landing pages
SSL Certificate Monitoring
Checks whether your SSL/TLS certificate is valid and alerts before it expires.
What you can check:
- Days until expiry
- Certificate validity (is it properly signed?)
- Certificate chain completeness
- Whether the certificate matches the domain
Why it matters: An expired SSL certificate causes all major browsers to block the page behind a full-screen security warning, effectively taking the site offline for most users. API clients that verify certificates will refuse the connection outright. Even with auto-renewal configured, renewal can fail silently.
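A minimal sketch of the "days until expiry" check, using Python's `ssl` module. The function names and the 30-day default are assumptions for illustration. `fetch_not_after` performs a verified handshake, so a certificate that is already expired, mismatched, or incorrectly chained raises `ssl.SSLCertVerificationError` before any date is returned, which is itself a failure worth alerting on.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Complete a verified TLS handshake and return the certificate's notAfter field."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((hostname, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal_alert(not_after: str, warn_days: int = 30,
                        now: Optional[float] = None) -> bool:
    """True once the certificate is within `warn_days` of expiring."""
    return days_until_expiry(not_after, now) <= warn_days
```

Running `needs_renewal_alert(fetch_not_after("example.com"))` once a day would be enough for this kind of check, since expiry dates do not change minute to minute.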
Domain Expiry Monitoring
Checks when your domain registration expires and alerts before the renewal deadline.
Why it matters: Domain registrations expire on a set date. If auto-renewal fails (expired payment method, registrar issue), the domain lapses: it typically stops resolving, and once any registrar grace period ends it becomes available for registration by anyone. The result is total downtime for everything on that domain, including your website, API, and email.
DNS Record Monitoring
Checks whether your DNS records are correct - particularly A records, CNAME records, MX records, and NS records.
Why it matters: Incorrect or missing DNS records cause service failures that look like server outages but are actually routing problems. DNS monitoring catches misconfigurations from accidental changes or provider issues.
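One way to picture a DNS record check is as a comparison between the values you expect and the values that currently resolve. The sketch below covers A records only, because Python's standard library goes through the system resolver and cannot query MX or NS records directly; those would need a dedicated DNS library. The function names and IP addresses are placeholders.

```python
import socket
from typing import Dict, Set


def resolve_a_records(hostname: str) -> Set[str]:
    """Resolve IPv4 addresses via the system resolver (follows CNAMEs, may use caches)."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def diff_records(expected: Set[str], actual: Set[str]) -> Dict[str, Set[str]]:
    """Report values that disappeared and values that showed up unexpectedly."""
    return {"missing": expected - actual, "unexpected": actual - expected}
```

A monitor built this way would alert when either set in `diff_records(expected, resolve_a_records("example.com"))` is non-empty.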
Heartbeat Monitoring (Cron Job Monitoring)
The inverse of HTTP monitoring. Instead of your monitor reaching out to your service, your service is expected to ping the monitor on a schedule.
If the ping doesn't arrive within the expected window, an alert fires.
What it monitors: Cron jobs, background workers, scheduled tasks, data pipelines, backup scripts
Why it matters: Cron jobs fail silently by default. A backup job that stopped running three weeks ago won't tell you it stopped - until you need the backup. Heartbeat monitoring catches this.
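On the monitoring side, a heartbeat check reduces to "has a ping arrived within the expected period plus some grace time?". The sketch below shows that logic only; the `Heartbeat` class, its field names, and the choice not to alert before the first ping are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heartbeat:
    name: str
    period_s: float                    # how often the job is expected to ping
    grace_s: float                     # extra slack before alerting
    last_ping: Optional[float] = None  # Unix time of the most recent ping

    def record_ping(self, now: float) -> None:
        self.last_ping = now

    def is_overdue(self, now: float) -> bool:
        if self.last_ping is None:
            # Assumption: a monitor that has never been pinged stays pending
            return False
        return now - self.last_ping > self.period_s + self.grace_s
```

The job itself only needs to make an HTTP request to its ping URL as the last step of a successful run, so that a crash or a skipped schedule results in no ping at all.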
Key Uptime Monitoring Metrics
Uptime Percentage
The proportion of time a service was available during a measurement period.
Uptime % = (Total time - Downtime) / Total time × 100
| Uptime % | Downtime per year | Common label |
|---|---|---|
| 99% | 87 hours, 36 minutes | "Two nines" |
| 99.9% | 8 hours, 45 minutes | "Three nines" |
| 99.95% | 4 hours, 22 minutes | - |
| 99.99% | 52 minutes, 33 seconds | "Four nines" |
| 99.999% | 5 minutes, 15 seconds | "Five nines" |
Most commercial SaaS products target 99.9% ("three nines"). Enterprise infrastructure and financial systems often target 99.99% or higher.
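The table values follow directly from the formula. As a quick worked example, the sketch below computes an uptime percentage from measured downtime and converts a target percentage into a yearly downtime budget, assuming a 365-day year. The function names are illustrative.

```python
def uptime_percent(total_s: float, downtime_s: float) -> float:
    """Uptime % = (total time - downtime) / total time x 100."""
    return (total_s - downtime_s) / total_s * 100


def downtime_budget_s(target_percent: float,
                      period_s: float = 365 * 24 * 3600) -> float:
    """Seconds of downtime allowed in the period at the given uptime target."""
    return period_s * (1 - target_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes, and seconds (rounded to the second)."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {format_duration(downtime_budget_s(target))}")
```

The printed budgets agree with the table to within rounding. Passing a 30-day `period_s` instead gives the monthly budget, which is often the number an SLA is written against.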
Mean Time to Detect (MTTD)
The average time between when a failure begins and when monitoring detects it.
MTTD is driven by your check interval. With 5-minute checks, a failure that begins just after a check passes can go undetected for up to 5 minutes. With 1-minute checks, the worst case is 1 minute.
Average MTTD ≈ Check interval / 2
For a 1-minute check interval, your average MTTD is about 30 seconds.
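The interval / 2 rule can be checked numerically. The sketch below assumes checks fire on a fixed schedule, failures are equally likely to start at any point in the interval, and a failure is detected by the very next check (no retries or confirmation delay). Both function names are made up for this example.

```python
def detection_delay(failure_offset_s: float, interval_s: float) -> float:
    """Seconds until the next check, for a failure starting
    `failure_offset_s` seconds after the previous check."""
    return interval_s - (failure_offset_s % interval_s)


def average_mttd(interval_s: float, samples: int = 10_000) -> float:
    """Average detection delay for failures spread evenly across one interval."""
    step = interval_s / samples
    # Use the midpoint of each slice so neither edge of the interval is over-weighted
    offsets = [(i + 0.5) * step for i in range(samples)]
    return sum(detection_delay(o, interval_s) for o in offsets) / samples
```

For example, `detection_delay(60, 300)` gives 240: with 5-minute checks, a failure that starts one minute after a check waits four minutes for the next one. In practice, confirmation retries and alert delivery add to these figures.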
Mean Time to Recovery (MTTR)
The average time from when a failure begins to when the service is restored. MTTR includes:
- Detection time (covered by MTTD above)
- Alert delivery time
- Time for an engineer to acknowledge and begin investigating
- Time to diagnose the root cause
- Time to apply a fix
- Time to verify recovery
MTTR is the single most important reliability metric for teams with customers. It's what determines how long customers experience an outage after it begins.
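Both averages can be derived from incident records. The sketch below measures recovery from the start of the failure, so that detection time is included as in the list above. The `Incident` fields are hypothetical; real incident tools name and record these timestamps differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    started_at: float   # when the failure began (Unix seconds)
    detected_at: float  # when monitoring raised the alert
    resolved_at: float  # when recovery was verified


def mean_time_to_detect(incidents: List[Incident]) -> float:
    """Average seconds between failure start and detection."""
    return sum(i.detected_at - i.started_at for i in incidents) / len(incidents)


def mean_time_to_recovery(incidents: List[Incident]) -> float:
    """Average seconds between failure start and verified recovery."""
    return sum(i.resolved_at - i.started_at for i in incidents) / len(incidents)
```

Tracking the two numbers separately shows whether an improvement should come from faster detection (shorter check intervals) or from faster response and repair.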
False Positive Rate
The percentage of alerts that turn out to be non-incidents - monitoring artifacts caused by network path issues, probe failures, or transient errors rather than actual service failures.
A high false positive rate erodes trust in monitoring. Teams start ignoring alerts, which means real incidents get missed.
False positives are primarily caused by single-region monitoring: when a check runs from only one location, a routing issue between that probe and your server looks identical to your server being down. Multi-region consensus eliminates most false positives.
Multi-Region Consensus: Why It Matters
Many monitoring setups check your endpoint from a single probe location. If that probe can't reach your server, it fires an alert.
The problem: the internet is not a single reliable network. A request from a monitoring probe in Frankfurt to a server in Virginia passes through multiple networks, transit providers, internet exchange points, and submarine cables - any of which can have transient failures unrelated to your infrastructure.
Single-region monitoring false positive rate (simplified):
- If the network path between probe and server is 99.95% reliable
- And you check once per minute (1,440 checks/day)
- You expect ~0.72 path-related failures per day from that path alone
- That's approximately 5 false alerts per week per monitor
Multi-region consensus requires a failure to be confirmed from multiple independent probe locations before firing an alert. If Frankfurt says "down" but Virginia and Singapore say "up," the system treats it as a network path issue, not a real outage.
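The decision rule can be sketched as a simple quorum over per-region results. The function name, the region names, and the `quorum` parameter are illustrative; real tools differ in how many regions must agree and in whether they re-check before deciding.

```python
from typing import Dict


def confirmed_down(results: Dict[str, bool], quorum: int) -> bool:
    """`results` maps region name -> True if that region's check passed.

    The outage is confirmed only if at least `quorum` regions saw a failure.
    """
    failures = sum(1 for passed in results.values() if not passed)
    return failures >= quorum
```

With `quorum=2`, the Frankfurt-only failure described above would be recorded as a network path issue and no alert would fire.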
With three-region consensus, all three paths must fail simultaneously. If each path has 99.95% reliability and failures are independent, the probability of all three failing at once is:
0.0005 × 0.0005 × 0.0005 = 0.000000000125 = 0.0000000125%
That's roughly one path-related false positive every 15,000 years.
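The arithmetic behind these figures is short enough to reproduce. The sketch below uses the same simplifying assumptions as the text: each path is 99.95% reliable, checks run once per minute, and path failures are independent (real-world failures are often correlated, so treat the result as an order-of-magnitude estimate).

```python
CHECKS_PER_DAY = 24 * 60        # one check per minute
PATH_FAILURE_PROB = 1 - 0.9995  # a 99.95% reliable path fails 0.05% of checks

# Single region: every path failure becomes a false alert
single_per_day = CHECKS_PER_DAY * PATH_FAILURE_PROB
single_per_week = single_per_day * 7

# Three regions: all three independent paths must fail on the same check
three_region_prob = PATH_FAILURE_PROB ** 3
three_per_year = CHECKS_PER_DAY * 365 * three_region_prob
years_between = 1 / three_per_year

print(f"single region: {single_per_day:.2f}/day, {single_per_week:.2f}/week")
print(f"three regions: p = {three_region_prob:.3e}, "
      f"one every {years_between:,.0f} years")
```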
Multi-region consensus is the most important architectural difference between monitoring tools. It's the difference between a tool that wakes you up at 3 AM for nothing and one that only alerts when something is genuinely wrong.
What to Monitor (and What Not to)
Monitor these
✅ Health check endpoint - a dedicated route that tests your app's critical dependencies (database connection, cache, etc.), not just the homepage
✅ Critical API endpoints - auth routes, payment endpoints, your highest-traffic API paths
✅ SSL certificates - with 30+ day advance warning
✅ Domain expiry - with 60+ day advance warning
✅ Cron jobs and background workers - via heartbeat monitoring
✅ DNS records - for critical A, MX, and CNAME records
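A dedicated health check endpoint might look roughly like the sketch below, built on Python's standard `http.server`. The `/healthz` path, the handler, and the two placeholder dependency checks are assumptions for illustration; in a real application they would be replaced by a cheap query against the actual database and cache.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    # Placeholder: replace with a cheap real query such as "SELECT 1"
    return True


def check_cache() -> bool:
    # Placeholder: replace with a real round trip to the cache
    return True


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check; report 503 if any one fails or raises."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    checks = {"database": check_database, "cache": check_cache}

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status(self.checks)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, `HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()` would serve the route. Returning 503 when a dependency is down lets a plain status-code monitor catch failures that a homepage check would miss.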
Don't make these mistakes
❌ Only monitoring the homepage - the homepage can return 200 while your API is completely broken
❌ Monitoring too many low-importance endpoints - alert fatigue from non-critical monitors drowns out real incidents
❌ Using 5-minute check intervals on critical endpoints - that's up to 5 minutes of undetected downtime per incident
❌ No escalation policy - if the primary contact doesn't respond, someone else should be paged automatically
❌ Not testing your alerting - most teams discover their Slack integration is broken during an actual incident
Check Interval: How Often Should You Check?
| Check interval | Best for | Average MTTD |
|---|---|---|
| 15–30 seconds | Production APIs, payment systems, critical infrastructure | 8–15 seconds |
| 1 minute | Most SaaS production services | ~30 seconds |
| 5 minutes | Non-critical services, dev/staging environments | ~2.5 minutes |
| 10–15 minutes | Basic availability checks, low-traffic services | ~5–7 minutes |
For most SaaS applications, 1-minute checks on critical endpoints are the right default. The cost difference between 1-minute and 5-minute checks is small; the difference in detection time is significant.
Alert Channels
When a monitor detects a failure, it needs to reach the right person through the right channel. Common alert delivery mechanisms:
| Channel | Best for | Typical delivery time |
|---|---|---|
| Email | Non-urgent alerts, audit trail | 30 sec – 2 min |
| Slack/Discord | Teams that live in Slack, fast group visibility | 5–30 seconds |
| SMS | Urgent on-call pages, off-hours alerts | 10–60 seconds |
| Phone call | Critical systems, highest-urgency escalation | 15–60 seconds |
| Webhook | Custom integrations, PagerDuty, incident management tools | Near-instant |
| OpsGenie/PagerDuty | Enterprise on-call routing | Near-instant (then SMS/call) |
Most teams use Slack for primary alerts and email or SMS for escalation if no one acknowledges.
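That escalation pattern can be expressed as a small time-based policy. The sketch below is one possible shape for it; the delays, the channel order, and the `channels_due` function are hypothetical and would be tuned per team.

```python
from typing import List, Tuple

# Hypothetical policy: (minutes after the alert fired, channel to notify)
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "slack"),
    (5, "email"),
    (10, "sms"),
]


def channels_due(minutes_since_alert: float, acknowledged: bool,
                 policy: List[Tuple[int, str]] = ESCALATION_POLICY) -> List[str]:
    """Channels that should have fired by now for an unacknowledged alert.

    An acknowledged alert needs no further notifications.
    """
    if acknowledged:
        return []
    return [channel for delay, channel in policy if minutes_since_alert >= delay]
```

A scheduler would call this periodically for each open incident and send to any channel in the result that has not been notified yet.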
Status Pages
A status page is a public-facing page that shows the current operational status of your services. It's typically hosted at status.yourdomain.com.
What it communicates:
- Current status of each component (operational, degraded, outage)
- Active incidents with status updates
- Historical uptime and past incidents
Status pages serve two purposes:
- During an incident - customers can see you're aware of the problem, reducing support ticket volume and panic
- During sales - enterprise customers check your historical uptime record before signing contracts
Most uptime monitoring tools can automatically update your status page when a monitor detects a failure, eliminating the lag between "monitoring detected the issue" and "status page shows the issue."
Uptime Monitoring vs. APM vs. Observability
These terms overlap but serve different purposes:
| Tool type | What it answers | Examples |
|---|---|---|
| Uptime monitoring | "Is it up or down?" | Vantaj, UptimeRobot, Better Stack |
| APM (Application Performance Monitoring) | "How is it performing? Where are the bottlenecks?" | Datadog APM, New Relic, Sentry |
| Observability | "What is happening inside my system?" (metrics, logs, traces) | Datadog, Grafana, Honeycomb |
| Synthetic monitoring | "Does this user flow work?" | Checkly, Datadog Synthetics |
Uptime monitoring is not a replacement for APM or observability - it's the first line of detection. It answers "are users affected right now?" in seconds. APM and observability answer "why?" once you know something is wrong.
For most teams, the practical order of adoption is:
1. Uptime monitoring (day one - free, 5-minute setup)
2. Error tracking (Sentry, Rollbar - early days)
3. APM (once you have performance problems to diagnose)
4. Full observability stack (once you have the engineering bandwidth to operate it)
Getting Started
Setting up basic uptime monitoring takes under 5 minutes:
- Pick a tool - free tiers from Vantaj (20 monitors), UptimeRobot (50 monitors), or Better Stack (10 monitors) are sufficient to start
- Add your most critical URLs - at minimum: your homepage, your API health endpoint, and your main login/auth route
- Configure alert channels - add your Slack workspace or email
- Add SSL monitoring for each domain
- Set up a status page - link it from your site footer
The most expensive mistake in uptime monitoring isn't choosing the wrong tool - it's not setting it up at all.
A service that goes down for 4 hours before anyone notices is usually not a monitoring tool problem. It's a "we didn't have monitoring" problem.