The Complete Guide to Website Uptime Monitoring (2026)
Everything you need to know about uptime monitoring: how it works, types of checks, key metrics like MTTR and MTTD, how to choose a tool, and a glossary of monitoring terms.
Uptime monitoring is the practice of automatically checking whether a website, API, or online service is accessible and responding correctly. A monitoring tool sends regular requests to your endpoints from one or more locations around the world. When a check fails - the server does not respond, returns an error, or takes too long - the tool alerts your team so you can investigate and fix the issue before users are significantly affected.
This guide covers how uptime monitoring works, the different types of checks, the metrics that matter, how to choose a tool, and a glossary of every term you will encounter.
How Uptime Monitoring Works
At its core, uptime monitoring follows a simple loop:
Step 1: Configure a Check
You tell the monitoring tool what to check. At minimum, this is a URL (e.g., https://api.example.com/health). You also configure:
- Check interval - how often to run the check (every 30 seconds, 1 minute, 5 minutes)
- Expected response - what a healthy response looks like (HTTP 200 status code, a specific keyword in the body, response time under a threshold)
- Probe regions - where to check from (US, Europe, Asia-Pacific)
- Alert contacts - who to notify when the check fails (email, Slack channel, Discord, webhook)
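The settings above could be captured in a small configuration object. The following is a minimal Python sketch; the class name `CheckConfig`, its field names, and the default values are illustrative assumptions, not the schema of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CheckConfig:
    """Hypothetical check definition; all field names are illustrative."""
    url: str
    interval_seconds: int = 60               # how often to run the check
    expected_status: int = 200               # status code of a healthy response
    expected_keyword: Optional[str] = None   # text that must appear in the body
    max_response_ms: int = 2000              # slow-response threshold
    regions: Tuple[str, ...] = ("us-east", "eu-west", "ap-southeast")
    alert_contacts: Tuple[str, ...] = ()     # e.g. email addresses or webhook URLs

    def __post_init__(self) -> None:
        # Reject obviously invalid settings early.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an http(s) URL")
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")


health_check = CheckConfig(
    url="https://api.example.com/health",
    interval_seconds=30,
    expected_keyword="ok",
    alert_contacts=("oncall@example.com",),
)
```

Freezing the dataclass keeps a check definition immutable once created, which makes it safe to share between scheduler threads.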
Step 2: The Tool Sends Requests on a Schedule
Every check interval, the monitoring tool sends an HTTP request (or TCP connection, DNS query, ICMP ping, etc.) to your endpoint from the configured probe regions. It records:
- Whether the endpoint responded
- The HTTP status code returned
- The response time (latency)
- Whether the response body contains expected content
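A single HTTP check that records these four values might look like the sketch below. It uses only the Python standard library; the names `CheckResult` and `run_http_check` are assumptions made for illustration, and a real probe would add retries, redirect policy, and custom headers.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    responded: bool                 # did the endpoint answer at all?
    status: Optional[int]           # HTTP status code, if any
    latency_ms: float               # time until the full body was read
    keyword_found: Optional[bool]   # None when no keyword check applies


def run_http_check(url: str, keyword: Optional[str] = None,
                   timeout: float = 10.0) -> CheckResult:
    """Send one GET request and record the outcome."""
    start = time.monotonic()
    status, body, responded = None, "", False
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
            responded = True
    except urllib.error.HTTPError as err:
        # 4xx/5xx: the server did respond, just not with a success code.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
        responded = True
    except (urllib.error.URLError, OSError):
        pass  # DNS failure, refused connection, timeout, and similar
    latency_ms = (time.monotonic() - start) * 1000
    found = (keyword in body) if (keyword is not None and responded) else None
    return CheckResult(responded, status, latency_ms, found)
```

For example, `run_http_check("https://api.example.com/health", keyword="ok")` returns a `CheckResult` that a later step can compare against the expected response.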
Step 3: Failure Detection
A check "fails" when the endpoint does not meet the expected criteria. Common failure conditions:
- No response - the server did not respond within the timeout window
- Wrong status code - the server returned 500, 502, 503, or another error code instead of 200
- Keyword missing - the response body does not contain expected content (useful for detecting error pages that still return 200)
- Slow response - the server responded but took longer than the configured threshold
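These four conditions map naturally onto a small classification function. This is a sketch only: the function name, the reason strings, and the default thresholds are assumptions chosen for illustration.

```python
from typing import Optional


def failure_reason(responded: bool, status: Optional[int], body: str,
                   latency_ms: float, *, expected_status: int = 200,
                   keyword: Optional[str] = None,
                   max_latency_ms: float = 2000.0) -> Optional[str]:
    """Return why a check failed, or None if it passed."""
    if not responded:
        return "no_response"        # nothing came back within the timeout
    if status != expected_status:
        return "wrong_status"       # e.g. 500, 502, 503 instead of 200
    if keyword is not None and keyword not in body:
        return "keyword_missing"    # e.g. an error page served with HTTP 200
    if latency_ms > max_latency_ms:
        return "slow_response"      # answered, but too slowly
    return None
```

The order matters: a missing response is reported before anything else, and slowness is only reported for responses that were otherwise healthy.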
Step 4: Verification (Multi-Region Consensus)
Good monitoring tools do not alert on the first failure. A single failed check could be caused by a network issue between the probe and your server, not an actual outage. Multi-region consensus means the tool re-checks from other probe locations before confirming a failure.
For example, Vantaj runs checks from US-East, EU-West, and AP-Southeast simultaneously. An alert only fires when all regions confirm the failure, which filters out false positives caused by network blips that affect only one region.
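The all-regions rule described here reduces to a one-line predicate. The sketch below assumes each region's result has already been reduced to a boolean; the function name and region keys are illustrative.

```python
from typing import Mapping


def confirmed_down(region_failed: Mapping[str, bool]) -> bool:
    """True only when every probe region reports a failure.

    An empty mapping is treated as "not confirmed", so a scheduler bug that
    produces no results cannot raise an alert by itself.
    """
    return bool(region_failed) and all(region_failed.values())


# One region failing is not enough; all three must agree.
regional_blip = {"us-east": True, "eu-west": False, "ap-southeast": False}
real_outage = {"us-east": True, "eu-west": True, "ap-southeast": True}
```

With these inputs, `confirmed_down(regional_blip)` is `False` and `confirmed_down(real_outage)` is `True`.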
Step 5: Alerting
When a failure is confirmed, the tool sends alerts through your configured channels:
- Email - to the team or on-call engineer
- Slack / Discord - to a dedicated alerts channel
- Webhooks - to trigger custom workflows (PagerDuty, OpsGenie, custom scripts)
- SMS / phone call - for critical, must-respond-immediately alerts
An incident is opened with the start time, affected regions, and failure details.
Step 6: Recovery Detection
The tool continues checking the endpoint. When it starts responding correctly again, the tool:
- Closes the incident automatically
- Sends a recovery notification
- Records the total downtime duration
This loop runs continuously, 24/7, without human intervention.
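The open-and-close behaviour of an incident can be modelled as a two-state machine. The following sketch assumes it is fed already-confirmed results in time order; `IncidentTracker` and the event names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentTracker:
    open_since: Optional[datetime] = None         # start of the current incident
    closed: list = field(default_factory=list)    # finished (start, end) pairs

    def record(self, is_down: bool, at: datetime) -> Optional[str]:
        """Feed one confirmed check result; return an event name, if any."""
        if is_down and self.open_since is None:
            self.open_since = at
            return "incident_opened"        # would trigger the alert
        if not is_down and self.open_since is not None:
            self.closed.append((self.open_since, at))
            self.open_since = None
            return "incident_recovered"     # would trigger the recovery notice
        return None                         # no state change

    def total_downtime(self) -> timedelta:
        """Sum the duration of all closed incidents."""
        return sum((end - start for start, end in self.closed), timedelta())
```

Because only state changes return an event, repeated failures during an ongoing incident do not produce repeated alerts.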
Types of Uptime Monitoring
HTTP / HTTPS Monitoring
The most common type. The tool sends an HTTP or HTTPS request to a URL and checks the response. You can configure:
- Expected status codes (200, 301, etc.)
- Keyword matching in the response body
- Custom request headers and body (for API endpoints)
- Authentication (Basic Auth, Bearer tokens)
- SSL certificate validation
Use for: Websites, APIs, web applications, microservices, health check endpoints.
TCP Port Monitoring
The tool opens a TCP connection to a specific host and port. If the connection is established, the check passes. No HTTP protocol is involved - just a raw TCP handshake.
Use for: Databases (MySQL on port 3306, PostgreSQL on port 5432), mail servers (SMTP on ports 25 and 587), custom services running on non-HTTP ports, game servers.
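A TCP port check needs nothing more than a connection attempt. This sketch uses Python's standard `socket` module; the function name `tcp_port_open` is an assumption for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes in time."""
    try:
        # create_connection performs the handshake; we close immediately after.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, timeouts, DNS failures.
        return False
```

Note that a successful handshake only shows that something is listening; it does not prove that the database or mail server behind the port is healthy.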
ICMP / Ping Monitoring
The tool sends ICMP echo requests (pings) to a host and measures whether it responds. This checks basic network reachability without testing any specific application.
Use for: Network infrastructure, routers, firewalls, bare-metal servers. Less useful for web applications (a server can respond to pings while the web server process has crashed).
DNS Monitoring
The tool queries DNS records for a domain and checks whether the expected records are returned. This catches DNS misconfigurations, propagation failures, and unauthorized changes.
Use for: Detecting DNS hijacking, monitoring record changes (A, MX, NS, CNAME, TXT), verifying propagation after DNS updates.
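As a rough illustration, the sketch below compares the IPv4 addresses a hostname resolves to against an expected set. It is deliberately limited: it relies on the system resolver via `socket.getaddrinfo`, so it only covers A records, and checking MX, NS, CNAME, or TXT records would require a dedicated DNS library. The function names are assumptions.

```python
import socket
from typing import Set, Tuple


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve IPv4 addresses through the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return {info[4][0] for info in infos}


def dns_drift(resolved: Set[str], expected: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) addresses relative to the expected set."""
    return expected - resolved, resolved - expected
```

A non-empty `unexpected` set is the signal worth alerting on, since it can indicate a misconfiguration or an unauthorized change.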
SSL Certificate Monitoring
The tool connects to a server over TLS, extracts the certificate, and checks for expiry, chain validity, hostname matching, and revocation status. Alerts are sent days or weeks before the certificate expires.
Use for: Preventing outages caused by expired or misconfigured SSL certificates. Vantaj alerts at 90, 60, 30, 7, and 1 day before expiry.
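The expiry part of this check can be sketched with Python's standard `ssl` module. The function names are illustrative assumptions. The default context used here verifies the chain and hostname but does not check revocation status, so that part of the description is not covered.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter field, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    ctx = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    now = time.time() if now is None else now
    return (expires - now) / 86400
```

A scheduler could call `days_until_expiry(fetch_not_after("example.com"))` once a day and alert when the result drops below a chosen threshold.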
Domain Expiry Monitoring
The tool queries WHOIS or RDAP data for a domain and tracks when the registration expires. Alerts are sent well before the domain lapses.
Use for: Preventing domain expiry disasters - a lapsed domain can take your entire business offline and be registered by someone else.
Heartbeat / Cron Job Monitoring
Instead of the tool checking your service (pull-based), your service checks in with the tool (push-based). You add an HTTP request to the end of a cron job or scheduled task. If the expected ping does not arrive on time, the tool alerts you.
Use for: Cron jobs, background workers, data pipelines, backup scripts, queue consumers - any scheduled task that can fail silently.
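On the monitoring side, the core of a heartbeat check is a deadline comparison. This sketch assumes the tool stores the time of the last received ping; the function and parameter names are illustrative, and the grace period is the extra allowance defined in the glossary.

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_ping: datetime, expected_every: timedelta,
                      grace: timedelta, now: datetime) -> bool:
    """True when the next ping should have arrived but has not."""
    deadline = last_ping + expected_every + grace
    return now > deadline


# A nightly backup that is allowed to run up to 15 minutes late.
last = datetime(2026, 1, 1, 2, 0)
on_time = heartbeat_overdue(last, timedelta(days=1), timedelta(minutes=15),
                            now=datetime(2026, 1, 2, 2, 10))
late = heartbeat_overdue(last, timedelta(days=1), timedelta(minutes=15),
                         now=datetime(2026, 1, 2, 2, 20))
```

Here `on_time` is `False` (still within the grace period) and `late` is `True`.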
Keyword Monitoring
The tool checks whether specific text appears (or does not appear) in the response body. This catches scenarios where a server returns HTTP 200 but serves an error page, a maintenance page, or unexpected content.
Use for: Detecting soft failures where the server responds but serves wrong content. Checking that critical content (pricing, product info) has not been removed or altered.
Key Uptime Monitoring Metrics
Uptime Percentage
The percentage of time your service was available over a given period. Calculated as:
Uptime % = ((Total time - Downtime) / Total time) × 100
Common SLA tiers and their allowed downtime per year:
| Uptime % | Called | Downtime per Year | Downtime per Month |
|---|---|---|---|
| 99% | Two nines | 3 days 15 hours | 7 hours 18 min |
| 99.9% | Three nines | 8 hours 46 min | 43 min 50 sec |
| 99.95% | Three and a half nines | 4 hours 23 min | 21 min 55 sec |
| 99.99% | Four nines | 52 min 36 sec | 4 min 23 sec |
| 99.999% | Five nines | 5 min 15 sec | 26 sec |
The difference between 99.9% and 99.99% sounds small but represents a 10x reduction in allowed downtime.
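The formula and the downtime figures in the table can be reproduced in a few lines. This sketch assumes a 365.25-day year and a month of one twelfth of that, which is what the table's values appear to use; the function names are illustrative.

```python
YEAR_SECONDS = 365.25 * 24 * 3600    # assumed average year length
MONTH_SECONDS = YEAR_SECONDS / 12    # assumed average month length


def uptime_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Uptime % = ((total time - downtime) / total time) x 100."""
    return (total_seconds - downtime_seconds) / total_seconds * 100


def allowed_downtime_seconds(target_percent: float, period_seconds: float) -> float:
    """Downtime budget for a given uptime target over a period."""
    return period_seconds * (1 - target_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year_min = allowed_downtime_seconds(target, YEAR_SECONDS) / 60
    per_month_min = allowed_downtime_seconds(target, MONTH_SECONDS) / 60
    print(f"{target}%: {per_year_min:.1f} min/year, {per_month_min:.1f} min/month")
```

Going from 99.9% to 99.99% divides the budget by ten, which is why each additional nine is progressively harder to achieve.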
MTTD (Mean Time to Detect)
The average time between an outage starting and your team being notified. MTTD depends on:
- Check interval - a 5-minute interval means up to 5 minutes of detection delay
- Verification time - multi-region consensus adds seconds, not minutes
- Alert delivery time - Slack and email are near-instant; SMS may have carrier delays
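Lower MTTD means faster response. Because an outage can begin just after a check completes, the worst-case detection delay equals the check interval: 30 seconds with 30-second checks versus 5 minutes with 5-minute checks, a 10x difference.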
MTTR (Mean Time to Repair)
The average time between detection and resolution. MTTR includes:
- Time to acknowledge the alert
- Time to diagnose the root cause
- Time to implement a fix
- Time to verify the fix works
Monitoring tools directly reduce MTTD. They indirectly reduce MTTR by providing incident timelines, affected regions, and response time history that accelerate diagnosis.
MTBF (Mean Time Between Failures)
The average time between incidents. A higher MTBF indicates a more reliable service. Tracking MTBF over time shows whether your infrastructure is becoming more or less stable.
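Given a list of incident records, all three averages follow directly from the definitions above. This is a sketch under stated assumptions: the `Incident` record and its field names are hypothetical, timestamps are plain epoch seconds, and MTBF is measured from one incident's resolution to the next incident's start, which is one common convention among several.

```python
from statistics import mean
from typing import List, NamedTuple


class Incident(NamedTuple):
    started: float    # when the outage actually began (epoch seconds)
    detected: float   # when the alert was delivered
    resolved: float   # when the service recovered


def mttd(incidents: List[Incident]) -> float:
    """Mean time from outage start to alert delivery."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> float:
    """Mean time from detection to resolution."""
    return mean(i.resolved - i.detected for i in incidents)


def mtbf(incidents: List[Incident]) -> float:
    """Mean healthy time between consecutive incidents (needs at least two)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


history = [Incident(0, 60, 660), Incident(10_000, 10_030, 10_330)]
```

For `history`, MTTD is 45 seconds, MTTR is 450 seconds, and MTBF is 9,340 seconds.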
Response Time (Latency)
The time between sending a request and receiving the full response. Monitoring tools typically track:
- Average response time over a period
- P95 / P99 response time - the response time that 95% or 99% of checks fall under
- Response time trends - gradual increases often precede full outages
A spike in response time is frequently an early warning sign of an impending outage.
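Percentiles such as P95 can be computed from recorded latencies in several ways. The sketch below uses the nearest-rank method, which is one common convention; other tools interpolate between samples and may report slightly different values. The function name and sample data are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Ten response times in milliseconds, with one slow outlier.
latencies_ms = [120, 130, 110, 900, 125, 115, 140, 135, 128, 122]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Here `p50` is 125 while `p95` is 900, which shows how a high percentile exposes an outlier that the median hides.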
How to Choose an Uptime Monitoring Tool
Check Interval
How often the tool checks your endpoints. Common options:
| Interval | Best for | Trade-off |
|---|---|---|
| 5 minutes | Personal projects, non-critical sites | Up to 5 min of undetected downtime |
| 1 minute | Production SaaS, APIs, e-commerce | Standard for most teams |
| 30 seconds | Critical infrastructure, checkout flows | Higher cost, faster detection |
| 15 seconds | Financial services, real-time systems | Enterprise-grade, highest cost |
For production services with SLA commitments, 1-minute intervals are the minimum standard.
Probe Regions
Where the tool checks from. More regions means better coverage and more reliable failure detection. Key considerations:
- Single region - cheapest but produces false positives from regional network issues
- Multi-region - checks from multiple locations simultaneously; with consensus, filters out false positives caused by single-region network issues
- Global coverage - 10+ regions for services with users worldwide
Vantaj checks from US-East, EU-West, and AP-Southeast simultaneously and uses multi-region consensus - an alert only fires when all regions confirm the failure.
Alerting Channels
Your monitoring tool should deliver alerts where your team actually sees them:
- Email - universal but easy to miss during off-hours
- Slack / Discord - reaches the team in real time during work hours
- SMS / phone call - for critical after-hours alerts that cannot wait
- Webhooks - for custom integrations (PagerDuty, OpsGenie, custom scripts)
The best tools let you route different severity levels to different channels.
False Positive Prevention
False positives erode trust in your alerting. After a few 3 AM wake-ups for phantom outages, teams start ignoring alerts - which means real outages get missed.
Look for:
- Multi-region consensus - requires failure confirmation from multiple locations
- Retry logic - re-checks before alerting
- Configurable thresholds - require multiple consecutive failures before alerting
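The consecutive-failure threshold in the last item can be sketched as a small counter. The class name `FailureStreak` and the default threshold of 3 are assumptions for illustration.

```python
class FailureStreak:
    """Fire an alert only after N consecutive failed checks."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.streak = 0

    def observe(self, failed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if failed:
            self.streak += 1
        else:
            self.streak = 0   # any success resets the count
        # Fire exactly once, at the moment the threshold is reached.
        return self.streak == self.threshold


detector = FailureStreak(threshold=3)
results = [True, True, False, True, True, True, True]
alerts = [detector.observe(failed) for failed in results]
```

In this run `alerts` is `True` only at the sixth result: the early pair of failures is reset by the success that follows, and the seventh failure does not re-alert.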
Additional Monitoring Types
Beyond HTTP checks, consider whether you need:
- SSL certificate monitoring - alerts before certificates expire
- Domain expiry monitoring - tracks registration dates and DNS changes
- Heartbeat / cron monitoring - detects silent failures in scheduled tasks
- Status pages - public-facing pages for customer communication
- Incident management - structured incident response with timeline updates
Vantaj includes all of these alongside HTTP monitoring in a single platform.
Pricing
Monitoring pricing varies widely:
| Tool | Free Tier | Starting Paid | 100 Monitors |
|---|---|---|---|
| Vantaj | 20 monitors | $9/mo | $29/mo |
| UptimeRobot | 50 monitors | $7/mo | ~$21/mo |
| Better Stack | 10 monitors | $24/mo/user | $24/mo/user |
| Pingdom | None | $15/mo | ~$75/mo |
| Datadog | 5 synthetics | $23/mo | ~$100+/mo |
For most small-to-mid teams, a tool with a generous free tier and transparent per-plan pricing (not per-user or per-test-run) offers the best value.
Uptime Monitoring Best Practices
- Monitor from multiple regions. Single-region monitoring produces false positives whenever the probe's own network has problems. Multi-region consensus filters these out.
- Use 1-minute intervals or faster for production. 5-minute intervals mean up to 5 minutes of undetected downtime per incident.
- Monitor the user's path, not just the server. Check your actual application URL, not just whether the server responds to pings. A server can be "up" while the application has crashed.
- Set up keyword checks. A server returning HTTP 200 with an error page is still a failure. Keyword checks catch this.
- Monitor SSL certificates. Auto-renewal fails silently more often than expected. Get alerted 30+ days before expiry.
- Track response time trends. Gradually increasing latency often precedes a full outage. Monitor trends, not just up/down status.
- Create a public status page. It can reduce support tickets during outages by as much as 30-60%, because customers check the status page instead of filing tickets.
- Use heartbeat monitoring for background jobs. Cron jobs and workers fail silently. A heartbeat monitor catches them.
- Route alerts appropriately. Send informational alerts to Slack. Send critical alerts to SMS or phone. Do not send everything to every channel - alert fatigue is real.
- Review and update monitors quarterly. As your infrastructure evolves, your monitoring should evolve with it. Remove monitors for decommissioned services. Add monitors for new endpoints.
Glossary of Uptime Monitoring Terms
| Term | Definition |
|---|---|
| Uptime | The percentage of time a service is available and functioning correctly |
| Downtime | Any period when a service is unavailable or not functioning correctly |
| Check interval | How often the monitoring tool tests your endpoint |
| Probe / probe region | A server location from which monitoring checks are sent |
| Multi-region consensus | Requiring failure confirmation from multiple probe locations before alerting |
| False positive | An alert triggered when no actual outage occurred (e.g., due to a network blip) |
| Incident | A period of detected downtime, from first failure to recovery |
| MTTD | Mean Time to Detect - average time from outage start to alert delivery |
| MTTR | Mean Time to Repair - average time from detection to resolution |
| MTBF | Mean Time Between Failures - average time between incidents |
| SLA | Service Level Agreement - a contractual commitment to a specific uptime percentage |
| Status page | A public page showing the current health and incident history of your services |
| Heartbeat monitor | A push-based check where your service pings the monitoring tool on a schedule |
| Dead man's switch | Another name for heartbeat monitoring - alerts when an expected signal stops |
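| SSL/TLS | The protocols that encrypt HTTPS connections; TLS is the current standard and SSL is its deprecated predecessor, though the name "SSL" persists in terms like "SSL certificate" |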
| Certificate chain | The path from your server's certificate through intermediates to a trusted root CA |
| WHOIS / RDAP | Protocols for querying domain registration data (owner, expiry, registrar) |
| DNS | Domain Name System - translates domain names to IP addresses |
| HTTP status code | A 3-digit code in the server's response indicating the result (200 = OK, 500 = Internal Server Error) |
| Latency | The time between sending a request and receiving the response |
| P95 / P99 | The response time that 95% or 99% of requests complete within |
| Keyword check | Verifying that specific text appears (or does not appear) in the response body |
| Grace period | Extra time a heartbeat monitor waits after a missed ping before alerting |
| Webhook | An HTTP callback that sends alert data to a custom URL for integration |
| Alert fatigue | When too many alerts (especially false ones) cause teams to ignore or delay responses |
| On-call rotation | A schedule determining which team member responds to alerts at any given time |
| Postmortem | A structured review of an incident - what happened, why, and how to prevent recurrence |
| RUM | Real User Monitoring - collecting performance data from actual user browser sessions |
| Synthetic monitoring | Sending automated test requests to simulate user interactions |
Frequently Asked Questions
What is uptime monitoring?
Uptime monitoring is the practice of automatically checking whether a website, API, or service is accessible and responding correctly. A monitoring tool sends regular HTTP requests (or other check types) to your endpoints from one or more global locations. When a check fails, the tool alerts your team via email, Slack, Discord, SMS, or webhooks so you can fix the issue before users are significantly affected.
How does uptime monitoring work?
A monitoring tool sends an HTTP request to your endpoint at a regular interval (e.g., every 60 seconds) from one or more probe regions. It checks whether the response meets your criteria (correct status code, expected content, acceptable response time). If the check fails and the failure is confirmed from multiple regions (multi-region consensus), the tool opens an incident and sends alerts to your team. When the endpoint recovers, the incident is closed automatically.
How much does uptime monitoring cost?
Free tiers are available from several providers: Vantaj (20 monitors), UptimeRobot (50 monitors), and Better Stack (10 monitors). Paid plans typically start between $7-$24/month depending on the tool, check intervals, and features included. Enterprise plans with advanced features like SSO, on-call policies, and 15-second intervals range from $50-$500+/month.
What is the difference between uptime monitoring and APM?
Uptime monitoring checks whether your service is accessible from outside your infrastructure (external, synthetic checks). Application Performance Monitoring (APM) runs inside your application and tracks code-level performance - function execution times, database query durations, memory usage, and error traces. Uptime monitoring answers "is it up?" APM answers "why is it slow?" Most teams need both, but uptime monitoring is the starting point.
What is multi-region consensus and why does it matter?
Multi-region consensus means running checks from multiple geographic locations simultaneously and only triggering an alert when all regions confirm a failure. Without consensus, a temporary network issue between a single probe location and your server triggers a false alert. With consensus, a blip that affects only one region no longer triggers an alert. Vantaj checks from US-East, EU-West, and AP-Southeast and requires all regions to confirm before alerting.
How often should I check my website?
For production services, 1-minute check intervals are the standard minimum. Critical services (checkout pages, payment APIs, services with SLA commitments) benefit from 30-second intervals. Personal projects and non-critical sites can use 5-minute intervals. The trade-off is always between detection speed and cost - faster intervals detect outages sooner but cost more on most platforms.
What should I monitor beyond my website?
A comprehensive monitoring setup includes: your main website, API endpoints, authentication/login flows, payment and checkout pages, CDN and static assets, SSL certificates (expiry and configuration), domain registrations (expiry and DNS records), background jobs and cron tasks (via heartbeat monitoring), and third-party services your application depends on.
What is the difference between synthetic monitoring and real user monitoring?
Synthetic monitoring sends automated requests to your endpoints at regular intervals from predefined locations - it simulates a user. Real User Monitoring (RUM) collects performance data from actual user sessions in their browsers. Synthetic monitoring tells you "is it up and how fast does it respond to a test?" RUM tells you "how fast does it actually load for real users on real devices and networks?" Most uptime monitoring tools focus on synthetic monitoring.
How do I reduce false positive alerts?
Use a monitoring tool with multi-region consensus (requires failure confirmation from multiple locations). Set appropriate timeouts - too short causes false timeouts for endpoints that are slow but not down. Use retry logic or require multiple consecutive failures before alerting. Avoid monitoring from a single region, as regional network issues will trigger false alerts.
What is a good uptime percentage?
For most SaaS products, 99.9% uptime (three nines) is the standard target - this allows about 8 hours and 46 minutes of downtime per year. Critical infrastructure targets 99.99% (four nines, about 52 minutes per year). Five nines (99.999%, about 5 minutes per year) is reserved for financial systems and infrastructure providers. Your target should match your SLA commitments and customer expectations.