[{"data":1,"prerenderedAt":921},["ShallowReactive",2],{"\u002Fblog\u002Fwhat-is-uptime-monitoring":3},{"id":4,"title":5,"author":6,"body":8,"category":909,"date":910,"description":911,"extension":912,"image":913,"lastUpdated":913,"meta":914,"navigation":915,"path":916,"readingTime":917,"seo":918,"stem":919,"__hash__":920},"blog\u002Fblog\u002Fwhat-is-uptime-monitoring.md","What is Uptime Monitoring? A Complete Guide (2026)",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":879},"minimark",[11,19,22,25,30,33,53,60,65,68,79,82,84,88,92,95,100,118,124,128,131,135,149,155,159,162,167,171,174,179,183,186,189,195,200,202,206,210,213,219,297,300,304,307,310,316,319,323,326,346,349,353,356,359,372,374,378,381,384,390,404,409,412,418,421,424,426,430,434,441,447,453,459,465,471,475,482,488,494,500,506,508,512,573,580,582,586,589,683,686,688,692,699,704,715,718,732,735,737,741,744,812,815,818,832,834,838,841,873,876],[12,13,14,18],"p",{},[15,16,17],"strong",{},"Uptime monitoring"," is the practice of continuously checking whether a website, API, or service is available and responding as expected. 
When a monitor detects a failure - the service is unreachable, responding with an error, or taking too long - it sends an alert to the responsible team.",[12,20,21],{},"The goal is simple: know about problems before your users do, and fix them faster.",[23,24],"hr",{},[26,27,29],"h2",{"id":28},"how-uptime-monitoring-works","How Uptime Monitoring Works",[12,31,32],{},"At its core, uptime monitoring is a loop:",[34,35,36,44,47,50],"ol",{},[37,38,39,40,43],"li",{},"A ",[15,41,42],{},"probe"," (a server in a monitoring datacenter) sends a request to your endpoint",[37,45,46],{},"It checks whether the response matches expectations (correct status code, expected content, acceptable response time)",[37,48,49],{},"If the check passes: nothing happens, the result is logged",[37,51,52],{},"If the check fails: the system triggers an alert and opens an incident",[12,54,55,56,59],{},"This loop repeats on a ",[15,57,58],{},"check interval"," - typically every 30 seconds to 5 minutes depending on your plan and requirements.",[61,62,64],"h3",{"id":63},"what-a-check-actually-does","What a check actually does",[12,66,67],{},"For an HTTP monitor, the probe:",[69,70,75],"pre",{"className":71,"code":73,"language":74},[72],"language-text","1. Resolves the DNS name of your endpoint\n2. Establishes a TCP connection\n3. Performs a TLS handshake (for HTTPS)\n4. Sends an HTTP GET (or configured method)\n5. Receives the response\n6. Validates: status code, response time, optional body content\n7. Records the result with a timestamp\n","text",[76,77,73],"code",{"__ignoreMap":78},"",[12,80,81],{},"Each step can fail independently. 
A DNS resolution failure looks different from a TLS handshake timeout, which looks different from a 500 response from your application - and good monitoring systems log enough detail to distinguish between them.",[23,83],{},[26,85,87],{"id":86},"types-of-uptime-monitoring","Types of Uptime Monitoring",[61,89,91],{"id":90},"httphttps-monitoring","HTTP\u002FHTTPS Monitoring",[12,93,94],{},"The most common type. Sends a request to a URL and validates the response.",[12,96,97],{},[15,98,99],{},"What you can check:",[101,102,103,106,109,112,115],"ul",{},[37,104,105],{},"HTTP status code (200, 404, 500, etc.)",[37,107,108],{},"Response time \u002F latency",[37,110,111],{},"Response body content (does it contain expected text?)",[37,113,114],{},"HTTP response headers",[37,116,117],{},"Redirect chain (does the URL redirect to the right place?)",[12,119,120,123],{},[15,121,122],{},"Best for:"," Websites, API endpoints, SaaS applications, landing pages",[61,125,127],{"id":126},"ssl-certificate-monitoring","SSL Certificate Monitoring",[12,129,130],{},"Checks whether your SSL\u002FTLS certificate is valid and alerts before it expires.",[12,132,133],{},[15,134,99],{},[101,136,137,140,143,146],{},[37,138,139],{},"Days until expiry",[37,141,142],{},"Certificate validity (is it properly signed?)",[37,144,145],{},"Certificate chain completeness",[37,147,148],{},"Whether the certificate matches the domain",[12,150,151,154],{},[15,152,153],{},"Why it matters:"," An expired SSL certificate makes major browsers show a full-page security warning instead of your site, effectively taking it offline for most users. Even with auto-renewal configured, renewal can fail silently.",[61,156,158],{"id":157},"domain-expiry-monitoring","Domain Expiry Monitoring",[12,160,161],{},"Checks when your domain registration expires and alerts before the renewal deadline.",[12,163,164,166],{},[15,165,153],{}," Domain registrations expire on a set date. 
If auto-renewal fails (expired payment method, registrar issue), your domain lapses: it typically stops resolving to your servers and, once the registrar's grace period ends, can be registered by someone else. The result is instant, total downtime for your entire business.",[61,168,170],{"id":169},"dns-record-monitoring","DNS Record Monitoring",[12,172,173],{},"Checks whether your DNS records are correct - particularly A records, CNAME records, MX records, and NS records.",[12,175,176,178],{},[15,177,153],{}," Incorrect or missing DNS records cause service failures that look like server outages but are actually routing problems. DNS monitoring catches misconfigurations from accidental changes or provider issues.",[61,180,182],{"id":181},"heartbeat-monitoring-cron-job-monitoring","Heartbeat Monitoring (Cron Job Monitoring)",[12,184,185],{},"The inverse of HTTP monitoring. Instead of your monitor reaching out to your service, your service is expected to ping the monitor on a schedule.",[12,187,188],{},"If the ping doesn't arrive within the expected window, an alert fires.",[12,190,191,194],{},[15,192,193],{},"What it monitors:"," Cron jobs, background workers, scheduled tasks, data pipelines, backup scripts",[12,196,197,199],{},[15,198,153],{}," Cron jobs fail silently by default. A backup job that stopped running three weeks ago won't tell you it stopped - until you need the backup. 
Heartbeat monitoring catches this.",[23,201],{},[26,203,205],{"id":204},"key-uptime-monitoring-metrics","Key Uptime Monitoring Metrics",[61,207,209],{"id":208},"uptime-percentage","Uptime Percentage",[12,211,212],{},"The proportion of time a service was available during a measurement period.",[69,214,217],{"className":215,"code":216,"language":74},[72],"Uptime % = (Total time - Downtime) \u002F Total time × 100\n",[76,218,216],{"__ignoreMap":78},[220,221,222,238],"table",{},[223,224,225],"thead",{},[226,227,228,232,235],"tr",{},[229,230,231],"th",{},"Uptime %",[229,233,234],{},"Downtime per year",[229,236,237],{},"Common label",[239,240,241,253,264,275,286],"tbody",{},[226,242,243,247,250],{},[244,245,246],"td",{},"99%",[244,248,249],{},"87 hours, 36 minutes",[244,251,252],{},"\"Two nines\"",[226,254,255,258,261],{},[244,256,257],{},"99.9%",[244,259,260],{},"8 hours, 45 minutes",[244,262,263],{},"\"Three nines\"",[226,265,266,269,272],{},[244,267,268],{},"99.95%",[244,270,271],{},"4 hours, 22 minutes",[244,273,274],{},"-",[226,276,277,280,283],{},[244,278,279],{},"99.99%",[244,281,282],{},"52 minutes, 33 seconds",[244,284,285],{},"\"Four nines\"",[226,287,288,291,294],{},[244,289,290],{},"99.999%",[244,292,293],{},"5 minutes, 15 seconds",[244,295,296],{},"\"Five nines\"",[12,298,299],{},"Most commercial SaaS products target 99.9% (\"three nines\"). Enterprise infrastructure and financial systems often target 99.99% or higher.",[61,301,303],{"id":302},"mean-time-to-detect-mttd","Mean Time to Detect (MTTD)",[12,305,306],{},"The average time between when a failure begins and when monitoring detects it.",[12,308,309],{},"MTTD is directly related to your check interval. If you check every 5 minutes and a failure occurs at 11:01 PM, your MTTD can be up to 5 minutes. 
With 1-minute checks, it's up to 1 minute.",[69,311,314],{"className":312,"code":313,"language":74},[72],"Average MTTD ≈ Check interval \u002F 2\n",[76,315,313],{"__ignoreMap":78},[12,317,318],{},"For a 1-minute check interval, your average MTTD is about 30 seconds.",[61,320,322],{"id":321},"mean-time-to-recovery-mttr","Mean Time to Recovery (MTTR)",[12,324,325],{},"The average time from when a failure begins to when the service is restored. Measured this way, MTTR includes:",[34,327,328,331,334,337,340,343],{},[37,329,330],{},"Detection time (covered by MTTD above)",[37,332,333],{},"Alert delivery time",[37,335,336],{},"Time for an engineer to acknowledge and begin investigating",[37,338,339],{},"Time to diagnose the root cause",[37,341,342],{},"Time to apply a fix",[37,344,345],{},"Time to verify recovery",[12,347,348],{},"MTTR is the single most important reliability metric for teams with customers. It's what determines how long customers experience an outage after it begins.",[61,350,352],{"id":351},"false-positive-rate","False Positive Rate",[12,354,355],{},"The percentage of alerts that turn out to be non-incidents - monitoring artifacts caused by network path issues, probe failures, or transient errors rather than actual service failures.",[12,357,358],{},"A high false positive rate erodes trust in monitoring. Teams start ignoring alerts, which means real incidents get missed.",[12,360,361,362,365,366,371],{},"False positives are primarily caused by ",[15,363,364],{},"single-region monitoring",": when a check runs from only one location, a routing issue between that probe and your server looks identical to your server being down. ",[367,368,370],"a",{"href":369},"#multi-region-consensus-why-it-matters","Multi-region consensus"," eliminates most false positives.",[23,373],{},[26,375,377],{"id":376},"multi-region-consensus-why-it-matters","Multi-Region Consensus: Why It Matters",[12,379,380],{},"Most monitoring tools check your endpoint from a single probe location. 
If that probe can't reach your server, it fires an alert.",[12,382,383],{},"The problem: the internet is not a single reliable network. A request from a monitoring probe in Frankfurt to a server in Virginia passes through multiple networks, transit providers, internet exchange points, and submarine cables - any of which can have transient failures unrelated to your infrastructure.",[12,385,386,389],{},[15,387,388],{},"Single-region monitoring false positive rate"," (simplified):",[101,391,392,395,398,401],{},[37,393,394],{},"If the network path between probe and server is 99.95% reliable",[37,396,397],{},"And you check once per minute (1,440 checks\u002Fday)",[37,399,400],{},"You expect ~0.72 path-related failures per day from that path alone",[37,402,403],{},"That's approximately 5 false alerts per week per monitor",[12,405,406,408],{},[15,407,370],{}," requires a failure to be confirmed from multiple independent probe locations before firing an alert. If Frankfurt says \"down\" but Virginia and Singapore say \"up,\" the system treats it as a network path issue, not a real outage.",[12,410,411],{},"With three-region consensus, all three paths must fail simultaneously. If each path has 99.95% reliability and failures are independent, the probability of all three failing at once is:",[69,413,416],{"className":414,"code":415,"language":74},[72],"0.0005 × 0.0005 × 0.0005 = 0.000000000125 = 0.0000000125%\n",[76,417,415],{"__ignoreMap":78},[12,419,420],{},"That's roughly one path-related false positive every 15,000 years.",[12,422,423],{},"Multi-region consensus is the most important architectural difference between monitoring tools. 
It's the difference between a tool that wakes you up at 3 AM for nothing and one that only alerts when something is genuinely wrong.",[23,425],{},[26,427,429],{"id":428},"what-to-monitor-and-what-not-to","What to Monitor (and What Not to)",[61,431,433],{"id":432},"monitor-these","Monitor these",[12,435,436,437,440],{},"✅ ",[15,438,439],{},"Health check endpoint"," - a dedicated route that tests your app's critical dependencies (database connection, cache, etc.), not just the homepage",[12,442,436,443,446],{},[15,444,445],{},"Critical API endpoints"," - auth routes, payment endpoints, your highest-traffic API paths",[12,448,436,449,452],{},[15,450,451],{},"SSL certificates"," - with 30+ day advance warning",[12,454,436,455,458],{},[15,456,457],{},"Domain expiry"," - with 60+ day advance warning",[12,460,436,461,464],{},[15,462,463],{},"Cron jobs and background workers"," - via heartbeat monitoring",[12,466,436,467,470],{},[15,468,469],{},"DNS records"," - for critical A, MX, and CNAME records",[61,472,474],{"id":473},"dont-make-these-mistakes","Don't make these mistakes",[12,476,477,478,481],{},"❌ ",[15,479,480],{},"Only monitoring the homepage"," - the homepage can return 200 while your API is completely broken",[12,483,477,484,487],{},[15,485,486],{},"Monitoring too many low-importance endpoints"," - alert fatigue from non-critical monitors drowns out real incidents",[12,489,477,490,493],{},[15,491,492],{},"Using 5-minute check intervals"," - that's up to 5 minutes of undetected downtime per incident",[12,495,477,496,499],{},[15,497,498],{},"No escalation policy"," - if the primary contact doesn't respond, someone else should be paged automatically",[12,501,477,502,505],{},[15,503,504],{},"Not testing your alerting"," - most teams discover their Slack integration is broken during an actual incident",[23,507],{},[26,509,511],{"id":510},"check-interval-how-often-should-you-check","Check Interval: How Often Should You 
Check?",[220,513,514,527],{},[223,515,516],{},[226,517,518,521,524],{},[229,519,520],{},"Check interval",[229,522,523],{},"Best for",[229,525,526],{},"Average MTTD",[239,528,529,540,551,562],{},[226,530,531,534,537],{},[244,532,533],{},"15–30 seconds",[244,535,536],{},"Production APIs, payment systems, critical infrastructure",[244,538,539],{},"8–15 seconds",[226,541,542,545,548],{},[244,543,544],{},"1 minute",[244,546,547],{},"Most SaaS production services",[244,549,550],{},"~30 seconds",[226,552,553,556,559],{},[244,554,555],{},"5 minutes",[244,557,558],{},"Non-critical services, dev\u002Fstaging environments",[244,560,561],{},"~2.5 minutes",[226,563,564,567,570],{},[244,565,566],{},"10–15 minutes",[244,568,569],{},"Basic availability checks, low-traffic services",[244,571,572],{},"~5–7 minutes",[12,574,575,576,579],{},"For most SaaS applications, ",[15,577,578],{},"1-minute checks"," on critical endpoints is the right default. The cost difference between 1-minute and 5-minute checks is small; the difference in detection time is significant.",[23,581],{},[26,583,585],{"id":584},"alert-channels","Alert Channels",[12,587,588],{},"When a monitor detects a failure, it needs to reach the right person through the right channel. 
Common alert delivery mechanisms:",[220,590,591,603],{},[223,592,593],{},[226,594,595,598,600],{},[229,596,597],{},"Channel",[229,599,523],{},[229,601,602],{},"Typical delivery time",[239,604,605,618,631,644,657,670],{},[226,606,607,612,615],{},[244,608,609],{},[15,610,611],{},"Email",[244,613,614],{},"Non-urgent alerts, audit trail",[244,616,617],{},"30 sec – 2 min",[226,619,620,625,628],{},[244,621,622],{},[15,623,624],{},"Slack\u002FDiscord",[244,626,627],{},"Teams that live in Slack, fast group visibility",[244,629,630],{},"5–30 seconds",[226,632,633,638,641],{},[244,634,635],{},[15,636,637],{},"SMS",[244,639,640],{},"Urgent on-call pages, off-hours alerts",[244,642,643],{},"10–60 seconds",[226,645,646,651,654],{},[244,647,648],{},[15,649,650],{},"Phone call",[244,652,653],{},"Critical systems, highest-urgency escalation",[244,655,656],{},"15–60 seconds",[226,658,659,664,667],{},[244,660,661],{},[15,662,663],{},"Webhook",[244,665,666],{},"Custom integrations, PagerDuty, incident management tools",[244,668,669],{},"Near-instant",[226,671,672,677,680],{},[244,673,674],{},[15,675,676],{},"OpsGenie\u002FPagerDuty",[244,678,679],{},"Enterprise on-call routing",[244,681,682],{},"Near-instant (then SMS\u002Fcall)",[12,684,685],{},"Most teams use Slack for primary alerts and email or SMS for escalation if no one acknowledges.",[23,687],{},[26,689,691],{"id":690},"status-pages","Status Pages",[12,693,694,695,698],{},"A status page is a public-facing page that shows the current operational status of your services. 
It's typically hosted at ",[76,696,697],{},"status.yourdomain.com",".",[12,700,701],{},[15,702,703],{},"What it communicates:",[101,705,706,709,712],{},[37,707,708],{},"Current status of each component (operational, degraded, outage)",[37,710,711],{},"Active incidents with status updates",[37,713,714],{},"Historical uptime and past incidents",[12,716,717],{},"Status pages serve two purposes:",[34,719,720,726],{},[37,721,722,725],{},[15,723,724],{},"During an incident"," - customers can see you're aware of the problem, reducing support ticket volume and panic",[37,727,728,731],{},[15,729,730],{},"During sales"," - enterprise customers check your historical uptime record before signing contracts",[12,733,734],{},"Most uptime monitoring tools can automatically update your status page when a monitor detects a failure, eliminating the lag between \"monitoring detected the issue\" and \"status page shows the issue.\"",[23,736],{},[26,738,740],{"id":739},"uptime-monitoring-vs-apm-vs-observability","Uptime Monitoring vs. APM vs. Observability",[12,742,743],{},"These terms overlap but serve different purposes:",[220,745,746,759],{},[223,747,748],{},[226,749,750,753,756],{},[229,751,752],{},"Tool type",[229,754,755],{},"What it answers",[229,757,758],{},"Examples",[239,760,761,773,786,799],{},[226,762,763,767,770],{},[244,764,765],{},[15,766,17],{},[244,768,769],{},"\"Is it up or down?\"",[244,771,772],{},"Vantaj, UptimeRobot, BetterStack",[226,774,775,780,783],{},[244,776,777],{},[15,778,779],{},"APM (Application Performance Monitoring)",[244,781,782],{},"\"How is it performing? 
Where are the bottlenecks?\"",[244,784,785],{},"Datadog APM, New Relic, Sentry",[226,787,788,793,796],{},[244,789,790],{},[15,791,792],{},"Observability",[244,794,795],{},"\"What is happening inside my system?\" (metrics, logs, traces)",[244,797,798],{},"Datadog, Grafana, Honeycomb",[226,800,801,806,809],{},[244,802,803],{},[15,804,805],{},"Synthetic monitoring",[244,807,808],{},"\"Does this user flow work?\"",[244,810,811],{},"Checkly, Datadog Synthetics",[12,813,814],{},"Uptime monitoring is not a replacement for APM or observability - it's the first line of detection. It answers \"are users affected right now?\" in seconds. APM and observability answer \"why?\" once you know something is wrong.",[12,816,817],{},"For most teams, the practical order of adoption is:",[34,819,820,823,826,829],{},[37,821,822],{},"Uptime monitoring (day one - free, 5-minute setup)",[37,824,825],{},"Error tracking (Sentry, Rollbar - early days)",[37,827,828],{},"APM (once you have performance problems to diagnose)",[37,830,831],{},"Full observability stack (once you have the engineering bandwidth to operate it)",[23,833],{},[26,835,837],{"id":836},"getting-started","Getting Started",[12,839,840],{},"Setting up basic uptime monitoring takes under 5 minutes:",[34,842,843,849,855,861,867],{},[37,844,845,848],{},[15,846,847],{},"Pick a tool"," - free tiers from Vantaj (20 monitors), UptimeRobot (50 monitors), or Better Stack (10 monitors) are sufficient to start",[37,850,851,854],{},[15,852,853],{},"Add your most critical URLs"," - at minimum: your homepage, your API health endpoint, and your main login\u002Fauth route",[37,856,857,860],{},[15,858,859],{},"Configure alert channels"," - add your Slack workspace or email",[37,862,863,866],{},[15,864,865],{},"Add SSL monitoring"," for each domain",[37,868,869,872],{},[15,870,871],{},"Set up a status page"," - link it from your site footer",[12,874,875],{},"The most expensive mistake in uptime monitoring isn't choosing the wrong tool - it's 
not setting it up at all.",[12,877,878],{},"A service that goes down for 4 hours before anyone notices is usually not a monitoring tool problem. It's a \"we didn't have monitoring\" problem.",{"title":78,"searchDepth":880,"depth":880,"links":881},2,[882,886,893,899,900,904,905,906,907,908],{"id":28,"depth":880,"text":29,"children":883},[884],{"id":63,"depth":885,"text":64},3,{"id":86,"depth":880,"text":87,"children":887},[888,889,890,891,892],{"id":90,"depth":885,"text":91},{"id":126,"depth":885,"text":127},{"id":157,"depth":885,"text":158},{"id":169,"depth":885,"text":170},{"id":181,"depth":885,"text":182},{"id":204,"depth":880,"text":205,"children":894},[895,896,897,898],{"id":208,"depth":885,"text":209},{"id":302,"depth":885,"text":303},{"id":321,"depth":885,"text":322},{"id":351,"depth":885,"text":352},{"id":376,"depth":880,"text":377},{"id":428,"depth":880,"text":429,"children":901},[902,903],{"id":432,"depth":885,"text":433},{"id":473,"depth":885,"text":474},{"id":510,"depth":880,"text":511},{"id":584,"depth":880,"text":585},{"id":690,"depth":880,"text":691},{"id":739,"depth":880,"text":740},{"id":836,"depth":880,"text":837},"guides","2026-06-24","Uptime monitoring checks whether your websites, APIs, and services are available and responding correctly. This guide covers how it works, what to monitor, and what metrics actually matter.","md",null,{},true,"\u002Fblog\u002Fwhat-is-uptime-monitoring",12,{"title":5,"description":911},"blog\u002Fwhat-is-uptime-monitoring","3F7nzwGhF-RMqvPZh8hMcVAdDlpz4B01kEmSkqlzoeo",1782314800226]