What Breaks in Real-World Monitoring (And Nobody Talks About)

The Dashboard Is Green. Your Users Disagree.

Every monitoring setup has the same comforting promise: if something breaks, you'll know. The dashboard says everything is operational. The uptime percentage is 99.98%. The SLA report looks great.

Then a customer emails: "Your app has been broken for two hours."

You check the monitoring dashboard. Green. All checks passing. Everything fine.

Except it isn't. The monitoring is checking the wrong thing, in the wrong way, from the wrong place. And nobody noticed because the dashboard never turned red.

Here are 12 ways this happens in production - not in theory, not in contrived examples, but in real systems run by competent teams who thought they had monitoring figured out.

1. The Health Check Endpoint That Lies

This is the most common failure mode, and the hardest to catch.

Your /health endpoint returns 200 OK. Your monitoring sees the 200 and marks the service as healthy. But the health endpoint is a simple route handler that returns a static response. It doesn't check the database connection. It doesn't verify that the cache is reachable. It doesn't confirm that the payment processor API is responding.

The database went down 20 minutes ago. Every user request that hits the database fails with a 500 error. But /health still returns 200, because /health doesn't touch the database.

The fix: Health check endpoints must verify actual dependencies - database connectivity, cache availability, critical third-party API reachability. A health endpoint that doesn't check dependencies isn't a health check. It's a liveness ping for your web server process, and that's a different thing entirely.
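
A minimal sketch of what such an endpoint might compute, assuming each dependency is wrapped in a probe function that raises on failure. The probe names and the JSON shape are illustrative, not taken from any particular framework:

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each dependency probe; a probe signals failure by raising."""
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency unhealthy
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    # Returning 503 lets monitors that only read status codes see the failure too.
    body = {"status": "ok" if healthy else "error", "checks": results}
    return (200 if healthy else 503), body


# Hypothetical probes; real ones might run "SELECT 1" or a cache PING
# with a short timeout.
def database_probe() -> None:
    raise ConnectionError("database connection refused")


def cache_probe() -> None:
    return None


status_code, body = run_health_checks(
    {"database": database_probe, "cache": cache_probe}
)
```

The route handler for /health would return status_code and body as-is. Keeping a separate, dependency-free liveness route is still reasonable, as long as nobody mistakes it for this one.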

2. SSL Certificates That Expire on Weekends

Auto-renewal is supposed to make certificate expiry a non-issue. Let's Encrypt, Certbot, managed certificates from your cloud provider - it should just work.

Until it doesn't. The ACME client's cron job failed silently three weeks ago. The DNS challenge can't resolve because someone changed the DNS provider. The certificate renewed, but the web server's reload command failed and it's still serving the old cert. The wildcard cert for *.example.com renewed, but the cert for api.example.com is a separate certificate that nobody remembered to include in the renewal pipeline.

These failures have a habit of surfacing on weekends and holidays. The cert expires Friday at 11 PM. Nobody's watching. From that moment on, Chrome shows "Your connection is not private" to every visitor. By the time someone notices late Monday morning, you've had 60 hours of users bouncing off a security warning.

The fix: Monitor certificate expiry dates independently of your renewal process. If a cert is expiring in less than 14 days, something has gone wrong with renewal, and you need to know now - not when browsers start rejecting it.
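
As a rough sketch, an independent expiry check needs little more than Python's standard ssl module. The function names here are made up for illustration, and the 14-day default mirrors the threshold above:

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # An already-expired certificate fails verification here and raises,
        # which a monitor should treat as a failure in its own right.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between now (Unix time) and the certificate's expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_attention(not_after: str, warn_days: float = 14,
                    now: Optional[float] = None) -> bool:
    """True when the certificate expires in fewer than warn_days days."""
    return days_until_expiry(not_after, now) < warn_days
```

Running needs_attention(fetch_not_after("api.example.com")) once per hostname, rather than once per domain, is what catches the separate certificate nobody added to the renewal pipeline. Because the check reads the certificate the server is actually serving, it also catches the case where renewal succeeded but the reload failed.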

3. The CDN Serving Stale Content With a 200

Your origin server goes down. Your CDN has the last successful response cached. It keeps serving that cached response with a 200 OK status code to every user.

Your monitoring checks the URL, gets a 200, and marks everything as healthy.

Meanwhile, the cached content is from 6 hours ago. The pricing page shows yesterday's prices. The dashboard shows stale data. The API endpoint returns outdated results. Users notice. Your monitoring doesn't.

The fix: Use response body validation. Check that the response contains a dynamic element - a timestamp, a unique request ID, a value that changes with each request. If the response body is identical across multiple checks, something is wrong even if the status code is fine. Alternatively, monitor the origin server directly, bypassing the CDN.
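
One way to sketch the timestamp variant: assume the endpoint includes a generated_at field in its JSON body (a hypothetical field name, as is the 5-minute tolerance) and fail the check when that value is too old:

```python
import json
from datetime import datetime, timedelta, timezone


def is_fresh(body: str, now: datetime,
             max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True if the body carries a recent 'generated_at' ISO 8601 timestamp."""
    try:
        payload = json.loads(body)
        generated_at = datetime.fromisoformat(payload["generated_at"])
    except (ValueError, KeyError, TypeError):
        return False  # a missing or unparseable timestamp counts as a failure
    if generated_at.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        generated_at = generated_at.replace(tzinfo=timezone.utc)
    return now - generated_at <= max_age
```

The now argument must be timezone-aware, for example datetime.now(timezone.utc). A response cached 6 hours ago fails this check even though the CDN returned it with a 200.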

4. DNS Propagation Delays That Look Like Outages

You migrate to a new hosting provider. You update your DNS records. Your monitoring tool resolves the domain to the new IP immediately because it doesn't cache DNS responses (or caches them only briefly).

But your users' DNS resolvers - their ISPs, their corporate DNS, public resolvers that cached the old record before the change - are still serving the old IP address. Some users get the new server. Some users get the old server, which is now returning errors or a default page.

Your monitoring says everything is fine, because from the monitoring probe's perspective, it is. Half your users disagree.

The fix: Monitor from probes that use different DNS resolvers, or run a dedicated DNS propagation check after changes. Don't assume that because your monitoring can resolve the new IP, your users can too. Resolvers keep serving the old record until its cached TTL expires, and some hold it longer, so the changeover can take 24–72 hours in the worst case. During that window, your uptime is a function of geography and resolver caching behavior.
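
The comparison step of a propagation check can be sketched as below. How you collect each resolver's answer is left out (dig or a DNS library would do); the IPs are documentation addresses and "isp-resolver" is a made-up label:

```python
from typing import Dict, Set


def resolvers_out_of_date(answers: Dict[str, Set[str]],
                          expected: Set[str]) -> Dict[str, Set[str]]:
    """Return the resolvers whose answer differs from the expected record set."""
    return {
        resolver: ips
        for resolver, ips in answers.items()
        if ips != expected
    }


# Example answers as they might look partway through a migration.
answers = {
    "8.8.8.8": {"203.0.113.10"},
    "1.1.1.1": {"203.0.113.10"},
    "isp-resolver": {"198.51.100.7"},  # hypothetical resolver, still on the old IP
}
stale = resolvers_out_of_date(answers, expected={"203.0.113.10"})
```

As long as stale is non-empty, some share of users is still being sent to the old server, so the old server needs to keep working (or redirect) until it drains.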

5. Monitoring the Wrong Endpoint Entirely

This one is embarrassingly common. The team sets up monitoring for https://example.com - the marketing homepage. It's served by a CDN, rarely changes, and almost never goes down.

Meanwhile, the actual product lives at https://app.example.com. The API is at https://api.example.com. The authentication service is at https://auth.example.com. None of these are monitored.

The app goes down. The homepage stays up. The monitoring is green. Users can't log in.

The fix: Monitor every user-facing surface, not just the main domain. Especially: login/auth endpoints, API health checks, payment flows, and any subdomain that serves critical functionality. If a user interacts with it, you should be monitoring it.
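
A simple way to keep this honest is to diff the hosts users depend on against the hosts that actually have a monitor. The sketch below assumes you maintain both lists somewhere; the /login and /health paths are placeholders:

```python
from urllib.parse import urlsplit


def unmonitored_hosts(critical_urls, monitored_urls):
    """Hosts that serve users but have no monitor pointed at them."""
    monitored = {urlsplit(url).hostname for url in monitored_urls}
    critical = {urlsplit(url).hostname for url in critical_urls}
    return sorted(critical - monitored)


critical = [
    "https://example.com",
    "https://app.example.com/login",
    "https://api.example.com/health",
    "https://auth.example.com",
]
monitored = ["https://example.com"]
gaps = unmonitored_hosts(critical, monitored)
```

Run as part of CI or a weekly job, a non-empty gaps list is a prompt to add monitors before the next outage does it for you.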

6. Regional Outages That Your Monitoring Misses

Your infrastructure is deployed globally - US-East, EU-West, AP-Southeast. Your monitoring probe is in US-East.

The EU-West deployment has a misconfigured load balancer. European users are getting 502 errors. Your monitoring in US-East checks the endpoint, gets routed to the US-East deployment, sees a 200, and reports all clear.

A third of your user base is down. Your monitoring says 100% uptime.

The fix: Monitor from every region where you have users, not just where your infrastructure is deployed. If your users are global, your monitoring must be global. A check from one region can only tell you about availability from that region.
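
Once checks run from several regions, the results have to be combined without letting healthy regions mask a broken one. A sketch of that aggregation, with region names taken from the example above and state names chosen for illustration:

```python
from typing import Dict, List, Tuple


def summarize_regions(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Map per-region pass/fail results to an overall state plus failing regions."""
    if not results:
        raise ValueError("no probe results: absence of data is not 'up'")
    failing = sorted(region for region, ok in results.items() if not ok)
    if not failing:
        return "up", failing
    if len(failing) == len(results):
        return "down", failing
    return "partial_outage", failing


state, failing = summarize_regions(
    {"us-east": True, "eu-west": False, "ap-southeast": True}
)
```

The important design choice is that a single failing region is never averaged away: it produces its own state and names the region, so the alert says "EU-West is down" rather than "67% of checks passed."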

7. Redirects That Hide Failures

Your monitoring is configured to check https://example.com. The server responds with a 301 (Moved Permanently) redirect to https://www.example.com. Your monitoring tool follows the redirect, gets a 200 from the final destination, and marks it as healthy.

Then someone changes the redirect. Now https://example.com redirects to https://example.com/maintenance. The maintenance page returns a 200. Your monitoring follows the redirect, sees the 200, and reports everything is fine.

Your users see a maintenance page. Your dashboard says the site is up.

The fix: Be deliberate about whether your monitoring follows redirects. If you're monitoring a URL that should return a direct 200, configure the check to treat redirects as failures. If you expect a redirect chain, verify the final destination URL is correct, not just the status code.
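
Verifying the final destination can be sketched with the standard library, since urllib follows redirects by default and reports where it ended up. The function names are illustrative:

```python
import urllib.request
from typing import Tuple


def probe(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch url, following redirects; return (status, final URL)."""
    # 4xx/5xx responses raise urllib.error.HTTPError instead of returning.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.geturl()


def landed_where_expected(status: int, final_url: str, expected_url: str) -> bool:
    """Healthy only if the chain ended with a 200 at the expected URL."""
    same_place = final_url.rstrip("/") == expected_url.rstrip("/")
    return status == 200 and same_place
```

A check would call status, final_url = probe("https://example.com") and then landed_where_expected(status, final_url, "https://www.example.com"). A redirect to /maintenance still returns 200, but it no longer passes.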

8. The Flapping Service Nobody Notices

Your service is up 95% of the time and down 5% of the time, in random 30-second bursts throughout the day. Your monitoring checks every 5 minutes. Most checks land during the 95% uptime window and pass.

Occasionally, a check catches a down period and an alert fires. Thirty seconds later, the service recovers. The monitoring sends a recovery notification. The team sees the alert and the immediate recovery, shrugs, and moves on.

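Alerts like this fire a few times a week, which is far fewer than the failure rate implies: at 5% downtime, about 1 check in 20 (roughly 14 of the 288 run each day) should land in a down window, and if the tool retries before alerting, most of those never become alerts. Nobody investigates because each individual incident looks trivial. But the aggregate impact is significant: 5% of user requests are failing. At 10,000 requests per day, spread evenly, that's 500 failed requests daily. Users are experiencing errors regularly, and the monitoring is technically reporting them - it's just reporting them in a way that makes them look insignificant.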

The fix: Track flap frequency, not just individual incidents. If a monitor goes down and up more than twice in an hour, something is systematically wrong. The individual events are noise; the pattern is the signal. Good monitoring tools should detect and flag flapping behavior as a distinct problem.
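
Flap detection can be sketched as counting up-to-down transitions inside a sliding window. The defaults below encode the "more than twice in an hour" rule; the data shape is an assumption:

```python
from typing import List, Tuple


def is_flapping(history: List[Tuple[float, bool]], now: float,
                window_seconds: float = 3600, max_outages: int = 2) -> bool:
    """history holds (unix_time, is_up) check results, oldest first."""
    outages = 0
    previous_up = True
    for timestamp, is_up in history:
        in_window = timestamp >= now - window_seconds
        if in_window and previous_up and not is_up:
            outages += 1  # an up -> down transition inside the window
        previous_up = is_up
    return outages > max_outages


# Three short outages in 30 minutes, each followed by a recovery.
history = [
    (0, True), (300, False), (600, True), (900, False),
    (1200, True), (1500, False), (1800, True),
]
flapping = is_flapping(history, now=1800)
```

Each of those outages would have produced a harmless-looking alert and recovery pair. Counted together, they cross the threshold and can raise a single, distinct "flapping" alert.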

9. API Responses That Return 200 With Error Bodies

Your API always returns HTTP 200. Errors are communicated in the response body: {"status": "error", "message": "database connection failed"}. This is a design pattern (arguably a bad one, but it's common, especially in older APIs and GraphQL endpoints).

Your monitoring checks the HTTP status code. 200. All good.

Your users get error responses on every request. Your monitoring never notices.

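The fix: Validate response content, not just status codes. Check for an expected value in the body (for example, that the parsed JSON has "status": "ok") or check that error strings are absent. Parsing the body is safer than raw string matching, which breaks on whitespace differences such as "status":"ok" versus "status": "ok". This catches the class of failures where the HTTP layer is fine but the application layer is broken.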
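
A minimal sketch of body validation for the status-in-body convention shown above. It parses the JSON instead of matching raw text, so spacing differences in the body don't matter:

```python
import json


def body_indicates_success(body: str) -> bool:
    """True only if the body is a JSON object whose "status" is "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON at all, e.g. an HTML error page from a proxy
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok"
```

Checking for the presence of the success value, rather than the absence of known error strings, means an unexpected body (empty, truncated, or an HTML page) fails the check by default.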

10. Third-Party Dependencies That Fail Silently

Your app depends on Stripe for payments, SendGrid for email, Twilio for SMS, and Auth0 for authentication. Your monitoring checks your app's endpoints. Everything returns 200.

But Stripe's API is returning 503s. Every checkout attempt fails. Your app handles the Stripe error gracefully - it shows a "please try again" message to the user and returns a 200 to the browser.

Your monitoring sees the 200. Your users see "please try again" for the sixth time.

The fix: Monitor critical third-party dependencies directly, in addition to your own endpoints. Check Stripe's API status, SendGrid's delivery endpoint, Auth0's token endpoint. When a third-party dependency fails, you want to know immediately - not after users report that checkout is broken.

11. Timeouts That Aren't Outages (But Hurt Just As Much)

Your API is technically up. It responds to every request. But under load, response times climb from 200ms to 8 seconds. Your monitoring has a 10-second timeout. The check passes, because technically the response arrived before the timeout.

But users experience an 8-second load time as a broken page. They close the tab. Mobile users on slower connections time out entirely. Your bounce rate spikes. Your conversion rate drops.

Your monitoring reports 100% uptime with slightly elevated response times. Nobody is alerted because "slow" isn't the same as "down."

The fix: Set meaningful performance thresholds, not just availability thresholds. If your API normally responds in 200ms and it starts taking 5 seconds, that should trigger an alert - even though the endpoint is technically available. Define "degraded" as a distinct state from "down" and alert on both.
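
Treating "degraded" as its own state can be sketched as a three-way classification. The 1,000 ms threshold is an assumed value for illustration; pick one relative to your own baseline:

```python
def classify(responded: bool, status: int, latency_ms: float,
             degraded_after_ms: float = 1000.0) -> str:
    """Classify one check result as "up", "degraded", or "down"."""
    if not responded or status >= 500:
        return "down"
    if latency_ms > degraded_after_ms:
        return "degraded"  # answered correctly, but too slowly for users
    return "up"


normal = classify(True, 200, 200)
slow = classify(True, 200, 8000)
timed_out = classify(False, 0, 10000)
```

With this in place, the 8-second response from the example above is reported as degraded instead of passing silently under a 10-second timeout.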

12. Monitoring That Only Gets Watched During Business Hours

This isn't a configuration issue - it's a human behavior issue. The team sets up monitoring and watches the dashboard during work hours. On evenings and weekends, nobody's looking.

The monitoring tool sends alerts at 2 AM. The alert goes to email. Nobody's checking email at 2 AM. The alert goes to Slack. Nobody's watching Slack at 2 AM. The alert goes to a PagerDuty-style rotation, but the on-call engineer silenced their phone because last week's false positive woke them up (see: alert fatigue).

The outage starts at midnight on Saturday night and runs until 8 AM Monday. 32 hours. The monitoring tool reported it immediately. Nobody was listening.

The fix: Alerts must reach a human who will act on them, 24/7. That means phone calls for critical services, not just Slack messages. It means on-call rotations with escalation policies. And it means solving your false positive problem first - because nobody will keep their phone on loud for alerts they don't trust.
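
An escalation policy is, at its core, a timetable for an unacknowledged alert. A sketch with entirely hypothetical timings, roles, and channels:

```python
from typing import List, Tuple

# (minutes since the alert fired, who is notified, how); all values hypothetical
POLICY: List[Tuple[int, str, str]] = [
    (0, "primary on-call", "push + Slack"),
    (5, "primary on-call", "phone call"),
    (15, "secondary on-call", "phone call"),
    (30, "engineering manager", "phone call"),
]


def steps_due(minutes_unacknowledged: int, policy=POLICY):
    """Every escalation step that should have fired by now."""
    return [
        (who, how)
        for after, who, how in policy
        if minutes_unacknowledged >= after
    ]


due_after_20_minutes = steps_due(20)
```

The point of the later rows is that a silenced phone or a sleeping engineer delays the response by minutes, not by a whole weekend.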

The Pattern Behind All of These

Every failure on this list shares the same root cause: the monitoring is testing a proxy for user experience, not the actual user experience.

Checking an HTTP status code is a proxy. Checking a health endpoint is a proxy. Checking from one geographic location is a proxy. Every level of indirection between "what the monitoring checks" and "what the user experiences" is a place where failures can hide.

The closer your monitoring gets to replicating what a real user does - from real locations, checking real endpoints, validating real response content - the fewer of these silent failures you'll miss.

No monitoring setup catches everything. But the 12 failures above aren't edge cases. They're the most common ways production monitoring fails in practice. If your setup is vulnerable to even a few of them, your green dashboard isn't telling you what you think it is.