How to Monitor an LLM API: What Uptime Tools Won't Tell You
Standard HTTP monitoring checks if your AI endpoint responds. It won't catch latency spikes, 429 rate limits, cold starts, or token exhaustion. Here's what LLM API monitoring actually requires.
Your LLM Endpoint Returns 200. That Tells You Almost Nothing.
Standard uptime monitoring checks whether a URL responds and whether it returns an expected status code. For a traditional API, that's a reasonable proxy for health.
For an LLM endpoint, it's nearly useless.
A 200 response from /v1/chat/completions tells you the service is alive. It doesn't tell you:
- Whether the response came back in 2 seconds or 45 seconds
- Whether you're about to hit your daily token quota
- Whether you're being silently rate limited at the organization level
- Whether the model you requested is actually available or fell back to a different one
- Whether the response content is valid JSON, properly formatted, and non-empty
These are the failure modes that actually break user-facing AI features. And almost none of them show up in a standard HTTP monitor.
The Four Ways LLM APIs Fail (That HTTP Monitoring Misses)
1. Latency Spikes
LLM inference is not like a database query. Response time varies with input token count, output length, model size, infrastructure load, and geographic distance to the model provider's datacenters.
A typical GPT-4o call might take 1.5 seconds under normal load. Under high load, or with a long output, it can take 30–60 seconds. Both return 200. Both look identical to a standard uptime monitor.
From a user experience perspective, they are not identical.
If your AI feature has an acceptable response time of 5 seconds and the model provider is regularly delivering in 15–20 seconds, your users are seeing a broken feature. Your uptime dashboard stays green.
What you actually need to monitor:
- P50, P95, and P99 latency - not just average
- Time-to-first-token (TTFT) separately from total response time, especially for streaming endpoints
- Latency trends over time, not just point-in-time checks
- Latency by input token count, if your use case has variable prompt lengths
A health check that sends a fixed short prompt and measures total response time gives you a consistent baseline. If that baseline starts drifting - 2 seconds becomes 5 seconds, then 8 seconds - something upstream changed.
2. Rate Limits and 429 Errors
Rate limiting from LLM providers is more complex than most APIs.
Most providers enforce limits at multiple levels simultaneously:
- Requests per minute (RPM) - total number of API calls
- Tokens per minute (TPM) - total tokens (input + output) processed per minute
- Tokens per day (TPD) - daily token budget, especially on free tiers
- Organization-level limits - separate from per-key limits, sometimes lower
A 429 response means one of these limits was hit. But which one? And is it a brief burst that will recover in 60 seconds, or a hard daily quota that resets at midnight?
Standard monitoring treats all 4xx responses as errors. But a 429 is a different kind of error than a 404 or a 401. It's temporary, self-resolving, and requires different handling in your application.
What you actually need to monitor:
- Track 429 response rates separately from other error rates
- Alert when 429 rate exceeds a threshold - not on first occurrence
- Monitor token consumption trends if the provider exposes usage headers (
x-ratelimit-remaining-tokens) - Set up a heartbeat that runs a minimal test prompt on a schedule to validate quota is healthy before peak usage
If your application doesn't have alerting specifically for quota exhaustion, you'll find out when users start getting errors - not before.
3. Cold Starts
Several LLM providers and inference platforms spin down compute when idle and restart on demand. This includes:
- Self-hosted models on auto-scaling infrastructure
- Smaller model providers and inference startups
- Fine-tuned models deployed on serverless GPU platforms (Modal, Replicate, Runpod)
- Open-source model deployments on spot infrastructure
Cold start latency can range from a few seconds to over a minute, depending on model size and platform. During a cold start, the API typically returns 200 - it just takes much longer than usual.
For user-facing features, a 45-second cold start is functionally a timeout. Users close the tab, report the feature as broken, or abandon the flow.
What you actually need to monitor:
- Track time-to-first-response, not just whether a response arrived
- Alert when response time exceeds a threshold that indicates a cold start (e.g., >10 seconds for a short prompt)
- For self-hosted deployments: monitor whether GPU workers are warm using a keep-alive heartbeat that fires every few minutes
- Consider a scheduled warm-up request that runs before peak usage hours
4. Degraded or Wrong Responses
This one is the hardest to monitor but often the most impactful.
An LLM can return:
- An empty
choicesarray with a 200 status - A response with
finish_reason: "length"indicating the output was cut off - A malformed JSON response that breaks downstream parsing
- A refusal or safety filter response that doesn't match the expected output format
- A response from the wrong model version if the requested model was unavailable
None of these are 5xx errors. None are 4xx errors. They all return 200. And they all break downstream behavior.
What you actually need to monitor:
- Validate that
choices[0].message.contentis non-empty - Check
finish_reason-"stop"is expected;"length"or"content_filter"may indicate problems - Validate that output matches expected structure (especially for JSON mode or tool-calling responses)
- Alert on elevated rates of truncated responses, which can indicate the provider is under load and reducing output quality
This kind of monitoring is closer to synthetic testing than uptime monitoring. You're not just checking if the endpoint is alive - you're checking if it's producing useful output.
What LLM API Monitoring Actually Looks Like
Here's a practical setup for monitoring a production LLM feature:
Layer 1: Basic Availability (HTTP Monitor)
Use a standard HTTP monitor to check that the endpoint responds at all. Set it up with:
- A short, fixed test prompt (e.g.,
"Reply with 'OK' and nothing else") - An expected response body check for
"OK"or the string you expect - A timeout of 15–20 seconds (longer than a normal API but accounts for variable inference time)
- Alerts on 5xx responses and on timeouts
This catches the basic cases: service is completely down, returning errors, or unresponsive.
Layer 2: Latency Baseline (Response Time Monitoring)
Configure your monitor to track response time trends and alert when they deviate significantly from baseline. Specifically:
- Alert if average response time for your test prompt exceeds 2–3x the historical baseline
- Track this metric weekly - gradual drift often signals infrastructure changes upstream
- For streaming endpoints, measure time to first byte separately
Layer 3: Error Rate Tracking (Keyword + Status Monitoring)
Run a scheduled monitor that:
- Checks for 429 response codes separately from other 4xx/5xx errors
- Validates that the response body contains expected fields (
choices,usage,model) - Checks that
usage.total_tokensis non-zero (a zero token count usually indicates a malformed request or empty response) - Alerts if
finish_reasonin the response is"content_filter"or"length"more than occasionally
Layer 4: Quota Health (Heartbeat / Scheduled Check)
For providers that expose quota information in response headers or via a separate /usage endpoint:
- Set up a daily check that queries current token usage vs. limits
- Run this before your peak usage window - not after you've already hit the limit
- Treat quota at >80% utilization as a warning, not a critical alert
Layer 5: Dependency Status (External Monitor)
Monitor your AI provider's status page directly:
- OpenAI:
https://status.openai.com/api/v2/status.json - Anthropic:
https://status.anthropic.com/api/v2/status.json - Most providers expose a machine-readable status endpoint
Set up an HTTP monitor on this endpoint and alert when status changes from "All Systems Operational". This gives you advance warning of provider-side degradation before it fully impacts your users - and helps you quickly determine whether an incident is on your side or theirs.
The Provider-Side Outage Problem
One of the hardest monitoring challenges for AI-powered applications is distinguishing between your infrastructure failing and your AI provider failing.
Standard monitoring can't tell the difference. Both show up as elevated error rates or latency spikes in your application metrics.
You need two separate monitoring layers:
- Your application endpoint - monitors whether your service is responding correctly end-to-end
- The provider's API directly - monitors whether OpenAI, Anthropic, or whoever you depend on is healthy
When both show problems simultaneously, it's almost certainly the provider. When only your application shows problems, it's almost certainly you.
Without both layers, you'll spend time debugging your infrastructure during provider outages, and miss application-side regressions when the provider is healthy.
Quick Reference: LLM API Failure Modes
| Failure Mode | Status Code | Caught by HTTP Monitor? | What to Actually Check |
|---|---|---|---|
| Service completely down | 503 / 0 | ✅ Yes | Standard HTTP check |
| Rate limit hit | 429 | ⚠️ Only if you check for it | Track 429 rate separately |
| Latency spike / cold start | 200 | ❌ No | Response time threshold alert |
| Quota exhaustion (soft) | 429 | ⚠️ Only if you check for it | Token usage headers / /usage endpoint |
| Empty or truncated output | 200 | ❌ No | Validate choices[0].message.content |
| Wrong model version | 200 | ❌ No | Check model field in response |
| Output cut off | 200 | ❌ No | Check finish_reason != "length" |
| Provider degradation | 200 (slow) | ❌ No | Monitor provider status page |
| Auth token expired | 401 | ✅ Yes | Standard HTTP check |
The Monitoring Gap Is Getting Larger
As more production systems depend on LLM APIs, the gap between "standard uptime monitoring" and "meaningful AI infrastructure monitoring" is growing.
A traditional API either works or it doesn't. Response time variance is usually small and predictable. Error modes are well-understood and well-documented.
LLM APIs are different in almost every dimension. They're probabilistic, slow, expensive per call, and fail in ways that look like success to naive monitoring.
Getting ahead of this means treating LLM API monitoring as its own discipline - not as an afterthought on top of your existing HTTP checks.
Your users will notice the difference before your monitoring does, unless you build the right checks first.