How to Monitor an LLM API: What Uptime Tools Won't Tell You

Your LLM Endpoint Returns 200. That Tells You Almost Nothing.

Standard uptime monitoring checks whether a URL responds and whether it returns an expected status code. For a traditional API, that's a reasonable proxy for health.

For an LLM endpoint, it's nearly useless.

A 200 response from /v1/chat/completions tells you the service is alive. It doesn't tell you:

Whether the response came back in 2 seconds or 45 seconds
Whether you're about to hit your daily token quota
Whether you're being silently rate limited at the organization level
Whether the model you requested is actually available or fell back to a different one
Whether the response content is valid JSON, properly formatted, and non-empty

These are the failure modes that actually break user-facing AI features. And almost none of them show up in a standard HTTP monitor.

The Four Ways LLM APIs Fail (That HTTP Monitoring Misses)

1. Latency Spikes

LLM inference is not like a database query. Response time varies with input token count, output length, model size, infrastructure load, and geographic distance to the model provider's datacenters.

A typical GPT-4o call might take 1.5 seconds under normal load. Under high load, or with a long output, it can take 30–60 seconds. Both return 200. Both look identical to a standard uptime monitor.

From a user experience perspective, they are not identical.

If your AI feature has an acceptable response time of 5 seconds and the model provider is regularly delivering in 15–20 seconds, your users are seeing a broken feature. Your uptime dashboard stays green.

What you actually need to monitor:

P50, P95, and P99 latency - not just average
Time-to-first-token (TTFT) separately from total response time, especially for streaming endpoints
Latency trends over time, not just point-in-time checks
Latency by input token count, if your use case has variable prompt lengths

A health check that sends a fixed short prompt and measures total response time gives you a consistent baseline. If that baseline starts drifting - 2 seconds becomes 5 seconds, then 8 seconds - something upstream changed.

2. Rate Limits and 429 Errors

Rate limiting from LLM providers is more complex than most APIs.

Most providers enforce limits at multiple levels simultaneously:

Requests per minute (RPM) - total number of API calls
Tokens per minute (TPM) - total tokens (input + output) processed per minute
Tokens per day (TPD) - daily token budget, especially on free tiers
Organization-level limits - separate from per-key limits, sometimes lower

A 429 response means one of these limits was hit. But which one? And is it a brief burst that will recover in 60 seconds, or a hard daily quota that resets at midnight?

Standard monitoring treats all 4xx responses as errors. But a 429 is a different kind of error than a 404 or a 401. It's temporary, self-resolving, and requires different handling in your application.

What you actually need to monitor:

Track 429 response rates separately from other error rates
Alert when 429 rate exceeds a threshold - not on first occurrence
Monitor token consumption trends if the provider exposes usage headers (x-ratelimit-remaining-tokens)
Set up a heartbeat that runs a minimal test prompt on a schedule to validate quota is healthy before peak usage

If your application doesn't have alerting specifically for quota exhaustion, you'll find out when users start getting errors - not before.

3. Cold Starts

Several LLM providers and inference platforms spin down compute when idle and restart on demand. This includes:

Self-hosted models on auto-scaling infrastructure
Smaller model providers and inference startups
Fine-tuned models deployed on serverless GPU platforms (Modal, Replicate, Runpod)
Open-source model deployments on spot infrastructure

Cold start latency can range from a few seconds to over a minute, depending on model size and platform. During a cold start, the API typically returns 200 - it just takes much longer than usual.

For user-facing features, a 45-second cold start is functionally a timeout. Users close the tab, report the feature as broken, or abandon the flow.

What you actually need to monitor:

Track time-to-first-response, not just whether a response arrived
Alert when response time exceeds a threshold that indicates a cold start (e.g., >10 seconds for a short prompt)
For self-hosted deployments: monitor whether GPU workers are warm using a keep-alive heartbeat that fires every few minutes
Consider a scheduled warm-up request that runs before peak usage hours

4. Degraded or Wrong Responses

This one is the hardest to monitor but often the most impactful.

An LLM can return:

An empty choices array with a 200 status
A response with finish_reason: "length" indicating the output was cut off
A malformed JSON response that breaks downstream parsing
A refusal or safety filter response that doesn't match the expected output format
A response from the wrong model version if the requested model was unavailable

None of these are 5xx errors. None are 4xx errors. They all return 200. And they all break downstream behavior.

What you actually need to monitor:

Validate that choices[0].message.content is non-empty
Check finish_reason - "stop" is expected; "length" or "content_filter" may indicate problems
Validate that output matches expected structure (especially for JSON mode or tool-calling responses)
Alert on elevated rates of truncated responses, which can indicate the provider is under load and reducing output quality

This kind of monitoring is closer to synthetic testing than uptime monitoring. You're not just checking if the endpoint is alive - you're checking if it's producing useful output.

What LLM API Monitoring Actually Looks Like

Here's a practical setup for monitoring a production LLM feature:

Layer 1: Basic Availability (HTTP Monitor)

Use a standard HTTP monitor to check that the endpoint responds at all. Set it up with:

A short, fixed test prompt (e.g., "Reply with 'OK' and nothing else")
An expected response body check for "OK" or the string you expect
A timeout of 15–20 seconds (longer than a normal API but accounts for variable inference time)
Alerts on 5xx responses and on timeouts

This catches the basic cases: service is completely down, returning errors, or unresponsive.

Layer 2: Latency Baseline (Response Time Monitoring)

Configure your monitor to track response time trends and alert when they deviate significantly from baseline. Specifically:

Alert if average response time for your test prompt exceeds 2–3x the historical baseline
Track this metric weekly - gradual drift often signals infrastructure changes upstream
For streaming endpoints, measure time to first byte separately

Layer 3: Error Rate Tracking (Keyword + Status Monitoring)

Run a scheduled monitor that:

Checks for 429 response codes separately from other 4xx/5xx errors
Validates that the response body contains expected fields (choices, usage, model)
Checks that usage.total_tokens is non-zero (a zero token count usually indicates a malformed request or empty response)
Alerts if finish_reason in the response is "content_filter" or "length" more than occasionally

Layer 4: Quota Health (Heartbeat / Scheduled Check)

For providers that expose quota information in response headers or via a separate /usage endpoint:

Set up a daily check that queries current token usage vs. limits
Run this before your peak usage window - not after you've already hit the limit
Treat quota at >80% utilization as a warning, not a critical alert

Layer 5: Dependency Status (External Monitor)

Monitor your AI provider's status page directly:

OpenAI: https://status.openai.com/api/v2/status.json
Anthropic: https://status.anthropic.com/api/v2/status.json
Most providers expose a machine-readable status endpoint

Set up an HTTP monitor on this endpoint and alert when status changes from "All Systems Operational". This gives you advance warning of provider-side degradation before it fully impacts your users - and helps you quickly determine whether an incident is on your side or theirs.

The Provider-Side Outage Problem

One of the hardest monitoring challenges for AI-powered applications is distinguishing between your infrastructure failing and your AI provider failing.

Standard monitoring can't tell the difference. Both show up as elevated error rates or latency spikes in your application metrics.

You need two separate monitoring layers:

Your application endpoint - monitors whether your service is responding correctly end-to-end
The provider's API directly - monitors whether OpenAI, Anthropic, or whoever you depend on is healthy

When both show problems simultaneously, it's almost certainly the provider. When only your application shows problems, it's almost certainly you.

Without both layers, you'll spend time debugging your infrastructure during provider outages, and miss application-side regressions when the provider is healthy.

Quick Reference: LLM API Failure Modes

Failure Mode	Status Code	Caught by HTTP Monitor?	What to Actually Check
Service completely down	503 / 0	✅ Yes	Standard HTTP check
Rate limit hit	429	⚠️ Only if you check for it	Track 429 rate separately
Latency spike / cold start	200	❌ No	Response time threshold alert
Quota exhaustion (soft)	429	⚠️ Only if you check for it	Token usage headers / /usage endpoint
Empty or truncated output	200	❌ No	Validate `choices[0].message.content`
Wrong model version	200	❌ No	Check `model` field in response
Output cut off	200	❌ No	Check `finish_reason != "length"`
Provider degradation	200 (slow)	❌ No	Monitor provider status page
Auth token expired	401	✅ Yes	Standard HTTP check

The Monitoring Gap Is Getting Larger

As more production systems depend on LLM APIs, the gap between "standard uptime monitoring" and "meaningful AI infrastructure monitoring" is growing.

A traditional API either works or it doesn't. Response time variance is usually small and predictable. Error modes are well-understood and well-documented.

LLM APIs are different in almost every dimension. They're probabilistic, slow, expensive per call, and fail in ways that look like success to naive monitoring.

Getting ahead of this means treating LLM API monitoring as its own discipline - not as an afterthought on top of your existing HTTP checks.

Your users will notice the difference before your monitoring does, unless you build the right checks first.