[{"data":1,"prerenderedAt":677},["ShallowReactive",2],{"\u002Fblog\u002Fmonitoring-llm-api":3},{"id":4,"title":5,"author":6,"body":8,"category":665,"date":666,"description":667,"extension":668,"image":669,"lastUpdated":669,"meta":670,"navigation":671,"path":672,"readingTime":673,"seo":674,"stem":675,"__hash__":676},"blog\u002Fblog\u002Fmonitoring-llm-api.md","How to Monitor an LLM API: What Uptime Tools Won't Tell You",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":643},"minimark",[11,16,20,23,31,50,53,57,62,65,68,71,74,80,94,97,101,104,107,133,136,139,143,161,164,168,171,185,188,191,195,209,213,216,219,244,247,251,285,288,292,295,299,302,323,326,330,333,344,348,351,386,390,397,408,412,415,432,439,443,446,449,452,467,470,473,477,624,628,631,634,637,640],[12,13,15],"h2",{"id":14},"your-llm-endpoint-returns-200-that-tells-you-almost-nothing","Your LLM Endpoint Returns 200. That Tells You Almost Nothing.",[17,18,19],"p",{},"Standard uptime monitoring checks whether a URL responds and whether it returns an expected status code. For a traditional API, that's a reasonable proxy for health.",[17,21,22],{},"For an LLM endpoint, it's nearly useless.",[17,24,25,26,30],{},"A 200 response from ",[27,28,29],"code",{},"\u002Fv1\u002Fchat\u002Fcompletions"," tells you the service is alive. It doesn't tell you:",[32,33,34,38,41,44,47],"ul",{},[35,36,37],"li",{},"Whether the response came back in 2 seconds or 45 seconds",[35,39,40],{},"Whether you're about to hit your daily token quota",[35,42,43],{},"Whether you're being silently rate limited at the organization level",[35,45,46],{},"Whether the model you requested is actually available or fell back to a different one",[35,48,49],{},"Whether the response content is valid JSON, properly formatted, and non-empty",[17,51,52],{},"These are the failure modes that actually break user-facing AI features. 
And almost none of them show up in a standard HTTP monitor.",[12,54,56],{"id":55},"the-four-ways-llm-apis-fail-that-http-monitoring-misses","The Four Ways LLM APIs Fail (That HTTP Monitoring Misses)",[58,59,61],"h3",{"id":60},"_1-latency-spikes","1. Latency Spikes",[17,63,64],{},"LLM inference is not like a database query. Response time varies with input token count, output length, model size, infrastructure load, and geographic distance to the model provider's datacenters.",[17,66,67],{},"A typical GPT-4o call might take 1.5 seconds under normal load. Under high load, or with a long output, it can take 30–60 seconds. Both return 200. Both look identical to a standard uptime monitor.",[17,69,70],{},"From a user experience perspective, they are not identical.",[17,72,73],{},"If your AI feature has an acceptable response time of 5 seconds and the model provider is regularly delivering in 15–20 seconds, your users are seeing a broken feature. Your uptime dashboard stays green.",[17,75,76],{},[77,78,79],"strong",{},"What you actually need to monitor:",[32,81,82,85,88,91],{},[35,83,84],{},"P50, P95, and P99 latency - not just average",[35,86,87],{},"Time-to-first-token (TTFT) separately from total response time, especially for streaming endpoints",[35,89,90],{},"Latency trends over time, not just point-in-time checks",[35,92,93],{},"Latency by input token count, if your use case has variable prompt lengths",[17,95,96],{},"A health check that sends a fixed short prompt and measures total response time gives you a consistent baseline. If that baseline starts drifting - 2 seconds becomes 5 seconds, then 8 seconds - something upstream changed.",[58,98,100],{"id":99},"_2-rate-limits-and-429-errors","2. 
Rate Limits and 429 Errors",[17,102,103],{},"Rate limiting from LLM providers is more complex than most APIs.",[17,105,106],{},"Most providers enforce limits at multiple levels simultaneously:",[32,108,109,115,121,127],{},[35,110,111,114],{},[77,112,113],{},"Requests per minute (RPM)"," - total number of API calls",[35,116,117,120],{},[77,118,119],{},"Tokens per minute (TPM)"," - total tokens (input + output) processed per minute",[35,122,123,126],{},[77,124,125],{},"Tokens per day (TPD)"," - daily token budget, especially on free tiers",[35,128,129,132],{},[77,130,131],{},"Organization-level limits"," - separate from per-key limits, sometimes lower",[17,134,135],{},"A 429 response means one of these limits was hit. But which one? And is it a brief burst that will recover in 60 seconds, or a hard daily quota that resets at midnight?",[17,137,138],{},"Standard monitoring treats all 4xx responses as errors. But a 429 is a different kind of error than a 404 or a 401. It's temporary, self-resolving, and requires different handling in your application.",[17,140,141],{},[77,142,79],{},[32,144,145,148,151,158],{},[35,146,147],{},"Track 429 response rates separately from other error rates",[35,149,150],{},"Alert when 429 rate exceeds a threshold - not on first occurrence",[35,152,153,154,157],{},"Monitor token consumption trends if the provider exposes usage headers (",[27,155,156],{},"x-ratelimit-remaining-tokens",")",[35,159,160],{},"Set up a heartbeat that runs a minimal test prompt on a schedule to validate quota is healthy before peak usage",[17,162,163],{},"If your application doesn't have alerting specifically for quota exhaustion, you'll find out when users start getting errors - not before.",[58,165,167],{"id":166},"_3-cold-starts","3. Cold Starts",[17,169,170],{},"Several LLM providers and inference platforms spin down compute when idle and restart on demand. 
This includes:",[32,172,173,176,179,182],{},[35,174,175],{},"Self-hosted models on auto-scaling infrastructure",[35,177,178],{},"Smaller model providers and inference startups",[35,180,181],{},"Fine-tuned models deployed on serverless GPU platforms (Modal, Replicate, Runpod)",[35,183,184],{},"Open-source model deployments on spot infrastructure",[17,186,187],{},"Cold start latency can range from a few seconds to over a minute, depending on model size and platform. During a cold start, the API typically returns 200 - it just takes much longer than usual.",[17,189,190],{},"For user-facing features, a 45-second cold start is functionally a timeout. Users close the tab, report the feature as broken, or abandon the flow.",[17,192,193],{},[77,194,79],{},[32,196,197,200,203,206],{},[35,198,199],{},"Track time-to-first-response, not just whether a response arrived",[35,201,202],{},"Alert when response time exceeds a threshold that indicates a cold start (e.g., >10 seconds for a short prompt)",[35,204,205],{},"For self-hosted deployments: monitor whether GPU workers are warm using a keep-alive heartbeat that fires every few minutes",[35,207,208],{},"Consider a scheduled warm-up request that runs before peak usage hours",[58,210,212],{"id":211},"_4-degraded-or-wrong-responses","4. 
Degraded or Wrong Responses",[17,214,215],{},"This one is the hardest to monitor but often the most impactful.",[17,217,218],{},"An LLM can return:",[32,220,221,228,235,238,241],{},[35,222,223,224,227],{},"An empty ",[27,225,226],{},"choices"," array with a 200 status",[35,229,230,231,234],{},"A response with ",[27,232,233],{},"finish_reason: \"length\""," indicating the output was cut off",[35,236,237],{},"A malformed JSON response that breaks downstream parsing",[35,239,240],{},"A refusal or safety filter response that doesn't match the expected output format",[35,242,243],{},"A response from the wrong model version if the requested model was unavailable",[17,245,246],{},"None of these are 5xx errors. None are 4xx errors. They all return 200. And they all break downstream behavior.",[17,248,249],{},[77,250,79],{},[32,252,253,260,279,282],{},[35,254,255,256,259],{},"Validate that ",[27,257,258],{},"choices[0].message.content"," is non-empty",[35,261,262,263,266,267,270,271,274,275,278],{},"Check ",[27,264,265],{},"finish_reason"," - ",[27,268,269],{},"\"stop\""," is expected; ",[27,272,273],{},"\"length\""," or ",[27,276,277],{},"\"content_filter\""," may indicate problems",[35,280,281],{},"Validate that output matches expected structure (especially for JSON mode or tool-calling responses)",[35,283,284],{},"Alert on elevated rates of truncated responses, which usually mean outputs are hitting a token limit (for example, a max_tokens cap) and may also signal a change upstream",[17,286,287],{},"This kind of monitoring is closer to synthetic testing than uptime monitoring. 
You're not just checking if the endpoint is alive - you're checking if it's producing useful output.",[12,289,291],{"id":290},"What LLM API Monitoring Actually Looks Like",[17,293,294],{},"Here's a practical setup for monitoring a production LLM feature:",[58,296,298],{"id":297},"Layer 1: Basic Availability (HTTP Monitor)",[17,300,301],{},"Use a standard HTTP monitor to check that the endpoint responds at all. Set it up with:",[32,303,304,310,317,320],{},[35,305,306,307,157],{},"A short, fixed test prompt (e.g., ",[27,308,309],{},"\"Reply with 'OK' and nothing else\"",[35,311,312,313,316],{},"An expected response body check for ",[27,314,315],{},"\"OK\""," or the string you expect",[35,318,319],{},"A timeout of 15–20 seconds (longer than you would set for a typical API, to allow for variable inference time)",[35,321,322],{},"Alerts on 5xx responses and on timeouts",[17,324,325],{},"This catches the basic cases: service is completely down, returning errors, or unresponsive.",[58,327,329],{"id":328},"Layer 2: Latency Baseline (Response Time Monitoring)",[17,331,332],{},"Configure your monitor to track response time trends and alert when they deviate significantly from baseline. 
Specifically:",[32,334,335,338,341],{},[35,336,337],{},"Alert if the median or P95 response time for your test prompt exceeds 2–3x the historical baseline",[35,339,340],{},"Track this metric weekly - gradual drift often signals infrastructure changes upstream",[35,342,343],{},"For streaming endpoints, measure time to first byte separately",[58,345,347],{"id":346},"Layer 3: Error Rate Tracking (Keyword + Status Monitoring)",[17,349,350],{},"Run a scheduled monitor that:",[32,352,353,356,368,375],{},[35,354,355],{},"Checks for 429 response codes separately from other 4xx\u002F5xx errors",[35,357,358,359,361,362,361,365,157],{},"Validates that the response body contains expected fields (",[27,360,226],{},", ",[27,363,364],{},"usage",[27,366,367],{},"model",[35,369,370,371,374],{},"Checks that ",[27,372,373],{},"usage.total_tokens"," is non-zero (a zero token count usually indicates a malformed request or empty response)",[35,376,377,378,380,381,274,383,385],{},"Alerts if ",[27,379,265],{}," in the response is ",[27,382,277],{},[27,384,273],{}," more than occasionally",[58,387,389],{"id":388},"Layer 4: Quota Health (Heartbeat \u002F Scheduled Check)",[17,391,392,393,396],{},"For providers that expose quota information in response headers or via a separate ",[27,394,395],{},"\u002Fusage"," endpoint:",[32,398,399,402,405],{},[35,400,401],{},"Set up a daily check that queries current token usage vs. 
limits",[35,403,404],{},"Run this before your peak usage window - not after you've already hit the limit",[35,406,407],{},"Treat quota at >80% utilization as a warning, not a critical alert",[58,409,411],{"id":410},"layer-5-dependency-status-external-monitor","Layer 5: Dependency Status (External Monitor)",[17,413,414],{},"Monitor your AI provider's status page directly:",[32,416,417,423,429],{},[35,418,419,420],{},"OpenAI: ",[27,421,422],{},"https:\u002F\u002Fstatus.openai.com\u002Fapi\u002Fv2\u002Fstatus.json",[35,424,425,426],{},"Anthropic: ",[27,427,428],{},"https:\u002F\u002Fstatus.anthropic.com\u002Fapi\u002Fv2\u002Fstatus.json",[35,430,431],{},"Most providers expose a machine-readable status endpoint",[17,433,434,435,438],{},"Set up an HTTP monitor on this endpoint and alert when status changes from ",[27,436,437],{},"\"All Systems Operational\"",". This gives you advance warning of provider-side degradation before it fully impacts your users - and helps you quickly determine whether an incident is on your side or theirs.",[12,440,442],{"id":441},"the-provider-side-outage-problem","The Provider-Side Outage Problem",[17,444,445],{},"One of the hardest monitoring challenges for AI-powered applications is distinguishing between your infrastructure failing and your AI provider failing.",[17,447,448],{},"Standard monitoring can't tell the difference. Both show up as elevated error rates or latency spikes in your application metrics.",[17,450,451],{},"You need two separate monitoring layers:",[453,454,455,461],"ol",{},[35,456,457,460],{},[77,458,459],{},"Your application endpoint"," - monitors whether your service is responding correctly end-to-end",[35,462,463,466],{},[77,464,465],{},"The provider's API directly"," - monitors whether OpenAI, Anthropic, or whoever you depend on is healthy",[17,468,469],{},"When both show problems simultaneously, it's almost certainly the provider. 
When only your application shows problems, it's almost certainly you.",[17,471,472],{},"Without both layers, you'll spend time debugging your infrastructure during provider outages, and miss application-side regressions when the provider is healthy.",[12,474,476],{"id":475},"quick-reference-llm-api-failure-modes","Quick Reference: LLM API Failure Modes",[478,479,480,499],"table",{},[481,482,483],"thead",{},[484,485,486,490,493,496],"tr",{},[487,488,489],"th",{},"Failure Mode",[487,491,492],{},"Status Code",[487,494,495],{},"Caught by HTTP Monitor?",[487,497,498],{},"What to Actually Check",[500,501,502,517,531,545,557,571,585,599,612],"tbody",{},[484,503,504,508,511,514],{},[505,506,507],"td",{},"Service completely down",[505,509,510],{},"503 \u002F 0",[505,512,513],{},"✅ Yes",[505,515,516],{},"Standard HTTP check",[484,518,519,522,525,528],{},[505,520,521],{},"Rate limit hit",[505,523,524],{},"429",[505,526,527],{},"⚠️ Only if you check for it",[505,529,530],{},"Track 429 rate separately",[484,532,533,536,539,542],{},[505,534,535],{},"Latency spike \u002F cold start",[505,537,538],{},"200",[505,540,541],{},"❌ No",[505,543,544],{},"Response time threshold alert",[484,546,547,550,552,554],{},[505,548,549],{},"Quota exhaustion (soft)",[505,551,524],{},[505,553,527],{},[505,555,556],{},"Token usage headers \u002F \u002Fusage endpoint",[484,558,559,562,564,566],{},[505,560,561],{},"Empty or truncated output",[505,563,538],{},[505,565,541],{},[505,567,568,569],{},"Validate ",[27,570,258],{},[484,572,573,576,578,580],{},[505,574,575],{},"Wrong model version",[505,577,538],{},[505,579,541],{},[505,581,262,582,584],{},[27,583,367],{}," field in response",[484,586,587,590,592,594],{},[505,588,589],{},"Output cut off",[505,591,538],{},[505,593,541],{},[505,595,262,596],{},[27,597,598],{},"finish_reason != \"length\"",[484,600,601,604,607,609],{},[505,602,603],{},"Provider degradation",[505,605,606],{},"200 (slow)",[505,608,541],{},[505,610,611],{},"Monitor provider status 
page",[484,613,614,617,620,622],{},[505,615,616],{},"Auth token expired",[505,618,619],{},"401",[505,621,513],{},[505,623,516],{},[12,625,627],{"id":626},"the-monitoring-gap-is-getting-larger","The Monitoring Gap Is Getting Larger",[17,629,630],{},"As more production systems depend on LLM APIs, the gap between \"standard uptime monitoring\" and \"meaningful AI infrastructure monitoring\" is growing.",[17,632,633],{},"A traditional API either works or it doesn't. Response time variance is usually small and predictable. Error modes are well-understood and well-documented.",[17,635,636],{},"LLM APIs are different in almost every dimension. They're probabilistic, slow, expensive per call, and fail in ways that look like success to naive monitoring.",[17,638,639],{},"Getting ahead of this means treating LLM API monitoring as its own discipline - not as an afterthought on top of your existing HTTP checks.",[17,641,642],{},"Your users will notice the difference before your monitoring does, unless you build the right checks first.",{"title":644,"searchDepth":645,"depth":645,"links":646},"",2,[647,648,655,662,663,664],{"id":14,"depth":645,"text":15},{"id":55,"depth":645,"text":56,"children":649},[650,652,653,654],{"id":60,"depth":651,"text":61},3,{"id":99,"depth":651,"text":100},{"id":166,"depth":651,"text":167},{"id":211,"depth":651,"text":212},{"id":290,"depth":645,"text":291,"children":656},[657,658,659,660,661],{"id":297,"depth":651,"text":298},{"id":328,"depth":651,"text":329},{"id":346,"depth":651,"text":347},{"id":388,"depth":651,"text":389},{"id":410,"depth":651,"text":411},{"id":441,"depth":645,"text":442},{"id":475,"depth":645,"text":476},{"id":626,"depth":645,"text":627},"infrastructure","2026-06-23","Standard HTTP monitoring checks if your AI endpoint responds. It won't catch latency spikes, 429 rate limits, cold starts, or token exhaustion. 
Here's what LLM API monitoring actually requires.","md",null,{},true,"\u002Fblog\u002Fmonitoring-llm-api",11,{"title":5,"description":667},"blog\u002Fmonitoring-llm-api","6JEIITfjBUZ9r5bsPt-GMmb6q4R6hpl1xz0fdKmtKIc",1782237389272]