[{"data":1,"prerenderedAt":431},["ShallowReactive",2],{"\u002Fblog\u002Falert-fatigue-assessment":3},{"id":4,"title":5,"author":6,"body":8,"category":419,"date":420,"description":421,"extension":422,"image":423,"lastUpdated":423,"meta":424,"navigation":425,"path":426,"readingTime":427,"seo":428,"stem":429,"__hash__":430},"blog\u002Fblog\u002Falert-fatigue-assessment.md","Alert Fatigue Assessment: Score Your Team's Alert Quality",{"name":7},"Vantaj Team",{"type":9,"value":10,"toc":395},"minimark",[11,16,20,23,27,30,128,132,143,146,153,165,171,175,181,187,193,199,203,208,211,215,218,222,225,247,250,254,257,271,274,278,281,335,339,342,345,348,356,360,364,367,371,374,378,381,385,388,392],[12,13,15],"h2",{"id":14},"take-the-assessment","Take the Assessment",[17,18,19],"p",{},"Answer 8 questions about your team's alerting setup. The calculator scores your alert quality from 0 to 100 and gives you a specific breakdown of what to fix.",[21,22],"alert-fatigue-assessment",{},[12,24,26],{"id":25},"what-this-score-measures","What This Score Measures",[17,28,29],{},"The assessment evaluates seven dimensions of alert health. 
Each one contributes to your total score based on how much it affects incident response and engineer wellbeing.",[31,32,33,49],"table",{},[34,35,36],"thead",{},[37,38,39,43,46],"tr",{},[40,41,42],"th",{},"Dimension",[40,44,45],{},"Weight",[40,47,48],{},"What it measures",[50,51,52,64,75,86,96,107,117],"tbody",{},[37,53,54,58,61],{},[55,56,57],"td",{},"Signal-to-noise ratio",[55,59,60],{},"25%",[55,62,63],{},"Percentage of alerts that lead to an action",[37,65,66,69,72],{},[55,67,68],{},"Ignore rate",[55,70,71],{},"20%",[55,73,74],{},"How often engineers dismiss alerts without investigating",[37,76,77,80,83],{},[55,78,79],{},"Duplicate rate",[55,81,82],{},"15%",[55,84,85],{},"How many alerts fire for the same underlying problem",[37,87,88,91,93],{},[55,89,90],{},"Acknowledgment speed",[55,92,82],{},[55,94,95],{},"Time from alert to someone claiming ownership",[37,97,98,101,104],{},[55,99,100],{},"Tuning discipline",[55,102,103],{},"10%",[55,105,106],{},"Whether your team reviews and adjusts thresholds",[37,108,109,112,114],{},[55,110,111],{},"Volume per engineer",[55,113,103],{},[55,115,116],{},"Weekly alert load divided across on-call rotation",[37,118,119,122,125],{},[55,120,121],{},"Off-hours burden",[55,123,124],{},"5%",[55,126,127],{},"How much of your alert volume wakes people up at night",[12,129,131],{"id":130},"why-alert-fatigue-kills-reliability","Why Alert Fatigue Kills Reliability",[17,133,134,135,142],{},"PagerDuty's 2023 State of Digital Operations report found that teams receive an average of 4,300 alerts per month. Of those, fewer than half require any response (Source: ",[136,137,141],"a",{"href":138,"rel":139},"https:\u002F\u002Fwww.pagerduty.com\u002Fresources\u002Freports\u002Fdigital-operations\u002F",[140],"nofollow","PagerDuty, 2023",").",[17,144,145],{},"That means engineers spend more time dismissing noise than responding to real incidents. 
The downstream effects compound:",[17,147,148,152],{},[149,150,151],"strong",{},"Response times degrade."," When 60% of alerts are false positives, the rational response is to stop treating alerts as urgent. Median acknowledgment time creeps from 3 minutes to 15, then to 30. By the time a real P1 fires, the instinct to act fast has been worn away.",[17,154,155,158,159,164],{},[149,156,157],{},"Engineers leave."," Catchpoint's 2024 SRE survey reported that 66% of SREs experience burnout related to on-call duties, and excessive alerting is the primary driver (Source: ",[136,160,163],{"href":161,"rel":162},"https:\u002F\u002Fwww.catchpoint.com\u002Fsre-report",[140],"Catchpoint SRE Report, 2024","). Replacing a senior engineer costs 6 to 9 months of salary. Alert fatigue is a retention problem disguised as a monitoring problem.",[17,166,167,170],{},[149,168,169],{},"Incidents get longer."," Google's SRE Workbook documents that teams with alert-to-incident ratios above 5:1 take 40% longer to resolve real outages. The noise buries the signal.",[12,172,174],{"id":173},"how-to-interpret-your-score","How to Interpret Your Score",[17,176,177,180],{},[149,178,179],{},"80-100 (Healthy):"," Your alerting setup is working. Most alerts are actionable, duplicates are controlled, and acknowledgment times are fast. Maintain your tuning schedule and revisit quarterly.",[17,182,183,186],{},[149,184,185],{},"60-79 (Needs work):"," You have specific weak spots. The breakdown shows which dimensions drag your score down. Start with the highest-weighted problem and fix it before moving to the next.",[17,188,189,192],{},[149,190,191],{},"40-59 (Fatigued):"," Your team has started to distrust the alerting system. Engineers are ignoring notifications, and response times have crept up. You need a focused cleanup: audit every alert, delete those with no runbook, and consolidate duplicates.",[17,194,195,198],{},[149,196,197],{},"0-39 (Critical):"," Alerts have become background noise. 
A P1 incident will take longer to detect and resolve because the system has lost credibility. Pause and rebuild: start from zero, add alerts back one at a time with documented thresholds and response procedures.",[12,200,202],{"id":201},"five-fixes-that-move-the-score","Five Fixes That Move the Score",[204,205,207],"h3",{"id":206},"_1-delete-alerts-that-have-no-defined-response","1. Delete alerts that have no defined response",[17,209,210],{},"Open your alert configuration. For each alert, ask: \"If this fires at 3 AM, what does the on-call engineer do?\" If the answer is \"check the dashboard and wait,\" delete the alert or downgrade it to a log entry. An alert without a response procedure is noise.",[204,212,214],{"id":213},"_2-group-correlated-alerts-into-a-single-incident","2. Group correlated alerts into a single incident",[17,216,217],{},"A database failover should not produce 12 separate alerts from the database, the connection pool, the API, the frontend, the health check, and the load balancer. Configure alert grouping rules so one failure produces one notification. Vantaj's multi-region consensus approach checks from multiple locations before alerting, which eliminates transient false positives at the source.",[204,219,221],{"id":220},"_3-set-tiered-severity-with-different-routing","3. Set tiered severity with different routing",[17,223,224],{},"Not every alert should page the on-call engineer. 
Create three tiers:",[226,227,228,235,241],"ul",{},[229,230,231,234],"li",{},[149,232,233],{},"P1 (page immediately):"," User-facing outage, data loss risk, SLA breach imminent",[229,236,237,240],{},[149,238,239],{},"P2 (Slack notification, ack within 30 min):"," Degraded performance, partial outage, elevated error rate",[229,242,243,246],{},[149,244,245],{},"P3 (next business day):"," Capacity warning, certificate expiring in 14 days, non-critical service degradation",[17,248,249],{},"Most teams have one tier: \"send everything to PagerDuty.\" That guarantees fatigue.",[204,251,253],{"id":252},"_4-schedule-monthly-alert-reviews","4. Schedule monthly alert reviews",[17,255,256],{},"Put a recurring 30-minute meeting on the calendar. In each review:",[226,258,259,262,265,268],{},[229,260,261],{},"Delete alerts that fired but never required action in the past 30 days",[229,263,264],{},"Retune alerts that fire too frequently (lengthen the evaluation window or raise the threshold)",[229,266,267],{},"Consolidate alerts that always fire together into a single grouped alert",[229,269,270],{},"Check if new alerts are needed for recent incidents that lacked detection",[17,272,273],{},"Teams that do this consistently see alert volume drop 30-50% within three months without losing coverage.",[204,275,277],{"id":276},"_5-track-alert-to-incident-ratio-monthly","5. Track alert-to-incident ratio monthly",[17,279,280],{},"Count your total alerts and your total real incidents. Divide. If the ratio exceeds 5:1, you have a noise problem. Track this metric monthly and set a target to bring it under 3:1.",[31,282,283,293],{},[34,284,285],{},[37,286,287,290],{},[40,288,289],{},"Ratio",[40,291,292],{},"Interpretation",[50,294,295,303,311,319,327],{},[37,296,297,300],{},[55,298,299],{},"Under 2:1",[55,301,302],{},"Tight. Possible gap in coverage (too few alerts)",[37,304,305,308],{},[55,306,307],{},"2:1 to 3:1",[55,309,310],{},"Healthy. 
Most alerts correspond to real problems",[37,312,313,316],{},[55,314,315],{},"3:1 to 5:1",[55,317,318],{},"Noisy. Review and prune low-signal alerts",[37,320,321,324],{},[55,322,323],{},"5:1 to 10:1",[55,325,326],{},"Fatigued. Engineers are likely ignoring some alerts",[37,328,329,332],{},[55,330,331],{},"Above 10:1",[55,333,334],{},"Critical. Alerting system has lost credibility",[12,336,338],{"id":337},"how-vantaj-reduces-alert-noise","How Vantaj Reduces Alert Noise",[17,340,341],{},"Vantaj checks your services from multiple geographic regions on every cycle. An alert fires only when the majority of regions confirm the failure. A single-region network blip does not page your engineer at 3 AM.",[17,343,344],{},"Each monitor supports configurable confirmation thresholds, retry intervals, and escalation policies. You define what counts as a real problem, and Vantaj enforces that definition before anyone gets notified.",[17,346,347],{},"Combined with SSL expiry alerts (with configurable lead times), domain expiry monitoring, and heartbeat checks for cron jobs and background workers, you cover the full stack without stacking up redundant alerts from multiple tools.",[17,349,350,355],{},[136,351,354],{"href":352,"rel":353},"https:\u002F\u002Fapp.vantaj.co\u002Fregister",[140],"Start monitoring free"," and keep your alert quality score above 80.",[12,357,359],{"id":358},"faq","FAQ",[204,361,363],{"id":362},"what-is-a-good-alert-to-incident-ratio","What is a good alert-to-incident ratio?",[17,365,366],{},"Between 2:1 and 3:1. This means most alerts correspond to real problems that require investigation. Below 2:1 may indicate insufficient monitoring coverage. Above 5:1 indicates alert fatigue is setting in.",[204,368,370],{"id":369},"how-many-alerts-per-week-is-too-many","How many alerts per week is too many?",[17,372,373],{},"It depends on team size. As a benchmark, more than 25 alerts per on-call engineer per week degrades response quality. 
At 50+ per engineer per week, on-call rotations become unsustainable and turnover risk increases.",[204,375,377],{"id":376},"should-every-alert-page-someone","Should every alert page someone?",[17,379,380],{},"No. Only P1 alerts (user-facing outage, data loss risk, imminent SLA breach) should trigger a page. P2 alerts should go to a Slack channel with a 30-minute acknowledgment window. P3 alerts should queue for the next business day.",[204,382,384],{"id":383},"how-often-should-we-review-alert-thresholds","How often should we review alert thresholds?",[17,386,387],{},"Monthly. Alert conditions drift as traffic patterns, infrastructure, and application behavior change. A threshold that was accurate six months ago may fire constantly or miss real problems today.",[204,389,391],{"id":390},"what-is-the-difference-between-alert-fatigue-and-on-call-burnout","What is the difference between alert fatigue and on-call burnout?",[17,393,394],{},"Alert fatigue is the specific condition where engineers stop responding to alerts because too many of them are false positives. On-call burnout is broader and includes fatigue from incident response, sleep disruption, and the stress of being paged. 
Alert fatigue is the leading contributor to on-call burnout.",{"title":396,"searchDepth":397,"depth":397,"links":398},"",2,[399,400,401,402,403,411,412],{"id":14,"depth":397,"text":15},{"id":25,"depth":397,"text":26},{"id":130,"depth":397,"text":131},{"id":173,"depth":397,"text":174},{"id":201,"depth":397,"text":202,"children":404},[405,407,408,409,410],{"id":206,"depth":406,"text":207},3,{"id":213,"depth":406,"text":214},{"id":220,"depth":406,"text":221},{"id":252,"depth":406,"text":253},{"id":276,"depth":406,"text":277},{"id":337,"depth":397,"text":338},{"id":358,"depth":397,"text":359,"children":413},[414,415,416,417,418],{"id":362,"depth":406,"text":363},{"id":369,"depth":406,"text":370},{"id":376,"depth":406,"text":377},{"id":383,"depth":406,"text":384},{"id":390,"depth":406,"text":391},"tutorials","2026-06-26","Answer 8 questions about your alerting setup and get a scored breakdown of signal-to-noise ratio, duplicate rate, ignore rate, and on-call load. Free, no signup required.","md",null,{},true,"\u002Fblog\u002Falert-fatigue-assessment",7,{"title":5,"description":421},"blog\u002Falert-fatigue-assessment","IuG77siEsUDDgjybqA6sxdI2r57tJCy5IMbW2jHbueY",1782490320746]