[{"data":1,"prerenderedAt":648},["ShallowReactive",2],{"\u002Fblog\u002Fmttr-mttd-mtbf-incident-metrics":3},{"id":4,"title":5,"author":6,"body":8,"category":636,"date":637,"description":638,"extension":639,"image":640,"lastUpdated":637,"meta":641,"navigation":642,"path":643,"readingTime":644,"seo":645,"stem":646,"__hash__":647},"blog\u002Fblog\u002Fmttr-mttd-mtbf-incident-metrics.md","MTTR, MTTD, MTBF: The Incident Metrics That Tell You If Your Monitoring Works",{"name":7},"Theo Cummings",{"type":9,"value":10,"toc":611},"minimark",[11,15,31,36,41,47,58,61,64,69,82,86,91,97,100,114,117,122,133,136,151,155,160,166,169,172,178,182,185,191,194,197,200,204,208,211,237,240,244,350,369,372,375,379,382,441,444,448,451,508,511,514,518,521,524,530,536,542,545,549,552,569,572,576,580,583,587,590,594,597,601,604,608],[12,13,14],"p",{},"Uptime percentage tells you how often your service is available. It does not tell you how quickly your team finds out when it goes down, how long incidents take to resolve, or whether your infrastructure is becoming more or less stable over time.",[12,16,17,18,22,23,26,27,30],{},"Three metrics fill those gaps: ",[19,20,21],"strong",{},"MTTD",", ",[19,24,25],{},"MTTR",", and ",[19,28,29],{},"MTBF",". Together they form a complete picture of your incident response capability.",[32,33,35],"h2",{"id":34},"the-three-metrics","The Three Metrics",[37,38,40],"h3",{"id":39},"mttd-mean-time-to-detect","MTTD - Mean Time to Detect",[12,42,43,46],{},[19,44,45],{},"Definition:"," The average time between an incident starting and your team receiving an alert.",[48,49,54],"pre",{"className":50,"code":52,"language":53},[51],"language-text","MTTD = Sum of (alert time - incident start time) \u002F number of incidents\n","text",[55,56,52],"code",{"__ignoreMap":57},"",[12,59,60],{},"MTTD measures the gap between when something breaks and when your monitoring catches it. 
The size of that gap depends directly on your monitoring check interval and consensus verification speed.",[12,62,63],{},"With 5-minute check intervals, the interval alone adds 0–5 minutes to each detection (2.5 minutes on average), before alert delivery time. With 1-minute intervals, that drops to 0–1 minute (30 seconds on average). With 30-second intervals and near-instant alert delivery, MTTD can fall below 1 minute for most incidents.",[12,65,66],{},[19,67,68],{},"Industry benchmarks (2024 PagerDuty State of Digital Operations report):",[70,71,72,76,79],"ul",{},[73,74,75],"li",{},"Top-performing teams: MTTD under 5 minutes",[73,77,78],{},"Average teams: MTTD 15–30 minutes",[73,80,81],{},"Reactive teams (no monitoring, discovered via customer reports): MTTD 45+ minutes",[37,83,85],{"id":84},"mttr-mean-time-to-repair-or-resolve","MTTR - Mean Time to Repair (or Resolve)",[12,87,88,90],{},[19,89,45],{}," The average time between incident detection and full service restoration.",[48,92,95],{"className":93,"code":94,"language":53},[51],"MTTR = Sum of (resolution time - detection time) \u002F number of incidents\n",[55,96,94],{"__ignoreMap":57},[12,98,99],{},"MTTR measures your team's response and recovery speed. It includes:",[70,101,102,105,108,111],{},[73,103,104],{},"Time to acknowledge the alert",[73,106,107],{},"Time to diagnose the root cause",[73,109,110],{},"Time to implement a fix",[73,112,113],{},"Time to verify recovery",[12,115,116],{},"MTTR is the metric most directly tied to downtime cost. 
A business losing $200\u002Fminute in an outage pays $200 for a 1-minute MTTR and $6,000 for a 30-minute MTTR.",[12,118,119],{},[19,120,121],{},"Industry benchmarks (2024 PagerDuty report):",[70,123,124,127,130],{},[73,125,126],{},"Top-performing teams: MTTR under 30 minutes",[73,128,129],{},"Average teams: MTTR 1–4 hours",[73,131,132],{},"Poorly instrumented teams: MTTR 4+ hours",[12,134,135],{},"The biggest MTTR reduction levers, in order of impact:",[137,138,139,142,145,148],"ol",{},[73,140,141],{},"Better observability (monitoring data that tells you where the problem is, not just that a problem exists)",[73,143,144],{},"Runbooks and playbooks for common failure modes",[73,146,147],{},"Faster on-call notification (correct escalation routing)",[73,149,150],{},"Lower MTTD (shorter detection time means more time available for diagnosis and repair within SLA windows)",[37,152,154],{"id":153},"mtbf-mean-time-between-failures","MTBF - Mean Time Between Failures",[12,156,157,159],{},[19,158,45],{}," The average time between incidents over a given period.",[48,161,164],{"className":162,"code":163,"language":53},[51],"MTBF = (Total operational time) \u002F (Number of incidents)\n",[55,165,163],{"__ignoreMap":57},[12,167,168],{},"MTBF measures reliability. A service that goes down three times in 90 days has an MTBF of 30 days. A service that goes down once in 90 days has an MTBF of 90 days.",[12,170,171],{},"Tracking MTBF over time shows whether your infrastructure is becoming more or less stable. Increasing MTBF means your changes and improvements are reducing incident frequency. Decreasing MTBF means you're introducing instability faster than you're fixing it.",[12,173,174,177],{},[19,175,176],{},"Note on MTBF in software vs. hardware:"," MTBF originated in hardware reliability engineering where it measures the time before a component physically fails. 
In software, it's better understood as a stability trend metric than a predictor of the next failure.",[32,179,181],{"id":180},"how-mttd-and-mttr-interact","How MTTD and MTTR Interact",[12,183,184],{},"MTTD and MTTR are related but independent. You can have fast detection with slow repair, or slow detection with fast repair. The combination that matters is their sum: total incident duration. Because MTTR is measured from detection, it already includes the time to acknowledge the alert.",[48,186,189],{"className":187,"code":188,"language":53},[51],"Total incident duration ≈ MTTD + MTTR\n",[55,190,188],{"__ignoreMap":57},[12,192,193],{},"For a 99.9% SLA (43 minutes of allowed monthly downtime), an incident that takes 5 minutes to detect and 25 minutes to repair consumes 30 minutes of your SLA budget. Two such incidents per month put you in breach.",[12,195,196],{},"For a 99.99% SLA (4 minutes 22 seconds allowed per month), a single incident with a 3-minute MTTD and a 4-minute MTTR lasts 7 minutes and already exceeds the month's budget.",[12,198,199],{},"This math makes the case for fast detection more concretely than uptime percentages. At the 99.99% tier, MTTD needs to be under 2 minutes.",[32,201,203],{"id":202},"calculating-your-metrics","Calculating Your Metrics",[37,205,207],{"id":206},"what-you-need","What you need",[12,209,210],{},"An incident record with timestamps:",[70,212,213,219,225,231],{},[73,214,215,218],{},[19,216,217],{},"Incident start time",": When the first check failed (from monitoring logs)",[73,220,221,224],{},[19,222,223],{},"Alert sent time",": When your monitoring tool delivered the notification",[73,226,227,230],{},[19,228,229],{},"Acknowledged time",": When an engineer confirmed they saw the alert",[73,232,233,236],{},[19,234,235],{},"Resolved time",": When service was fully restored",[12,238,239],{},"Good monitoring tools record all of these automatically. 
Vantaj's incident timeline tracks the exact start time of the first failed check (not the time the alert was sent - these differ by the consensus verification window), alert delivery time, and recovery confirmation.",[37,241,243],{"id":242},"example-calculation-3-month-period-6-incidents","Example calculation (3-month period, 6 incidents)",[245,246,247,264],"table",{},[248,249,250],"thead",{},[251,252,253,257,260,262],"tr",{},[254,255,256],"th",{},"Incident",[254,258,259],{},"Duration",[254,261,21],{},[254,263,25],{},[265,266,267,282,296,309,323,337],"tbody",{},[251,268,269,273,276,279],{},[270,271,272],"td",{},"#1",[270,274,275],{},"45 min",[270,277,278],{},"2 min",[270,280,281],{},"43 min",[251,283,284,287,290,293],{},[270,285,286],{},"#2",[270,288,289],{},"12 min",[270,291,292],{},"1 min",[270,294,295],{},"11 min",[251,297,298,301,304,306],{},[270,299,300],{},"#3",[270,302,303],{},"8 min",[270,305,292],{},[270,307,308],{},"7 min",[251,310,311,314,317,320],{},[270,312,313],{},"#4",[270,315,316],{},"3 hr 20 min",[270,318,319],{},"18 min",[270,321,322],{},"3 hr 2 min",[251,324,325,328,331,334],{},[270,326,327],{},"#5",[270,329,330],{},"22 min",[270,332,333],{},"3 min",[270,335,336],{},"19 min",[251,338,339,342,345,347],{},[270,340,341],{},"#6",[270,343,344],{},"15 min",[270,346,278],{},[270,348,349],{},"13 min",[12,351,352,353,356,359,360,363,365,366],{},"Average MTTD: (2+1+1+18+3+2) \u002F 6 = ",[19,354,355],{},"4.5 minutes",[357,358],"br",{},"\nAverage MTTR: (43+11+7+182+19+13) \u002F 6 = ",[19,361,362],{},"45.8 minutes",[357,364],{},"\nAverage incident duration: 4.5 + 45.8 = ",[19,367,368],{},"50.3 minutes per incident",[12,370,371],{},"Incident #4 is an outlier. Its 18-minute MTTD suggests the monitoring interval was too long or the failure happened in a blind spot. 
Its 3+ hour MTTR suggests no runbook existed for that failure type.",[12,373,374],{},"This is how you use these metrics: not just as averages, but as tools to find the incidents worth a postmortem.",[32,376,378],{"id":377},"what-good-metrics-look-like","What Good Metrics Look Like",[12,380,381],{},"These targets assume a production SaaS with active monitoring:",[245,383,384,400],{},[248,385,386],{},[251,387,388,391,394,397],{},[254,389,390],{},"Metric",[254,392,393],{},"Good",[254,395,396],{},"Acceptable",[254,398,399],{},"Needs Work",[265,401,402,415,428],{},[251,403,404,406,409,412],{},[270,405,21],{},[270,407,408],{},"\u003C 2 min",[270,410,411],{},"2–5 min",[270,413,414],{},"> 10 min",[251,416,417,419,422,425],{},[270,418,25],{},[270,420,421],{},"\u003C 30 min",[270,423,424],{},"30–90 min",[270,426,427],{},"> 2 hours",[251,429,430,432,435,438],{},[270,431,29],{},[270,433,434],{},"> 30 days",[270,436,437],{},"7–30 days",[270,439,440],{},"\u003C 7 days",[12,442,443],{},"Teams targeting 99.99% SLAs need MTTD well under 2 minutes, which in practice requires check intervals of 30 seconds or less and immediate alert delivery.",[32,445,447],{"id":446},"how-monitoring-configuration-affects-mttd","How Monitoring Configuration Affects MTTD",[12,449,450],{},"MTTD has a floor set by your monitoring interval plus consensus verification time. 
You cannot detect faster than you check.",[245,452,453,466],{},[248,454,455],{},[251,456,457,460,463],{},[254,458,459],{},"Check interval",[254,461,462],{},"Maximum MTTD from interval alone",[254,464,465],{},"Practical MTTD (with alert delivery)",[265,467,468,478,488,498],{},[251,469,470,473,475],{},[270,471,472],{},"5 minutes",[270,474,472],{},[270,476,477],{},"6–7 minutes",[251,479,480,483,485],{},[270,481,482],{},"1 minute",[270,484,482],{},[270,486,487],{},"1.5–2.5 minutes",[251,489,490,493,495],{},[270,491,492],{},"30 seconds",[270,494,492],{},[270,496,497],{},"45–90 seconds",[251,499,500,503,505],{},[270,501,502],{},"15 seconds",[270,504,502],{},[270,506,507],{},"30–60 seconds",[12,509,510],{},"Multi-region consensus adds a small, fixed cost (typically 10–30 seconds for verification) but eliminates false positives. A false positive that wakes up an engineer at 3 AM and takes 15 minutes to triage is worse than a slightly higher MTTD for real incidents.",[12,512,513],{},"The practical recommendation for production services with SLA commitments: check intervals of 1 minute or shorter, and 30 seconds for critical paths like authentication and payment.",[32,515,517],{"id":516},"improving-mttr-the-highest-leverage-investment","Improving MTTR: The Highest-Leverage Investment",[12,519,520],{},"Monitoring tools directly reduce MTTD. They reduce MTTR indirectly by providing better incident data.",[12,522,523],{},"The specific data points that cut diagnosis time:",[12,525,526,529],{},[19,527,528],{},"Per-region failure breakdown."," \"Frankfurt sees the failure, Virginia and Singapore do not\" immediately tells the engineer this is a routing or regional issue, not a global outage. Diagnosis time can drop from 20 minutes to 5.",[12,531,532,535],{},[19,533,534],{},"Check-by-check timeline."," The exact timestamp of the first failed check, every check result during the incident window, and the pattern of failures (did it fail once then recover? continuously? intermittently?) 
tell the engineer what kind of failure it is before they've looked at a single log.",[12,537,538,541],{},[19,539,540],{},"Response time history."," Did latency spike before the outage? That pattern points to resource exhaustion or a slow query. Did it fail instantly? That points to a network or configuration change.",[12,543,544],{},"A monitoring tool that surfaces this data at the moment of an alert cuts MTTR significantly. The engineer arrives at the incident with a diagnosis hypothesis instead of a blank slate.",[32,546,548],{"id":547},"adding-mttd-mttr-and-mtbf-to-your-postmortems","Adding MTTD, MTTR, and MTBF to Your Postmortems",[12,550,551],{},"Every significant incident postmortem should include:",[137,553,554,557,560,563,566],{},[73,555,556],{},"Actual MTTD for this incident",[73,558,559],{},"How MTTD compared to your target",[73,561,562],{},"What caused the detection gap (if MTTD was higher than usual)",[73,564,565],{},"Actions taken to reduce MTTD for this failure class in the future",[73,567,568],{},"Same analysis for MTTR",[12,570,571],{},"Over time, this practice produces a prioritized list of monitoring gaps and response improvements. The incidents with the worst MTTD tell you where your monitoring has blind spots. The incidents with the worst MTTR tell you where you need runbooks.",[32,573,575],{"id":574},"frequently-asked-questions","Frequently Asked Questions",[37,577,579],{"id":578},"what-is-the-difference-between-mttr-and-mtbf","What is the difference between MTTR and MTBF?",[12,581,582],{},"MTTR measures how long it takes to fix an incident after it is detected. MTBF measures how frequently incidents occur. MTTR is a response efficiency metric; MTBF is a reliability metric. A service can have excellent MTTR (fast repairs) but poor MTBF (frequent failures), or vice versa. Improving reliability (MTBF) requires eliminating recurring root causes. 
Improving MTTR requires better observability, playbooks, and on-call processes.",[37,584,586],{"id":585},"what-is-mtta-mean-time-to-acknowledge","What is MTTA (Mean Time to Acknowledge)?",[12,588,589],{},"MTTA is the average time between an alert being sent and an engineer acknowledging it. Some incident management tools track MTTA separately from MTTR. High MTTA indicates alerts are not reaching engineers effectively - wrong channels, alert fatigue causing engineers to ignore notifications, or on-call rotation problems.",[37,591,593],{"id":592},"how-often-should-i-measure-these-metrics","How often should I measure these metrics?",[12,595,596],{},"Record time to detect and time to repair for every incident, and track MTTD and MTTR as rolling 30-day and 90-day averages. Compute MTBF monthly, but judge it on quarterly trends: with only a few incidents per month, monthly MTBF numbers are noisy.",[37,598,600],{"id":599},"does-scheduled-maintenance-count-toward-mttr","Does scheduled maintenance count toward MTTR?",[12,602,603],{},"No. Scheduled maintenance windows, where users are notified in advance and downtime is expected, are excluded from incident metrics. MTTR applies to unplanned incidents only.",[37,605,607],{"id":606},"what-tools-track-these-metrics-automatically","What tools track these metrics automatically?",[12,609,610],{},"Monitoring tools like Vantaj record incident start times, alert delivery times, and recovery times in the incident timeline. Incident management platforms like PagerDuty and Opsgenie calculate MTTD, MTTR, and MTTA from alert and acknowledgment data. 
For complete metric tracking, you need both: a monitoring tool to detect incidents accurately, and an incident management tool to track response times.",{"title":57,"searchDepth":612,"depth":612,"links":613},2,[614,620,621,625,626,627,628,629],{"id":34,"depth":612,"text":35,"children":615},[616,618,619],{"id":39,"depth":617,"text":40},3,{"id":84,"depth":617,"text":85},{"id":153,"depth":617,"text":154},{"id":180,"depth":612,"text":181},{"id":202,"depth":612,"text":203,"children":622},[623,624],{"id":206,"depth":617,"text":207},{"id":242,"depth":617,"text":243},{"id":377,"depth":612,"text":378},{"id":446,"depth":612,"text":447},{"id":516,"depth":612,"text":517},{"id":547,"depth":612,"text":548},{"id":574,"depth":612,"text":575,"children":630},[631,632,633,634,635],{"id":578,"depth":617,"text":579},{"id":585,"depth":617,"text":586},{"id":592,"depth":617,"text":593},{"id":599,"depth":617,"text":600},{"id":606,"depth":617,"text":607},"engineering","2026-06-10","Mean time to detect, mean time to repair, mean time between failures - these three numbers reveal more about your monitoring effectiveness than uptime percentage alone. Here's how to calculate them and what good looks like.","md",null,{},true,"\u002Fblog\u002Fmttr-mttd-mtbf-incident-metrics",9,{"title":5,"description":638},"blog\u002Fmttr-mttd-mtbf-incident-metrics","LljUv9q1rTLuUcusqVt2vUBISVDrIu-aVsQTs2pcZxY",1782825221660]