What Is MTTR and How to Reduce It

Mean Time to Resolution is the metric that tells you how fast your team recovers from incidents. Here's how to calculate it, what a good MTTR looks like, and practical ways to bring yours down.

Vantaj Team · June 20, 2026 · 9 min read

The Metric Your Investors Will Eventually Ask About

At some point - during a postmortem, a board meeting, or an enterprise sales call - someone will ask: "How long does it take you to recover from an outage?" If you don't have a number, the answer defaults to "we don't track that," which translates to "we don't take reliability seriously."

MTTR - Mean Time to Resolution - is that number. It measures the average time from when an incident starts to when the service is fully restored. It's the single best indicator of how well your team handles things when they break.

A low MTTR doesn't mean you never have outages. It means when you do, your team detects them fast, diagnoses them fast, and fixes them fast. That's the difference between a 5-minute blip that nobody notices and a 2-hour crisis that makes the news.

How to Calculate MTTR

The formula is simple:

MTTR = Total resolution time across all incidents ÷ Number of incidents

If you had 4 incidents this month with resolution times of 8 minutes, 23 minutes, 45 minutes, and 12 minutes:

MTTR = (8 + 23 + 45 + 12) ÷ 4 = 22 minutes

Your MTTR for the month is 22 minutes.
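
The same arithmetic as a minimal Python sketch. The function name is illustrative, and durations are assumed to be recorded in minutes:

```python
def mean_time_to_resolution(durations_min):
    """Return the mean resolution time in minutes for a list of incidents."""
    if not durations_min:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(durations_min) / len(durations_min)


# The four incidents from the example above.
print(mean_time_to_resolution([8, 23, 45, 12]))  # 22.0
```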

What Counts as "Resolution Time"

Resolution time starts when the incident begins - not when your team is notified, and not when someone acknowledges the alert. It includes the entire lifecycle:

Phase | What Happens | Included in MTTR?
Detection | Monitoring detects the failure | Yes
Notification | Alert reaches the on-call engineer | Yes
Acknowledgment | Engineer sees and responds to the alert | Yes
Diagnosis | Engineer identifies the root cause | Yes
Remediation | Fix is applied (rollback, restart, config change) | Yes
Verification | Service is confirmed healthy | Yes

Every phase from first failure to confirmed recovery counts. This is why detection time matters so much - a 5-minute detection delay adds 5 minutes to every single incident's resolution time.

MTTR vs. the Other MTTx Metrics

MTTR is part of a family of incident metrics. Understanding the differences helps you know what to optimize:

Metric | Measures | Starts At | Ends At
MTTD (Mean Time to Detect) | How fast you find out | Incident starts | First alert fires
MTTA (Mean Time to Acknowledge) | How fast someone responds | Alert fires | Engineer acknowledges
MTTR (Mean Time to Resolution) | Total recovery time | Incident starts | Service restored
MTBF (Mean Time Between Failures) | Reliability / stability | Previous recovery | Next incident

MTTR is the most commonly tracked because it captures the full picture - from failure to recovery. But if your MTTR is high, breaking it down into MTTD, MTTA, and repair time tells you where the bottleneck is.

  • High MTTD? Your monitoring is too slow or missing coverage.
  • High MTTA? Your alerting isn't reaching the right people, or on-call fatigue is causing delays.
  • High repair time? Your team lacks runbooks, rollback procedures, or access to the right systems.
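
One way to get that breakdown is to record four timestamps per incident and average the gaps between them. This is a rough sketch, not any particular tool's schema: the `Incident` fields and the `mttx_breakdown` name are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical fields; adapt to whatever your incident log records.
    started_at: datetime       # service first failed
    alerted_at: datetime       # first alert fired
    acknowledged_at: datetime  # engineer acknowledged the alert
    restored_at: datetime      # service confirmed healthy


def _minutes(start, end):
    return (end - start).total_seconds() / 60


def mttx_breakdown(incidents):
    """Return mean detect, acknowledge, repair, and resolution times in minutes."""
    return {
        "mttd": mean([_minutes(i.started_at, i.alerted_at) for i in incidents]),
        "mtta": mean([_minutes(i.alerted_at, i.acknowledged_at) for i in incidents]),
        "repair": mean([_minutes(i.acknowledged_at, i.restored_at) for i in incidents]),
        "mttr": mean([_minutes(i.started_at, i.restored_at) for i in incidents]),
    }


example = Incident(
    started_at=datetime(2026, 6, 1, 10, 0),
    alerted_at=datetime(2026, 6, 1, 10, 3),
    acknowledged_at=datetime(2026, 6, 1, 10, 5),
    restored_at=datetime(2026, 6, 1, 10, 20),
)
print(mttx_breakdown([example]))
```

For this single incident the three parts (3 minutes to detect, 2 to acknowledge, 15 to repair) add up to the 20-minute resolution time, which makes it easy to see which phase dominates.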

What's a Good MTTR?

It depends on what you're running, but here are reasonable benchmarks:

Service Type | Good MTTR | Average MTTR | Needs Work
Revenue-critical (checkout, payments) | < 10 min | 10–30 min | > 30 min
Core product (app, API, auth) | < 15 min | 15–45 min | > 45 min
Internal tools (admin, analytics) | < 30 min | 30–90 min | > 90 min
Background jobs (cron, workers) | < 60 min | 1–4 hours | > 4 hours

These benchmarks assume a team with monitoring, alerting, and basic incident procedures in place. If your MTTR for critical services is consistently under 15 minutes, your incident response is strong. If it's over an hour, there's significant room to improve.
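
If you want to check your own numbers against these benchmarks automatically, a small lookup is enough. The thresholds below are copied from the table above; the dictionary keys and the `rate_mttr` name are made up for this sketch.

```python
# (good_below, needs_work_above) in minutes, taken from the benchmark table.
BENCHMARKS = {
    "revenue_critical": (10, 30),
    "core_product": (15, 45),
    "internal_tools": (30, 90),
    "background_jobs": (60, 240),
}


def rate_mttr(service_type, mttr_min):
    """Rate an MTTR in minutes as 'good', 'average', or 'needs work'."""
    good_below, needs_work_above = BENCHMARKS[service_type]
    if mttr_min < good_below:
        return "good"
    if mttr_min <= needs_work_above:
        return "average"
    return "needs work"


print(rate_mttr("revenue_critical", 22))  # average
print(rate_mttr("core_product", 8))       # good
```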

The Five Levers That Reduce MTTR

1. Faster Detection (Reduce MTTD)

Detection time is the single biggest lever. You can't fix what you don't know about.

The difference check intervals make:

Check Interval | Worst-Case Detection Time | Impact on MTTR
5 minutes | Up to 5 minutes | Adds 2.5 min average
1 minute | Up to 1 minute | Adds 30 sec average
30 seconds | Up to 30 seconds | Adds 15 sec average

Switching from 5-minute to 30-second check intervals removes about 2 minutes 15 seconds, on average, from every incident's MTTR. Over dozens of incidents per year, that adds up to an hour or more of downtime avoided.
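
The averages in the check-interval table come from a simple assumption: a failure is equally likely to start at any point between two checks, so on average you wait half an interval before the next check sees it. A quick sketch of that arithmetic, assuming the alert fires on the first failed check (function names are illustrative):

```python
def average_detection_delay_s(check_interval_s):
    """Average wait until the next check, assuming failures start at a
    uniformly random point within the check interval."""
    return check_interval_s / 2


def minutes_saved_per_incident(old_interval_s, new_interval_s):
    saved_s = average_detection_delay_s(old_interval_s) - average_detection_delay_s(new_interval_s)
    return saved_s / 60


per_incident = minutes_saved_per_incident(300, 30)
print(per_incident)       # 2.25
print(per_incident * 36)  # 81.0, for a hypothetical 36 incidents a year
```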

Multi-region monitoring also reduces detection time by catching regional failures that single-region checks miss. If your service is down in Europe but up in the US, a US-only monitor won't detect it. Multi-region consensus catches it on the first check.
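
To make the regional case concrete, here is a rough sketch of how per-region check results might be combined. It is an illustration under simple assumptions, not a description of how any specific monitoring product reaches consensus; the region names and `classify_check` are hypothetical.

```python
def classify_check(results):
    """results maps region name -> True if the check passed from that region.

    Returns a status string and the sorted list of failing regions.
    """
    failing = sorted(region for region, ok in results.items() if not ok)
    if not failing:
        return "up", failing
    if len(failing) == len(results):
        return "down everywhere", failing
    return "regional failure", failing


# A US-only monitor would report "up" here and miss the European outage.
print(classify_check({"us-east": True, "eu-west": False, "ap-south": True}))
# ('regional failure', ['eu-west'])
```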

What to do:

  • Monitor all critical endpoints, not just the homepage
  • Use 30-second or 1-minute check intervals for revenue-critical services
  • Add heartbeat monitoring for background jobs and cron tasks
  • Monitor SSL and domain expiry to prevent certificate-related outages
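
Heartbeat monitoring flips the usual model: instead of the monitor calling your service, the job pings the monitor each time it runs, and silence is the alarm. A minimal sketch of the "did we miss a ping?" check, where the one-minute grace period is an assumed default:

```python
from datetime import datetime, timedelta


def heartbeat_missed(last_ping, now, expected_every, grace=timedelta(minutes=1)):
    """True if the job has gone longer than its schedule plus grace without a ping."""
    return now - last_ping > expected_every + grace


last_ping = datetime(2026, 6, 20, 2, 0)
hourly = timedelta(hours=1)
print(heartbeat_missed(last_ping, datetime(2026, 6, 20, 3, 0, 30), hourly))  # False
print(heartbeat_missed(last_ping, datetime(2026, 6, 20, 3, 5), hourly))      # True
```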

2. Faster Notification (Reduce MTTA)

Detection is useless if the alert doesn't reach someone who can act on it. The gap between "alert fired" and "engineer is investigating" is pure waste.

Common notification bottlenecks:

  • Alerts go to email, which isn't checked at 2 AM
  • Alerts go to a shared Slack channel where nobody has notifications enabled
  • The on-call engineer's phone is on silent
  • There's no escalation policy - if the first person doesn't respond, nobody else is notified

What to do:

  • Route critical alerts to Slack channels with notifications enabled
  • Set up escalation policies: if no acknowledgment in 10 minutes, alert the next person
  • Use multiple channels simultaneously (Slack + email + SMS for critical incidents)
  • Keep on-call rotations short (1 week max) to prevent fatigue
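
An escalation policy boils down to a small rule: the longer an alert goes unacknowledged, the further down the chain it reaches. A sketch of that rule, using the 10-minute step from the list above; the chain contents and the `who_to_notify` name are placeholders:

```python
def who_to_notify(minutes_since_alert, acknowledged, chain, escalate_after_min=10):
    """Return everyone who should have been alerted by now, in escalation order."""
    if acknowledged or not chain:
        return []
    # One more person is added for every full escalation window that passes.
    level = int(minutes_since_alert // escalate_after_min)
    return chain[: min(level, len(chain) - 1) + 1]


chain = ["primary on-call", "secondary on-call", "engineering manager"]
print(who_to_notify(4, False, chain))   # ['primary on-call']
print(who_to_notify(12, False, chain))  # ['primary on-call', 'secondary on-call']
print(who_to_notify(12, True, chain))   # []
```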

3. Faster Diagnosis

Once an engineer is looking at the problem, how quickly can they figure out what's wrong? Diagnosis is often the longest phase - and it's the hardest to optimize because every incident is different.

What slows diagnosis down:

  • No context in the alert ("Monitor XYZ is down" tells you nothing about why)
  • No runbooks or documented troubleshooting steps
  • No access to the right systems (VPN, production database, cloud console)
  • Multiple engineers investigating the same thing without coordinating

What to do:

  • Include context in alerts: which service, which region, how long it's been down, recent changes
  • Maintain runbooks for common failure modes (database connection exhaustion, certificate expiry, deployment rollback)
  • Ensure on-call engineers have production access configured before they go on-call
  • Designate an incident commander for multi-person incidents to prevent duplicate work
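
As an illustration of what "context in the alert" can look like, here is a small formatter that puts service, region, duration, and recent changes into the message itself. The field names and layout are assumptions; adapt them to whatever your alerting channel supports.

```python
def format_alert(service, region, down_for_min, recent_changes):
    """Build an alert message that answers the first questions an engineer asks."""
    lines = [
        f"[DOWN] {service} in {region}",
        f"Down for: {down_for_min} min",
    ]
    if recent_changes:
        lines.append("Recent changes:")
        lines.extend(f"  - {change}" for change in recent_changes)
    else:
        lines.append("Recent changes: none recorded")
    return "\n".join(lines)


print(format_alert("checkout-api", "eu-west", 3, ["deploy 14 min ago"]))
```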

4. Faster Remediation

The fix itself. This is where runbooks, automation, and pre-planned responses turn a 30-minute repair into a 2-minute rollback.

Common fast fixes:

  • Rollback the last deployment - If the incident started right after a deploy, rollback is almost always the right first move. Investigate after the service is restored.
  • Restart the service - Connection pool exhaustion, memory leaks, and stuck processes can usually be cleared by a restart. It's not a permanent fix, but it restores service while you investigate.
  • Scale up - If traffic is overwhelming capacity, add instances first, optimize later.
  • Flip a feature flag - If a new feature is causing issues, disable it without a full deployment.

What to do:

  • Make rollbacks one-click or one-command operations
  • Document the restart procedure for every critical service
  • Pre-configure auto-scaling thresholds
  • Use feature flags for risky releases so they can be disabled instantly
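
The "rollback first" rule is simple enough to encode in a runbook helper. This sketch suggests a rollback when the incident began shortly after the last deploy; the 15-minute window is an assumption to tune for your own release cadence, and `suggest_first_move` is a hypothetical name.

```python
from datetime import datetime, timedelta


def suggest_first_move(incident_start, last_deploy, window=timedelta(minutes=15)):
    """Suggest 'rollback' when the incident began shortly after a deploy."""
    if last_deploy is not None:
        since_deploy = incident_start - last_deploy
        if timedelta(0) <= since_deploy <= window:
            return "rollback"
    return "investigate"


deploy = datetime(2026, 6, 20, 14, 0)
print(suggest_first_move(datetime(2026, 6, 20, 14, 6), deploy))  # rollback
print(suggest_first_move(datetime(2026, 6, 20, 18, 0), deploy))  # investigate
```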

5. Better Post-Incident Learning

This doesn't reduce MTTR for the current incident - it reduces it for every future incident. Teams that run postmortems and update their runbooks after each incident see their MTTR decrease over time.

What to do:

  • Run a blameless postmortem for every incident over 15 minutes
  • Update runbooks with new failure modes and their fixes
  • Track MTTR trends monthly - are you getting faster or slower?
  • Identify recurring incidents and invest in permanent fixes
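
Two of these habits are easy to automate from an incident log: flagging incidents that crossed the postmortem threshold, and spotting causes that keep coming back. A minimal sketch, assuming each incident is logged as a (cause, duration in minutes) pair; the cause labels are invented examples.

```python
from collections import Counter


def needs_postmortem(duration_min, threshold_min=15):
    """True for incidents that lasted longer than the postmortem threshold."""
    return duration_min > threshold_min


def recurring_causes(incidents, min_count=2):
    """Return (cause, count) pairs for causes seen at least min_count times."""
    counts = Counter(cause for cause, _ in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


log = [
    ("db connection exhaustion", 22),
    ("certificate expiry", 40),
    ("db connection exhaustion", 9),
]
print(recurring_causes(log))  # [('db connection exhaustion', 2)]
print([cause for cause, minutes in log if needs_postmortem(minutes)])
```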

Tracking MTTR Over Time

MTTR as a single number is useful. MTTR as a trend is powerful. Track it monthly and look for patterns:

  • Improving trend - Your incident response is maturing. Detection is faster, runbooks are working, the team is learning.
  • Flat trend - You've plateaued. Look for the bottleneck (detection, notification, diagnosis, or repair) and focus improvement there.
  • Worsening trend - Something is degrading. Common causes: growing complexity without updated runbooks, alert fatigue causing slower response, team growth without incident training.

Break your MTTR down by service to find outliers. If your API's MTTR is 8 minutes but your payment service's MTTR is 45 minutes, the payment service needs dedicated attention - better monitoring, specific runbooks, or faster rollback procedures.
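
Both views, the monthly trend and the per-service breakdown, are the same group-and-average operation on different keys. A sketch with made-up incident records (the dictionary keys are assumptions about how your log is shaped):

```python
from collections import defaultdict


def mttr_by(incidents, key):
    """Group incidents by the given field and return the MTTR in minutes per group."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident[key]].append(incident["duration_min"])
    return {name: sum(d) / len(d) for name, d in groups.items()}


incidents = [
    {"month": "2026-04", "service": "api", "duration_min": 6},
    {"month": "2026-04", "service": "payments", "duration_min": 50},
    {"month": "2026-05", "service": "api", "duration_min": 10},
    {"month": "2026-05", "service": "payments", "duration_min": 40},
]
print(mttr_by(incidents, "service"))  # {'api': 8.0, 'payments': 45.0}
print(mttr_by(incidents, "month"))    # {'2026-04': 28.0, '2026-05': 25.0}
```

In this sample the monthly numbers look fine on their own; only the per-service view shows that payments is the outlier.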

Common Mistakes When Measuring MTTR

Excluding Detection Time

Some teams start the clock when the alert fires, not when the incident actually begins. This hides the most improvable phase - detection - and makes MTTR look artificially low. Start the clock when the service first fails, whether or not your monitoring caught it immediately.

Averaging Across All Severity Levels

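Mixing 2-minute blips with 4-hour major incidents produces a meaningless average. For example, nine 2-minute blips and one 4-hour incident average out to about 26 minutes, which describes neither kind of incident. Track MTTR separately for critical, major, and minor incidents. A critical MTTR of 45 minutes is very different from a minor MTTR of 45 minutes.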

Not Tracking It at All

If you're not measuring MTTR, you can't improve it. You also can't answer questions about your reliability from investors, enterprise prospects, or your own leadership. Start tracking it - even manually - today.

How Vantaj Helps Reduce MTTR

Every phase of MTTR maps to a Vantaj capability:

MTTR Phase | How Vantaj Helps
Detection | 30-second check intervals, multi-region consensus, heartbeat monitoring for background jobs
Notification | Instant alerts via Slack, email, Discord, webhooks - with escalation support
Diagnosis | Incident timeline shows exactly when the failure started, which regions are affected, and response time trends leading up to the outage
Verification | Auto-recovery detection - when the monitor recovers, the incident closes automatically with precise duration
Tracking | Every incident is logged with start time, duration, and resolution - giving you the data to calculate MTTR without spreadsheets

Vantaj doesn't just monitor uptime - it builds the incident record that makes MTTR measurable and improvable. When every incident has an automatic start time, a resolution time, and a duration, tracking MTTR goes from a manual exercise to a dashboard you check monthly.