What Is MTTR and How to Reduce It
Mean Time to Resolution is the metric that tells you how fast your team recovers from incidents. Here's how to calculate it, what a good MTTR looks like, and practical ways to bring yours down.
The Metric Your Investors Will Eventually Ask About
At some point - during a postmortem, a board meeting, or an enterprise sales call - someone will ask: "How long does it take you to recover from an outage?" If you don't have a number, the answer defaults to "we don't track that," which translates to "we don't take reliability seriously."
MTTR - Mean Time to Resolution - is that number. It measures the average time from when an incident starts to when the service is fully restored. It's the single best indicator of how well your team handles things when they break.
A low MTTR doesn't mean you never have outages. It means when you do, your team detects them fast, diagnoses them fast, and fixes them fast. That's the difference between a 5-minute blip that nobody notices and a 2-hour crisis that makes the news.
How to Calculate MTTR
The formula is simple:
MTTR = Total resolution time across all incidents ÷ Number of incidents
If you had 4 incidents this month with resolution times of 8 minutes, 23 minutes, 45 minutes, and 12 minutes:
MTTR = (8 + 23 + 45 + 12) ÷ 4 = 22 minutes
Your MTTR for the month is 22 minutes.
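If you keep incident durations in a script or notebook, the same calculation might look like the sketch below. The function name `mean_time_to_resolution` is illustrative and does not come from any particular tool.

```python
from datetime import timedelta


def mean_time_to_resolution(durations: list[timedelta]) -> timedelta:
    """Total resolution time divided by the number of incidents."""
    if not durations:
        raise ValueError("MTTR is undefined when there are no incidents")
    return sum(durations, timedelta()) / len(durations)


# The four incidents from the example above.
incidents = [timedelta(minutes=m) for m in (8, 23, 45, 12)]
mttr = mean_time_to_resolution(incidents)
print(mttr.total_seconds() / 60)  # 22.0
```

Raising on an empty list is a deliberate choice here: a month with no incidents has no MTTR, which is different from an MTTR of zero.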
What Counts as "Resolution Time"
Resolution time starts when the incident begins - not when your team is notified, and not when someone acknowledges the alert. It includes the entire lifecycle:
| Phase | What Happens | Included in MTTR? |
|---|---|---|
| Detection | Monitoring detects the failure | ✅ |
| Notification | Alert reaches the on-call engineer | ✅ |
| Acknowledgment | Engineer sees and responds to the alert | ✅ |
| Diagnosis | Engineer identifies the root cause | ✅ |
| Remediation | Fix is applied (rollback, restart, config change) | ✅ |
| Verification | Service is confirmed healthy | ✅ |
Every phase from first failure to confirmed recovery counts. This is why detection time matters so much - a 5-minute detection delay adds 5 minutes to every single incident's resolution time.
MTTR vs. the Other MTTx Metrics
MTTR is part of a family of incident metrics. Understanding the differences helps you know what to optimize:
| Metric | Measures | Starts At | Ends At |
|---|---|---|---|
| MTTD (Mean Time to Detect) | How fast you find out | Incident starts | First alert fires |
| MTTA (Mean Time to Acknowledge) | How fast someone responds | Alert fires | Engineer acknowledges |
| MTTR (Mean Time to Resolution) | Total recovery time | Incident starts | Service restored |
| MTBF (Mean Time Between Failures) | Reliability / stability | Previous recovery | Next incident |
MTTR is the most commonly tracked because it captures the full picture - from failure to recovery. But if your MTTR is high, breaking it down into MTTD, MTTA, and repair time tells you where the bottleneck is.
- High MTTD? Your monitoring is too slow or missing coverage.
- High MTTA? Your alerting isn't reaching the right people, or on-call fatigue is causing delays.
- High repair time? Your team lacks runbooks, rollback procedures, or access to the right systems.
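One way to find the bottleneck is to record four timestamps per incident and average each phase separately. The sketch below assumes you have those timestamps. The `Incident` fields and the sample data are made up for illustration, and "repair" here covers everything from acknowledgment to confirmed recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime       # service first failed
    alerted_at: datetime       # first alert fired
    acknowledged_at: datetime  # engineer acknowledged the alert
    resolved_at: datetime      # service confirmed healthy


def avg_minutes(deltas) -> float:
    """Average a series of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def breakdown(incidents: list[Incident]) -> dict[str, float]:
    """Average minutes spent in each phase, plus the overall MTTR."""
    return {
        "MTTD": avg_minutes(i.alerted_at - i.started_at for i in incidents),
        "MTTA": avg_minutes(i.acknowledged_at - i.alerted_at for i in incidents),
        "repair": avg_minutes(i.resolved_at - i.acknowledged_at for i in incidents),
        "MTTR": avg_minutes(i.resolved_at - i.started_at for i in incidents),
    }


def at(minutes: int) -> datetime:
    """Helper: a timestamp N minutes after a fixed, made-up failure time."""
    return datetime(2024, 1, 1, 2, 0) + timedelta(minutes=minutes)


incidents = [
    Incident(at(0), at(4), at(6), at(20)),
    Incident(at(0), at(6), at(10), at(30)),
]

result = breakdown(incidents)
phases = {name: value for name, value in result.items() if name != "MTTR"}
bottleneck = max(phases, key=phases.get)
print(result)      # MTTD 5.0, MTTA 3.0, repair 17.0, MTTR 25.0
print(bottleneck)  # repair
```

In this sample, detection and acknowledgment together account for 8 of the 25 minutes, so the largest gain would come from the repair phase.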
What's a Good MTTR?
It depends on what you're running, but here are reasonable benchmarks:
| Service Type | Good MTTR | Average MTTR | Needs Work |
|---|---|---|---|
| Revenue-critical (checkout, payments) | < 10 min | 10–30 min | > 30 min |
| Core product (app, API, auth) | < 15 min | 15–45 min | > 45 min |
| Internal tools (admin, analytics) | < 30 min | 30–90 min | > 90 min |
| Background jobs (cron, workers) | < 60 min | 1–4 hours | > 4 hours |
These benchmarks assume a team with monitoring, alerting, and basic incident procedures in place. If your MTTR for critical services is consistently under 15 minutes, your incident response is strong. If it's over an hour, there's significant room to improve.
The Five Levers That Reduce MTTR
1. Faster Detection (Reduce MTTD)
Detection time is usually the easiest lever to pull, and no other phase can start until it ends. You can't fix what you don't know about.
The difference check intervals make:
| Check Interval | Worst-Case Detection Time | Impact on MTTR |
|---|---|---|
| 5 minutes | Up to 5 minutes | Adds 2.5 min average |
| 1 minute | Up to 1 minute | Adds 30 sec average |
| 30 seconds | Up to 30 seconds | Adds 15 sec average |
Switching from 5-minute to 30-second check intervals removes an average of about 2 minutes 15 seconds (2.5 minutes minus 15 seconds) from every incident's resolution time. Across 40 incidents a year, for example, that adds up to roughly 90 minutes of downtime avoided.
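The averages in the table come from a simple assumption: a failure is equally likely to begin at any point between two checks, and the first check after the failure catches it. Under that assumption the average delay is half the interval. A quick sketch of the arithmetic, where the yearly incident count is a hypothetical figure:

```python
def detection_delay(check_interval_s: float) -> tuple[float, float]:
    """Return (average, worst-case) seconds until the next check runs.

    Assumes failures begin at a uniformly random point between checks and
    that the first check after the failure detects it.
    """
    return check_interval_s / 2, check_interval_s


for interval in (300, 60, 30):
    avg, worst = detection_delay(interval)
    print(f"{interval:>3}s interval: avg {avg:.0f}s, worst case {worst:.0f}s")

saved_per_incident = detection_delay(300)[0] - detection_delay(30)[0]
incidents_per_year = 40  # hypothetical incident count
saved_minutes = saved_per_incident * incidents_per_year / 60
print(saved_per_incident, saved_minutes)  # 135.0 seconds, 90.0 minutes
```

Real detection is often slower than this model suggests, since many monitors wait for several consecutive failed checks before alerting.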
Multi-region monitoring also reduces detection time by catching regional failures that single-region checks miss. If your service is down in Europe but up in the US, a US-only monitor won't detect it. Multi-region consensus catches it on the first check.
What to do:
- Monitor all critical endpoints, not just the homepage
- Use 30-second or 1-minute check intervals for revenue-critical services
- Add heartbeat monitoring for background jobs and cron tasks
- Monitor SSL and domain expiry to prevent certificate-related outages
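Heartbeat monitoring inverts the usual check: the job reports in after each run, and the monitor alerts when a report is late. A minimal sketch of the "is it late?" decision follows. The function name and the 5-minute grace period are assumptions for illustration, not any product's defaults.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_overdue(
    last_ping: datetime,
    expected_every: timedelta,
    grace: timedelta = timedelta(minutes=5),
    now: Optional[datetime] = None,
) -> bool:
    """True if the job has gone longer than its schedule plus grace without pinging."""
    now = now or datetime.now(timezone.utc)
    return now - last_ping > expected_every + grace


# An hourly job that last checked in at 01:00 UTC.
last_ping = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

on_time = heartbeat_overdue(
    last_ping, hourly, now=datetime(2024, 1, 1, 2, 3, tzinfo=timezone.utc)
)
missed = heartbeat_overdue(
    last_ping, hourly, now=datetime(2024, 1, 1, 2, 10, tzinfo=timezone.utc)
)
print(on_time, missed)  # False True
```

The grace period keeps a job that normally runs a few minutes long from paging anyone.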
2. Faster Notification (Reduce MTTA)
Detection is useless if the alert doesn't reach someone who can act on it. The gap between "alert fired" and "engineer is investigating" is pure waste.
Common notification bottlenecks:
- Alerts go to email, which isn't checked at 2 AM
- Alerts go to a shared Slack channel where nobody has notifications enabled
- The on-call engineer's phone is on silent
- There's no escalation policy - if the first person doesn't respond, nobody else is notified
What to do:
- Route critical alerts to Slack channels with notifications enabled
- Set up escalation policies: if no acknowledgment in 10 minutes, alert the next person
- Use multiple channels simultaneously (Slack + email + SMS for critical incidents)
- Keep on-call rotations short (1 week max) to prevent fatigue
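An escalation policy is essentially a timer and an ordered list of people. The sketch below shows the decision logic only. The names in `ESCALATION_CHAIN` are hypothetical, and actually delivering the alert is left to whatever alerting tool you use.

```python
from datetime import timedelta

# Hypothetical escalation chain; replace with your own rotation.
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "engineering-manager"]
ACK_TIMEOUT = timedelta(minutes=10)


def who_to_notify(since_alert: timedelta, acknowledged: bool) -> list[str]:
    """Return who should have been alerted so far; empty once acknowledged."""
    if acknowledged:
        return []
    # One extra level for every full timeout that passed without an acknowledgment.
    level = min(since_alert // ACK_TIMEOUT, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: level + 1]


print(who_to_notify(timedelta(minutes=3), acknowledged=False))
print(who_to_notify(timedelta(minutes=12), acknowledged=False))
print(who_to_notify(timedelta(minutes=45), acknowledged=False))
```

Capping the level at the end of the chain means the last person keeps being included rather than the alert silently running out of recipients.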
3. Faster Diagnosis
Once an engineer is looking at the problem, how quickly can they figure out what's wrong? Diagnosis is often the longest phase - and it's the hardest to optimize because every incident is different.
What slows diagnosis down:
- No context in the alert ("Monitor XYZ is down" tells you nothing about why)
- No runbooks or documented troubleshooting steps
- No access to the right systems (VPN, production database, cloud console)
- Multiple engineers investigating the same thing without coordinating
What to do:
- Include context in alerts: which service, which region, how long it's been down, recent changes
- Maintain runbooks for common failure modes (database connection exhaustion, certificate expiry, deployment rollback)
- Ensure on-call engineers have production access configured before they go on-call
- Designate an incident commander for multi-person incidents to prevent duplicate work
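To make "context in alerts" concrete, here is a rough sketch of an alert formatter that answers what is down, where, for how long, and what changed recently. The `AlertContext` fields, the service name, and the runbook URL are all placeholders.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class AlertContext:
    service: str
    regions: list[str]
    down_for: timedelta
    recent_changes: list[str] = field(default_factory=list)
    runbook_url: str = ""


def format_alert(ctx: AlertContext) -> str:
    """Render an alert that answers: what, where, how long, what changed."""
    minutes = int(ctx.down_for.total_seconds() // 60)
    lines = [
        f"[DOWN] {ctx.service}",
        f"Regions affected: {', '.join(ctx.regions)}",
        f"Down for: {minutes} min",
    ]
    if ctx.recent_changes:
        lines.append("Recent changes: " + "; ".join(ctx.recent_changes))
    if ctx.runbook_url:
        lines.append(f"Runbook: {ctx.runbook_url}")
    return "\n".join(lines)


message = format_alert(AlertContext(
    service="checkout-api",
    regions=["eu-west"],
    down_for=timedelta(minutes=4),
    recent_changes=["deploy 14 minutes before first failure"],
    runbook_url="https://example.com/runbooks/checkout-api",
))
print(message)
```

Compared with "Monitor XYZ is down", a message like this lets the on-call engineer form a first hypothesis (the recent deploy) before opening a single dashboard.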
4. Faster Remediation
The fix itself. This is where runbooks, automation, and pre-planned responses turn a 30-minute repair into a 2-minute rollback.
Common fast fixes:
- Roll back the last deployment - If the incident started right after a deploy, a rollback is almost always the right first move. Investigate after the service is restored.
- Restart the service - Connection pool exhaustion, memory leaks, and stuck processes can usually be cleared by a restart. It's not a permanent fix, but it restores service while you investigate.
- Scale up - If traffic is overwhelming capacity, add instances first, optimize later.
- Flip a feature flag - If a new feature is causing issues, disable it without a full deployment.
What to do:
- Make rollbacks one-click or one-command operations
- Document the restart procedure for every critical service
- Pre-configure auto-scaling thresholds
- Use feature flags for risky releases so they can be disabled instantly
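The feature-flag approach can be sketched in a few lines. The `FeatureFlags` class below is an in-memory stand-in written for this example. In practice the flag values would live in a config service or database so they can be changed without a deploy, and the flag name is hypothetical.

```python
class FeatureFlags:
    """In-memory stand-in for a real flag store (config service, database, etc.)."""

    def __init__(self, flags: dict[str, bool]):
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def checkout(flags: FeatureFlags) -> str:
    """Route to the risky new code path only while its flag is on."""
    if flags.is_enabled("new_checkout_flow"):
        return "new flow"
    return "stable flow"


flags = FeatureFlags({"new_checkout_flow": True})
before = checkout(flags)
flags.disable("new_checkout_flow")  # the kill switch used during an incident
after = checkout(flags)
print(before, "->", after)  # new flow -> stable flow
```

The key property is that the stable path stays in the codebase until the new one has proven itself, so turning the flag off is a remediation rather than a redeploy.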
5. Better Post-Incident Learning
This doesn't reduce MTTR for the current incident - it reduces it for future ones. Teams that run postmortems and update their runbooks after each incident tend to see their MTTR fall over time, because each recurring failure mode ends up with a documented fix.
What to do:
- Run a blameless postmortem for every incident that lasts longer than 15 minutes
- Update runbooks with new failure modes and their fixes
- Track MTTR trends monthly - are you getting faster or slower?
- Identify recurring incidents and invest in permanent fixes
Tracking MTTR Over Time
MTTR as a single number is useful. MTTR as a trend is powerful. Track it monthly and look for patterns:
- Improving trend - Your incident response is maturing. Detection is faster, runbooks are working, the team is learning.
- Flat trend - You've plateaued. Look for the bottleneck (detection, notification, diagnosis, or repair) and focus improvement there.
- Worsening trend - Something is degrading. Common causes: growing complexity without updated runbooks, alert fatigue causing slower response, team growth without incident training.
Break your MTTR down by service to find outliers. If your API's MTTR is 8 minutes but your payment service's MTTR is 45 minutes, the payment service needs dedicated attention - better monitoring, specific runbooks, or faster rollback procedures.
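Both views, by month and by service, are the same grouping operation over your incident log. A minimal sketch, assuming each record is a `(month, service, resolution minutes)` tuple, with sample data invented to match the example above:

```python
from collections import defaultdict
from statistics import mean

# (month, service, resolution minutes) - made-up sample data
incidents = [
    ("2024-01", "api", 10),
    ("2024-01", "payments", 50),
    ("2024-02", "api", 6),
    ("2024-02", "payments", 40),
]


def mttr_by(records, index: int) -> dict:
    """Group records by the field at `index` and average their resolution minutes."""
    groups = defaultdict(list)
    for record in records:
        groups[record[index]].append(record[2])
    return {key: mean(minutes) for key, minutes in groups.items()}


by_month = mttr_by(incidents, 0)    # 2024-01: 30, 2024-02: 23
by_service = mttr_by(incidents, 1)  # api: 8, payments: 45
print(by_month)
print(by_service)
```

Here the monthly trend is improving (30 to 23 minutes), yet the per-service view shows that payments is still the outlier.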
Common Mistakes When Measuring MTTR
Excluding Detection Time
Some teams start the clock when the alert fires, not when the incident actually begins. This hides the most improvable phase - detection - and makes MTTR look artificially low. Start the clock when the service first fails, whether or not your monitoring caught it immediately.
Averaging Across All Severity Levels
Mixing 2-minute blips with 4-hour major incidents produces a meaningless average. Track MTTR separately for critical, major, and minor incidents. A critical MTTR of 45 minutes is very different from a minor MTTR of 45 minutes.
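A small worked example shows how much a blended average can hide. The incident counts here are invented for illustration:

```python
from statistics import mean

minor = [2] * 9    # nine 2-minute blips
critical = [240]   # one 4-hour outage

blended = mean(minor + critical)
print(blended)         # 25.8
print(mean(minor))     # 2
print(mean(critical))  # 240
```

The blended figure of about 26 minutes describes none of the ten incidents: it is 13 times longer than the typical blip and roughly a tenth of the outage that actually hurt.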
Not Tracking It at All
If you're not measuring MTTR, you can't improve it. You also can't answer questions about your reliability from investors, enterprise prospects, or your own leadership. Start tracking it - even manually - today.
How Vantaj Helps Reduce MTTR
Most phases of MTTR map to a Vantaj capability, and so does the tracking itself:
| MTTR Phase | How Vantaj Helps |
|---|---|
| Detection | 30-second check intervals, multi-region consensus, heartbeat monitoring for background jobs |
| Notification | Instant alerts via Slack, email, Discord, webhooks - with escalation support |
| Diagnosis | Incident timeline shows exactly when the failure started, which regions are affected, and response time trends leading up to the outage |
| Verification | Auto-recovery detection - when the monitor recovers, the incident closes automatically with precise duration |
| Tracking | Every incident is logged with start time, duration, and resolution - giving you the data to calculate MTTR without spreadsheets |
Vantaj doesn't just monitor uptime - it builds the incident record that makes MTTR measurable and improvable. When every incident has an automatic start time, a resolution time, and a duration, tracking MTTR goes from a manual exercise to a dashboard you check monthly.