SLI, SLO, and SLA: What They Mean and How to Implement Them
SLIs measure what your service actually does. SLOs define the target you're aiming for. SLAs are the contractual commitments you make to customers. Understanding the difference is the foundation of reliability engineering.
SLI (Service Level Indicator) is a quantitative measurement of your service's behavior: request success rate, latency, availability percentage. An SLI answers the question "what is the service doing right now?"
SLO (Service Level Objective) is the target you set for an SLI: "99.9% of requests succeed," "p99 latency stays under 500ms." An SLO is an internal commitment to your team.
SLA (Service Level Agreement) is a contract with your customers that defines minimum service levels and the consequences (typically credits) when you fall short. An SLA is an external commitment.
Teams often confuse these three because the terms sound similar and all involve percentages. Keeping them distinct is what makes the rest of a reliability program work.
The Relationship Between SLI, SLO, and SLA
Think of them as a hierarchy:
SLI → what you measure
SLO → what you target internally
SLA → what you promise externally
Your SLO should always be stricter than your SLA. If your SLA promises 99.9% uptime, your internal SLO should target 99.95% or better. The buffer between your SLO and SLA is your margin: the space that absorbs unexpected incidents before you breach your customer commitment.
A team that sets their SLO equal to their SLA is one bad week away from a contract violation.
SLIs: Choosing What to Measure
An SLI only matters if it reflects something your users actually experience. The most common mistake is measuring what's easy to collect (CPU usage, request count) rather than what indicates user experience.
Four SLI categories that map to user experience:
| Category | What it measures | Example |
|---|---|---|
| Availability | Is the service responding at all? | % of HTTP requests returning 2xx or 3xx |
| Latency | Is it fast enough? | % of requests completing under 500ms |
| Error rate | Is it returning correct results? | % of requests not returning 5xx errors |
| Throughput | Is it handling the load? | Requests processed per second vs. expected |
Availability SLI formula:
Availability = (Successful requests / Total requests) × 100
Where "successful" means the response was correct: right status code, valid content, within acceptable time.
Latency SLI formula:
Latency SLI = (Requests completing under threshold / Total requests) × 100
Most teams track p50 (median), p95, and p99 latency. The p99 is the latency that 99% of requests stay under, so it exposes what the slowest 1% of requests experience. It matters most for user experience because those are the users who always seem to be on your support tickets.
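The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `Request` record and its fields are hypothetical, and "successful" is simplified to "returned a 2xx or 3xx status" rather than the fuller definition (correct content, acceptable time) given earlier.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that returned a 2xx or 3xx status."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return 100 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 500) -> float:
    """Percentage of requests that completed under the latency threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100 * fast / len(requests)


sample = [
    Request(200, 120), Request(200, 340), Request(301, 90),
    Request(500, 30), Request(200, 800),
]
print(availability_sli(sample))  # 80.0 (one 5xx out of five requests)
print(latency_sli(sample))       # 80.0 (one request over 500 ms)
```

Note that the failed request in the sample is also the fastest one, which is why availability and latency are worth tracking as separate SLIs.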
Choosing SLI Windows
SLIs are measured over a window of time. Common windows:
- Rolling 7 days: Most sensitive; recent incidents weigh heavily
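- Rolling 28–30 days: Standard for SLO tracking; a 28-day window always covers four full weeks, so weekday and weekend traffic are weighted the same in every window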
- Calendar month: Aligns with billing cycles; simplest to communicate to customers
Most teams use a 28-day rolling window for their SLO tracking and a calendar month for SLA reporting.
SLOs: Setting the Right Targets
A well-chosen SLO is ambitious enough to require real engineering effort but realistic enough to actually achieve. Too high and you spend all your time on reliability instead of features. Too low and outages become normalized.
The 99.9% Question
99.9% availability allows 43.2 minutes of downtime in a 30-day month. For most B2B SaaS applications, this is a reasonable starting point.
| Uptime target | Downtime per 30-day month | Suitable for |
|---|---|---|
| 99% | 7 hours 12 min | Non-critical internal tools |
| 99.5% | 3 hours 36 min | Lower-criticality B2B features |
| 99.9% | 43.2 minutes | Standard B2B SaaS |
| 99.95% | 21.6 minutes | SLA-driven enterprise |
| 99.99% | 4.3 minutes | Payment processors, auth systems |
For a full breakdown of nines with exact downtime calculations, see SLA nines explained.
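If you want to derive downtime allowances for other targets, the arithmetic is short. The sketch below assumes a 30-day month; an average-length month of about 30.44 days gives slightly larger figures (for example, roughly 43.8 minutes at 99.9%). The function name is illustrative.

```python
def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of downtime a target permits over a window of `days` days."""
    window_minutes = days * 24 * 60
    return (100 - target_percent) / 100 * window_minutes


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes")
```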
SLO Tiers for Different Services
Not all services in your product deserve the same SLO. A tiered approach matches engineering investment to actual user impact:
Tier 1 (99.95%+ SLO): Authentication, checkout, payment processing, primary API. These are in the critical path of user value.
Tier 2 (99.9% SLO): Core product features, data ingestion, user dashboard. Important but not immediately transactional.
Tier 3 (99.5% SLO): Admin interfaces, reporting, analytics dashboards. Degradation is noticed but not immediately business-critical.
Tier 4 (99% SLO): Internal tools, non-customer-facing services. Lower stakes.
Documenting SLOs
An SLO document should specify:
- The SLI being measured (what metric, from where)
- The target percentage
- The measurement window (28-day rolling is standard)
- The exclusions (planned maintenance, force majeure)
- The owner (which team is responsible)
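One way to keep these fields together is a small typed record that refuses incomplete entries. This is only a sketch of the checklist above; the class name, field names, and the example service are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODocument:
    """Hypothetical record holding the fields an SLO document should specify."""
    service: str
    sli: str                  # what metric, measured from where
    target_percent: float     # e.g. 99.9
    owner: str                # team responsible for the SLO
    window_days: int = 28     # rolling measurement window
    exclusions: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if not 0 < self.target_percent < 100:
            raise ValueError("target_percent must be between 0 and 100")
        if not self.owner.strip():
            raise ValueError("every SLO needs an owning team")


checkout_slo = SLODocument(
    service="checkout-api",
    sli="% of HTTP requests returning 2xx or 3xx, from external probes",
    target_percent=99.95,
    owner="payments-team",
    exclusions=("planned maintenance", "force majeure"),
)
print(checkout_slo)
```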
Error Budgets: Turning SLOs into Engineering Decisions
An error budget is the amount of downtime or errors your SLO permits over the measurement window. It's the practical tool that makes SLOs actionable.
Error budget formula:
Error budget = (1 - SLO target) × measurement window in minutes
For a 99.9% SLO over 30 days (43,200 minutes):
Error budget = (1 - 0.999) × 43,200 = 43.2 minutes
You have 43.2 minutes of downtime available this month before breaching your SLO.
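The same calculation in code, extended to show how much budget is left after some downtime has already occurred. This is a minimal sketch; the function names are illustrative, and the SLO target is passed as a fraction (0.999) rather than a percentage.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total error budget in minutes for the measurement window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


print(f"{error_budget_minutes(0.999):.1f} minutes of budget")
print(f"{budget_remaining(0.999, downtime_minutes=10.8):.0%} remaining")
```

With 10.8 minutes of downtime against a 43.2-minute budget, a quarter of the budget is spent and three quarters remain.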
Using Error Budgets to Guide Decisions
The error budget creates a natural feedback loop between reliability and feature work:
- Budget is healthy (>50% remaining): Engineering can take more risk. Deploy features aggressively, accept more experimental changes.
- Budget is moderate (20–50% remaining): Moderate caution. Review deployment frequency and rollback practices.
- Budget is low (<20%): Slow down. Focus engineering effort on reliability, not features. Postpone risky changes.
- Budget is exhausted: Full stop on new deployments until reliability is restored. Conduct a postmortem. Run a reliability sprint.
This is not a manual decision - it's a policy. Teams that define the policy in advance avoid the arguments about "is now a good time to deploy?" during stressful moments.
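Writing the policy down as code is one way to make it unambiguous. The sketch below maps the remaining budget to a policy label. Treating everything from 20% up to and including 50% as the moderate-caution band is an assumption about where the boundaries fall, and the labels are illustrative.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map remaining error budget (fraction, 1.0 = untouched) to a policy label."""
    if budget_remaining <= 0:
        return "freeze"      # budget exhausted: stop deployments, run a postmortem
    if budget_remaining < 0.20:
        return "slow down"   # focus on reliability, postpone risky changes
    if budget_remaining <= 0.50:
        return "caution"     # review deployment frequency and rollback practices
    return "normal"          # healthy budget: ship features, accept more risk


for remaining in (0.80, 0.50, 0.10, 0.0):
    print(f"{remaining:.0%} remaining -> {deployment_policy(remaining)}")
```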
SLAs: Making External Commitments
An SLA is a legal agreement. Before offering one, you need to know two things: what your SLO actually is (so you know what you can promise) and what consequences your business can accept (so you know what credits to offer).
SLA Credit Structures
Most SaaS SLAs offer monthly service credits as compensation when uptime falls below the promised level. For an SLA that promises 99.9%:
| Monthly uptime achieved | Credit |
|---|---|
| 99.0% to below 99.9% | 10% of monthly fee |
| 95.0% to below 99.0% | 25% of monthly fee |
| Below 95.0% | 50% of monthly fee |
This is a common structure. Your legal team and pricing model determine the specific thresholds and percentages.
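As an illustration, the credit tiers in the table can be expressed as a lookup. The sketch assumes the SLA promises 99.9% uptime and that a month landing exactly on a boundary (for example 99.0%) falls into the smaller credit tier; both are assumptions your actual contract would need to spell out.

```python
def sla_credit_percent(uptime_percent: float) -> int:
    """Service credit as a percentage of the monthly fee (assumed 99.9% SLA)."""
    if uptime_percent >= 99.9:
        return 0    # SLA met, no credit owed
    if uptime_percent >= 99.0:
        return 10
    if uptime_percent >= 95.0:
        return 25
    return 50


def credit_amount(monthly_fee: float, uptime_percent: float) -> float:
    """Credit owed in the same currency as the monthly fee."""
    return monthly_fee * sla_credit_percent(uptime_percent) / 100


print(credit_amount(2000, 99.5))  # 200.0
```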
What SLAs Should Exclude
Standard SLA exclusions:
- Scheduled maintenance windows (with adequate advance notice, typically 24–72 hours)
- Force majeure events
- Customer-caused outages (misconfigured webhooks, API abuse)
- Third-party provider outages outside your control
- Failures caused by beta or preview features
Measuring SLA Compliance
You cannot report on SLA compliance without monitoring data. You need:
- External monitoring that tracks availability from outside your infrastructure (synthetic probes, not internal health checks)
- Per-incident duration data with accurate start and end timestamps
- Planned maintenance records to exclude from availability calculations
- Historical data retention long enough to cover your reporting period
Internal metrics from APM tools or application logs are insufficient for SLA reporting because they only capture what your application sees. An outage caused by DNS failure, network connectivity, or a broken load balancer may not appear in application metrics at all. External monitoring from multiple geographic locations is the only source that captures the same view your customers have.
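Once you have incident timestamps and maintenance records, the compliance figure itself is simple interval arithmetic. The sketch below makes several assumptions worth checking against your SLA terms: incidents do not overlap one another, maintenance windows do not overlap one another, and excluded maintenance time is removed from the downtime only, not from the total period. All names are illustrative.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]  # (start, end)


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of time shared by two intervals (zero if they do not touch)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def sla_uptime_percent(period: Interval, incidents: list[Interval],
                       maintenance: list[Interval]) -> float:
    """Uptime over the period, ignoring downtime inside maintenance windows."""
    total = period[1] - period[0]
    downtime = timedelta(0)
    for incident in incidents:
        counted = overlap(incident, period)  # clip to the reporting period
        if counted == timedelta(0):
            continue
        clipped = (max(incident[0], period[0]), min(incident[1], period[1]))
        excluded = sum((overlap(clipped, m) for m in maintenance), timedelta(0))
        downtime += counted - excluded
    return (total - downtime) / total * 100


period = (datetime(2024, 4, 1), datetime(2024, 5, 1))
incidents = [
    (datetime(2024, 4, 10, 14, 0), datetime(2024, 4, 10, 14, 30)),
    (datetime(2024, 4, 20, 2, 0), datetime(2024, 4, 20, 3, 0)),
]
maintenance = [(datetime(2024, 4, 20, 2, 0), datetime(2024, 4, 20, 2, 45))]

uptime = sla_uptime_percent(period, incidents, maintenance)
print(f"{uptime:.3f}%")
```

In the example, the second incident lasts 60 minutes but 45 of them fall inside a planned maintenance window, so only 15 minutes count, for 45 minutes of reportable downtime in total.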
SLO Monitoring in Practice
Tracking SLO Burn Rate
A burn rate tells you how fast you're consuming your error budget relative to the allowed rate. An SLO with a 28-day window burns at "1x" when it's on track to use exactly the error budget in 28 days.
If your burn rate is 2x, you'll exhaust the error budget in 14 days instead of 28 - the current incident or degradation is happening faster than your budget allows.
Burn rate formula:
Burn rate = Current error rate / (1 - SLO target)
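A sustained burn rate far above 1x over a short window warrants a page to the on-call engineer. At 14.4x, a single hour consumes about 2% of a 30-day budget.
Alerting on SLO Violations
Don't alert on every SLI measurement. Alert on budget burn rate, using a higher threshold for shorter windows. The thresholds below are the ones commonly cited from Google's SRE Workbook for a 30-day window:
- Fast burn alert: Burn rate of 14.4x or more over the past 60 minutes (about 2% of the budget consumed in one hour). High priority, page the on-call engineer.
- Slow burn alert: Burn rate of 6x or more over the past 6 hours (about 5% of the budget consumed). Medium priority, alert the team.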
- Budget exhausted: SLO has been breached this period. Highest priority.
This approach reduces alert noise significantly compared to raw error rate alerting.
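A burn-rate check can be sketched as two small functions. The thresholds are passed in as parameters because the right values depend on your window and policy; the example call uses 14.4x and 6x, figures commonly cited from Google's SRE Workbook for a 30-day window. The function names and return labels are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    Both arguments are fractions, e.g. error_rate=0.002, slo_target=0.999.
    """
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def classify_alert(rate_1h: float, rate_6h: float,
                   fast_threshold: float, slow_threshold: float) -> str:
    """Decide what to do given burn rates over the last 1 hour and 6 hours."""
    if rate_1h >= fast_threshold:
        return "page"    # fast burn: page the on-call engineer
    if rate_6h >= slow_threshold:
        return "notify"  # slow burn: alert the team
    return "ok"


rate_1h = burn_rate(error_rate=0.020, slo_target=0.999)  # roughly 20x
rate_6h = burn_rate(error_rate=0.004, slo_target=0.999)  # roughly 4x
print(classify_alert(rate_1h, rate_6h, fast_threshold=14.4, slow_threshold=6.0))
```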
Building an SLO Dashboard
An effective SLO dashboard shows:
- Current availability percentage (rolling 28 days)
- Error budget remaining (% and absolute time)
- Current burn rate
- Time to budget exhaustion at current rate
- Incident history for the period
Vantaj's uptime percentage tracking and incident history provide the raw data for this. For the error budget calculations, most teams build a lightweight dashboard in their observability platform using the uptime data as input.
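If you do build the calculations yourself, the dashboard values can all be derived from three inputs: the SLO target, the downtime recorded in the window, and the current burn rate. The sketch below is one possible shape for that calculation and is not tied to any particular monitoring product; every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SLOSnapshot:
    availability_percent: float
    budget_remaining_percent: float
    budget_remaining_minutes: float
    burn_rate: float
    days_to_exhaustion: float  # float("inf") when nothing is burning


def slo_snapshot(slo_target: float, downtime_minutes: float,
                 burn_rate: float, window_days: int = 28) -> SLOSnapshot:
    """Derive dashboard values for a rolling window from raw uptime data."""
    window_minutes = window_days * 24 * 60
    budget = (1 - slo_target) * window_minutes
    remaining = budget - downtime_minutes
    # At a burn rate of 1x the full budget lasts exactly one window.
    burned_per_day = burn_rate * budget / window_days
    if remaining <= 0:
        days_left = 0.0
    elif burned_per_day > 0:
        days_left = remaining / burned_per_day
    else:
        days_left = float("inf")
    return SLOSnapshot(
        availability_percent=100 * (window_minutes - downtime_minutes) / window_minutes,
        budget_remaining_percent=100 * remaining / budget,
        budget_remaining_minutes=remaining,
        burn_rate=burn_rate,
        days_to_exhaustion=days_left,
    )


snap = slo_snapshot(slo_target=0.999, downtime_minutes=10.08, burn_rate=1.5)
print(f"availability {snap.availability_percent:.3f}%")
print(f"budget left  {snap.budget_remaining_percent:.0f}% "
      f"({snap.budget_remaining_minutes:.2f} min)")
print(f"exhausted in {snap.days_to_exhaustion:.1f} days at {snap.burn_rate}x")
```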
Common Mistakes
Setting SLOs without measuring SLIs first. You can't set a meaningful target without baseline data. Run your monitoring for 30 days and look at actual availability before committing to a number.
Setting the SLA equal to the SLO. Always keep a buffer. Your SLA should be at least 0.05–0.1 percentage points below your SLO.
Measuring from internal health checks only. Internal health checks miss DNS failures, CDN issues, and network problems. External synthetic monitoring is required for accurate SLA reporting.
Ignoring planned maintenance. Every minute your service is unavailable counts against your SLO unless you have a formal maintenance window process with adequate customer notice.
Never reviewing the SLO. SLOs become meaningless if they're set once and never adjusted. Review them when you change architecture, when you hit a new reliability milestone, or when your SLA commitments change.
Frequently Asked Questions
What is the difference between SLA and SLO?
An SLA (Service Level Agreement) is a contract with customers defining minimum uptime commitments and the consequences of falling short, typically in the form of service credits. An SLO (Service Level Objective) is an internal target your team sets for itself, always stricter than the SLA. The gap between SLO and SLA is your buffer against contract breaches.
What is an error budget?
An error budget is the amount of downtime or errors your SLO permits over a given measurement window. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes. When that budget runs out, your SLO is breached. Error budgets translate abstract percentages into concrete time, making them easier to reason about in engineering decisions.
What is an SLI?
An SLI (Service Level Indicator) is the specific metric you measure to track service health: request success rate, response latency, error rate, or availability percentage. SLIs are the raw data that SLOs are defined against. A good SLI directly reflects what users experience, not just what's easy to measure from your infrastructure.
How do I calculate SLA uptime?
Uptime percentage = (total time in period - total downtime in period) / total time in period × 100. For accurate SLA reporting, use external synthetic monitoring data, not internal metrics. Internal metrics can miss outages caused by DNS, network, or load balancer issues that are invisible to your application. Exclude any planned maintenance windows that were communicated in advance per your SLA terms.
What SLO should I start with?
99.9% availability is a reasonable starting point for most B2B SaaS applications. Before committing, measure your actual current availability for 30 days using external monitoring. If you're currently achieving 99.5%, setting a 99.9% SLO requires real reliability investment. If you're already at 99.95%, you can set a more ambitious target.