SLA vs SLO vs SLI: Key Differences, Real Examples, and the Mistakes That Break Teams
SLI, SLO, and SLA are three distinct concepts that most teams use interchangeably until something breaks. The conflation is harmless until you are in an incident review trying to explain to a customer whether you breached your SLA, or until engineering and product disagree on whether a deployment should ship because the error budget is at 23%.
This guide defines each concept precisely, shows real examples of how they work together, and covers the specific mistakes that turn a good reliability framework into bureaucracy.
Definitions: the one-paragraph version
SLI (Service Level Indicator): A specific quantitative measurement of your service's behavior. SLIs are raw numbers - a percentage, a latency value, an error count. They answer the question: what did the service actually do?
SLO (Service Level Objective): A target range or threshold for an SLI, set internally by your engineering team. SLOs are commitments to yourselves. They answer: what should the service do?
SLA (Service Level Agreement): A contractual commitment made to external parties - customers, partners, regulators - about service behavior. SLAs carry consequences (refunds, credits, termination rights) when breached. They answer: what did we promise customers?
The relationship: SLIs are what you measure, SLOs are what you target, SLAs are what you commit to externally. Each one feeds into the next.
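As a minimal sketch of how the three layers relate, the snippet below computes a success-rate SLI from request counts and compares it against an SLO and an SLA. The names `ServiceLevels`, `success_rate_sli`, and `evaluate` are hypothetical, the targets are example values, and treating a window with no traffic as fully successful is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevels:
    slo_target: float  # internal objective, e.g. 0.999 (example value)
    sla_target: float  # contractual commitment, e.g. 0.995 (example value)


def success_rate_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return successful / total


def evaluate(sli: float, levels: ServiceLevels) -> str:
    """Compare a measured SLI against the internal and external targets."""
    if sli < levels.sla_target:
        return "SLA breached"
    if sli < levels.slo_target:
        return "SLO missed, SLA intact"
    return "within SLO"


levels = ServiceLevels(slo_target=0.999, sla_target=0.995)
sli = success_rate_sli(successful=999_400, total=1_000_000)
print(f"SLI {sli:.2%}: {evaluate(sli, levels)}")
```

The middle outcome, SLO missed but SLA intact, is the state in which the team should react before customers are owed anything.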
Side-by-side comparison
| Dimension | SLI | SLO | SLA |
|---|---|---|---|
| What it is | Raw measurement | Internal target | External contract |
| Who sets it | Engineering | Engineering + Product | Business + Legal |
| Audience | Engineering team | Engineering + Leadership | Customers / Partners |
| Consequence if missed | Data point | Engineering priority shift | Refunds, credits, churn |
| Example | 99.94% request success rate | 99.9% request success rate | 99.5% uptime |
| Typical strictness | Actual behavior | Tighter than SLA | More lenient than SLO |
Real examples from common SaaS categories
E-commerce / transactional SaaS
| Layer | SLI | SLO | SLA |
|---|---|---|---|
| Checkout availability | % of checkout requests returning 2xx | 99.95% per month | 99.5% per month |
| Payment processing latency | p99 latency for payment API | Under 2,000ms | Under 5,000ms |
| Order confirmation email | % of confirmation emails delivered in 5 min | 99.5% | 99% |
API-first SaaS
| Layer | SLI | SLO | SLA |
|---|---|---|---|
| API availability | % successful API responses | 99.9% per calendar month | 99.5% per calendar month |
| API latency | p95 response time across all endpoints | Under 500ms | Under 2,000ms |
| Webhook delivery | % of webhooks delivered within 30 seconds | 99.5% | 99% |
Infrastructure / platform SaaS
| Layer | SLI | SLO | SLA |
|---|---|---|---|
| Data plane availability | % of read/write operations succeeding | 99.99% | 99.9% |
| Job execution reliability | % of scheduled jobs completing without error | 99.5% | 99% |
| Dashboard availability | % of dashboard page loads returning 200 | 99.5% | 99% |
The SLO-SLA gap: why it exists and how big it should be
The gap between your SLO (internal target) and SLA (external commitment) is your operational margin. It gives your team room to detect and fix problems before they become contractual breaches.
The gap should reflect your actual operational capability:
| Operational maturity | Suggested SLO-SLA gap |
|---|---|
| No dedicated SRE team, manual incident response | 0.4 to 0.5 percentage points (e.g., SLA 99.5%, SLO 99.9%) |
| On-call rotation, automated alerting | 0.2 to 0.3 percentage points |
| Dedicated SRE team, mature incident response | 0.05 to 0.1 percentage points |
Setting the gap too small (SLO ≈ SLA) means a single significant incident immediately threatens your contractual commitment. Setting it too large means your SLA is so lenient it provides no meaningful customer protection.
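To make the gap concrete, the sketch below converts an SLO/SLA pair into the extra minutes of downtime per month that the gap absorbs. It assumes an average month of 43,800 minutes (365 days / 12); `margin_minutes` is an illustrative helper, not part of any library.

```python
MINUTES_PER_MONTH = 43_800  # assumption: average month, 365 days * 24 h * 60 min / 12


def margin_minutes(slo_pct: float, sla_pct: float,
                   period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of downtime between missing the SLO and breaching the SLA."""
    if slo_pct < sla_pct:
        raise ValueError("SLO should be at least as strict as the SLA")
    gap_points = slo_pct - sla_pct  # gap in percentage points
    return gap_points / 100 * period_minutes


# Example pairs only; the first matches the SLA 99.5% / SLO 99.9% example above.
for slo, sla in [(99.9, 99.5), (99.9, 99.7), (99.95, 99.9)]:
    print(f"SLO {slo}% / SLA {sla}%: {margin_minutes(slo, sla):.1f} min of margin")
```

A 0.4-point gap leaves roughly three hours a month between an internal miss and a contractual breach, while a 0.05-point gap leaves about 22 minutes, which is why the smaller gaps only suit teams with fast detection and response.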
Error budgets: making SLOs usable
An error budget is the amount of failure your SLO allows in a given period. It converts an abstract percentage into a concrete operational resource.
Error budget calculation:
Error budget = (1 - SLO target) × period duration
For a 99.9% monthly SLO:
- Monthly period = 43,800 minutes (an average month: 365 days × 24 hours × 60 minutes ÷ 12)
- Error budget = 0.001 × 43,800 = 43.8 minutes per month
For a 99.5% monthly SLO:
- Error budget = 0.005 × 43,800 = 219 minutes per month
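The same arithmetic as a short sketch, with one extra helper that reports how much of the budget is left after some downtime has been spent. Both function names are hypothetical, and the default period is the 43,800-minute average month used above.

```python
MINUTES_PER_MONTH = 43_800  # average month, as in the calculation above


def error_budget_minutes(slo_target: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Error budget = (1 - SLO target) * period duration."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1 - slo_target) * period_minutes


def budget_remaining(slo_target: float, bad_minutes: float,
                     period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Fraction of the error budget still unspent, floored at zero."""
    budget = error_budget_minutes(slo_target, period_minutes)
    return max(0.0, 1 - bad_minutes / budget)


print(f"99.9% SLO: {error_budget_minutes(0.999):.1f} min per month")
print(f"99.5% SLO: {error_budget_minutes(0.995):.1f} min per month")
print(f"After 30 bad minutes at 99.9%: {budget_remaining(0.999, 30):.0%} left")
```

Passing `period_minutes=43_200` models a fixed 30-day window instead, which gives 43.2 minutes for a 99.9% SLO.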
How teams use error budgets:
- Budget > 50% remaining: Engineering can deploy freely, experiment with reliability tradeoffs
- Budget 20–50% remaining: Deploy with caution, require a post-incident review for every incident
- Budget < 20% remaining: Freeze non-critical deployments, prioritize reliability fixes
- Budget exhausted: Stop new feature work until reliability is restored
Error budgets answer the question that causes the most team friction: "should we ship this risky change?" When the budget is healthy, yes. When it's nearly gone, no. The number removes the political dimension from the decision.
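The thresholds above can be written down as a small policy function, which is one way to keep the decision mechanical. This is a sketch: `deployment_policy` is a hypothetical name, and treating exactly 20% and exactly 50% as part of the "caution" band is an assumption the list does not settle.

```python
def deployment_policy(remaining: float) -> str:
    """Map the unspent fraction of the error budget (0.0 to 1.0) to a policy."""
    if remaining <= 0:
        return "stop new feature work until reliability is restored"
    if remaining < 0.20:
        return "freeze non-critical deployments, prioritize reliability fixes"
    if remaining <= 0.50:  # assumption: boundaries fall in the caution band
        return "deploy with caution"
    return "deploy freely"


for remaining in (0.80, 0.23, 0.15, 0.0):
    print(f"{remaining:.0%} remaining: {deployment_policy(remaining)}")
```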
To track your error budget in real time, see SLI, SLO, SLA implementation guide and uptime SLA monitoring.
Choosing the right SLIs
The most common SLI mistake is measuring the wrong thing. The right SLI measures what users experience, not what your infrastructure reports.
User-facing SLIs (measure these):
- Request success rate (percentage of requests returning expected response)
- Request latency (p50, p95, p99 latency for user-initiated requests)
- Availability (percentage of time the service responds to checks)
- Error rate (percentage of requests resulting in error responses)
Infrastructure SLIs (measure as diagnostics, not SLOs):
- CPU utilization
- Memory usage
- Database connection pool saturation
- Queue depth
CPU at 90% is not an SLI - users don't experience CPU utilization. Users experience slow responses, which you capture with a latency SLI. Infrastructure metrics are diagnostic signals that explain why an SLI degraded, not SLIs themselves.
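A minimal sketch of two user-facing SLIs computed from request records. The `Request` type and the sample data are made up for illustration. Counting only 2xx responses as success and using the nearest-rank method for percentiles are assumptions; a real system would usually read these values from its monitoring backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    if not requests:
        raise ValueError("no requests in window")
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return ok / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests in window")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample: ten requests, the slowest of which failed with a 500.
sample = [Request(200, ms) for ms in range(100, 1000, 100)] + [Request(500, 1000)]
print(f"success rate: {success_rate(sample):.1%}")
print(f"p50: {latency_percentile(sample, 50)} ms")
print(f"p95: {latency_percentile(sample, 95)} ms")
```

Nothing in this calculation refers to CPU, memory, or queue depth; those metrics become relevant only when one of these numbers moves and you need to explain why.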
What good SLOs look like vs common failure modes
Good SLO characteristics:
- Tied to a specific user journey or service function, not a generic "uptime" claim
- Measurable automatically from existing monitoring data
- Reviewed and updated quarterly as traffic patterns change
- Set with input from the people who will be paged when they are breached
Common SLO mistakes:
| Mistake | What it looks like | Why it fails |
|---|---|---|
| SLO equals SLA | "We promise 99.9%, so we target 99.9%" | No operational margin; any significant incident immediately breaches the customer commitment |
| SLO too ambitious | Setting 99.999% when historical availability is 99.7% | The SLO is breached constantly, teams start ignoring alerts, the metric loses credibility |
| Wrong SLI | Measuring "server health" instead of user request success rate | Infrastructure looks fine while user experience degrades |
| No review cadence | Setting SLOs and never revisiting them | SLOs drift from reality as traffic patterns, dependencies, and architecture change |
| Too many SLOs | Tracking 40 SLOs per service | Alert fatigue, confusion about priorities, no clear "is this a problem?" signal |
The right number of SLOs: 2 to 4 per critical service. One availability SLO, one latency SLO, and optionally one for a business-critical operation (checkout, auth, data processing). More than that creates noise.
SLA design: what to include and what to exclude
SLAs are legal documents as much as technical ones. How you structure them determines whether they create accountability or create liability.
What to specify clearly:
- The specific services and components covered
- The measurement method and window (calendar month vs rolling 30 days)
- What counts as downtime (how many probe locations must fail, what response codes count)
- Notification requirements on your end
- Credit calculation formula and maximum credit amount
- Credit claim process and timeline
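As an illustration of how precise the "what counts as downtime" clause has to be, the sketch below encodes one possible rule: a check interval counts as downtime only when at least two probe locations fail, where a failure is a timeout or a 5xx response. The quorum of two, the failure definition, the location names, and the function names are all assumptions for the example, not a recommended standard.

```python
from typing import Mapping, Optional


def probe_failed(status: Optional[int]) -> bool:
    """Assumption: a timeout (None) or any 5xx response is a failed probe."""
    return status is None or status >= 500


def counts_as_downtime(results: Mapping[str, Optional[int]],
                       min_failing: int = 2) -> bool:
    """True when enough probe locations failed in the same check interval."""
    failing = sum(1 for status in results.values() if probe_failed(status))
    return failing >= min_failing


# Hypothetical probe locations and results for two check intervals.
regional_blip = {"us-east": 503, "eu-west": 200, "ap-south": 200}
real_outage = {"us-east": 503, "eu-west": 200, "ap-south": None}
print(counts_as_downtime(regional_blip))
print(counts_as_downtime(real_outage))
```

Whatever rule you choose, stating it this explicitly in the SLA lets both sides reach the same uptime number from the same probe data.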
Exclusions that belong in every SLA:
- Scheduled maintenance windows (defined in advance with customer notification)
- Incidents caused by customer actions or third-party services outside your control
- Force majeure events
- Incidents during customer's free trial period
- Incidents that occurred but were not reported within a defined window
Credit structure example:
| Monthly uptime | Credit |
|---|---|
| 99.5% to 99.9% (SLA threshold to SLO threshold) | None - within SLA |
| 99.0% to 99.5% | 10% of monthly fee |
| 95.0% to 99.0% | 25% of monthly fee |
| 90.0% to 95.0% | 50% of monthly fee |
| Below 90.0% | 100% of monthly fee |
Maximum annual credit is typically capped at 3 to 6 months of fees to limit liability exposure.
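The credit table can be expressed as a lookup, sketched below. It assumes that the lower bound of each band is inclusive and that the most severe matching tier applies; a real contract has to state both. `credit_percent` and `credit_amount` are hypothetical names, and the tiers are the example values from the table.

```python
# (minimum monthly uptime %, credit as % of monthly fee), checked top down.
CREDIT_TIERS = [
    (99.5, 0),   # at or above the SLA threshold: no credit
    (99.0, 10),
    (95.0, 25),
    (90.0, 50),
]


def credit_percent(uptime_pct: float) -> int:
    """Credit owed for a month, as a percentage of the monthly fee."""
    for floor, credit in CREDIT_TIERS:
        if uptime_pct >= floor:
            return credit
    return 100  # below 90.0% uptime


def credit_amount(uptime_pct: float, monthly_fee: float) -> float:
    """Credit owed for a month, in the same currency as the fee."""
    return monthly_fee * credit_percent(uptime_pct) / 100


for uptime in (99.7, 99.2, 97.0, 92.0, 85.0):
    print(f"{uptime}% uptime: {credit_percent(uptime)}% credit")
```

An annual cap, if the contract has one, would be applied on top of this monthly calculation.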
The organizational challenges nobody talks about
SLOs require someone to own them. An SLO with no named owner degrades into a number that gets updated once a year before board meetings. Someone must review the error budget weekly, triage when it drops, and advocate for reliability work when the budget is at risk.
SLOs create prioritization conflict. When an error budget hits 15%, the right answer is to slow down feature deployments and focus on reliability. Product managers focused on roadmap velocity resist this. The SLO framework only works if engineering leadership is willing to enforce the error budget decision.
SLAs create legal exposure without the operational infrastructure to back them. Teams sometimes sign SLA commitments before they have the monitoring to track them, the incident response process to meet them, or the operational reliability to stay within them. The result: an SLA breach happens, the customer requests credits, and the team discovers they have no reliable data to dispute or confirm the claim.
For tracking SLA compliance with reliable data, see uptime SLA monitoring and MTTR, MTTD, MTBF metrics.
Starting from scratch: a minimal viable SLO
If your team has no SLOs yet, start with one:
- Pick your most critical user-facing API endpoint (auth, checkout, or core data API)
- Define the SLI: percentage of requests returning 2xx over a calendar month
- Look at your last 90 days of data. What was your actual success rate?
- Set your SLO at a level you already meet in about 90% of periods, which usually lands slightly below your typical performance (for example, if you've been at 99.8% most months, set the SLO at 99.5%)
- Calculate the error budget and post it somewhere visible
- Set an alert for when the budget drops below 50%
That's enough to start having meaningful reliability conversations. Expand from there once the team is comfortable with the framework.
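The steps above can be sketched end to end. This is a simplification under stated assumptions: the ladder of candidate targets is hypothetical, picking the highest candidate strictly below the observed rate is a shortcut that mirrors the 99.8% to 99.5% example rather than a true percentile calculation, and the budget is counted in failed requests over the requests seen so far rather than in minutes.

```python
# Hypothetical ladder of candidate SLO targets, strictest first.
CANDIDATE_TARGETS = [0.9999, 0.9995, 0.999, 0.995, 0.99, 0.95]


def suggest_slo(observed_rate: float) -> float:
    """Pick the strictest candidate target below the observed success rate."""
    for target in CANDIDATE_TARGETS:
        if target < observed_rate:
            return target
    return CANDIDATE_TARGETS[-1]


def budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Unspent fraction of a request-based error budget, floored at zero."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed / allowed_failures)


def should_alert(slo_target: float, total: int, failed: int,
                 threshold: float = 0.5) -> bool:
    """Alert once less than `threshold` of the budget remains."""
    return budget_remaining(slo_target, total, failed) < threshold


slo = suggest_slo(0.998)  # observed success rate from the last 90 days
print(f"suggested SLO: {slo:.1%}")
print(f"alert at 2,000 failures of 1M requests: {should_alert(slo, 1_000_000, 2_000)}")
print(f"alert at 3,000 failures of 1M requests: {should_alert(slo, 1_000_000, 3_000)}")
```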