SLA vs SLO vs SLI: Key Differences, Real Examples, and the Mistakes That Break Teams

SLI, SLO, and SLA are three distinct concepts that most engineering teams conflate. This guide covers the real differences with concrete examples, common failure modes, and how to implement all three without turning them into bureaucracy.

Theo Cummings · July 10, 2026 · 12 min read

SLI, SLO, and SLA are three distinct concepts that most teams use interchangeably until something breaks. The conflation is harmless until you are in an incident review trying to explain to a customer whether you breached your SLA, or until engineering and product disagree on whether a deployment should ship because the error budget is at 23%.

This guide defines each concept precisely, shows real examples of how they work together, and covers the specific mistakes that turn a good reliability framework into bureaucracy.

Definitions: the one-paragraph version

SLI (Service Level Indicator): A specific quantitative measurement of your service's behavior. SLIs are raw numbers - a percentage, a latency value, an error count. They answer the question: what did the service actually do?

SLO (Service Level Objective): A target range or threshold for an SLI, set internally by your engineering team. SLOs are commitments to yourselves. They answer: what should the service do?

SLA (Service Level Agreement): A contractual commitment made to external parties - customers, partners, regulators - about service behavior. SLAs carry consequences (refunds, credits, termination rights) when breached. They answer: what did we promise customers?

The relationship: SLIs are what you measure, SLOs are what you target, SLAs are what you commit to externally. Each one feeds into the next.
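To make the relationship concrete, here is a minimal sketch that measures a success-rate SLI and compares it with an internal SLO and an external SLA. The names `ServiceLevels`, `success_rate_sli`, and `evaluate` are hypothetical, and the request counts are made-up illustration values, not data from a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    slo_target: float  # internal objective, e.g. 0.999
    sla_target: float  # external commitment, e.g. 0.995


def success_rate_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def evaluate(sli: float, levels: ServiceLevels) -> str:
    """Compare a measured SLI with the internal and external targets."""
    if sli < levels.sla_target:
        return "SLA breached"
    if sli < levels.slo_target:
        return "SLO missed, SLA met"
    return "SLO met"


# Illustrative numbers only: 999,400 successes out of 1,000,000 requests.
levels = ServiceLevels(slo_target=0.999, sla_target=0.995)
sli = success_rate_sli(successful=999_400, total=1_000_000)
print(f"SLI = {sli:.2%}: {evaluate(sli, levels)}")
```

The middle outcome, "SLO missed, SLA met", is the one the framework exists for: an internal warning that arrives before any contractual consequence.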

Side-by-side comparison

Dimension | SLI | SLO | SLA
What it is | Raw measurement | Internal target | External contract
Who sets it | Engineering | Engineering + Product | Business + Legal
Audience | Engineering team | Engineering + Leadership | Customers / Partners
Consequence if missed | Data point | Engineering priority shift | Refunds, credits, churn
Example | 99.94% request success rate | 99.9% request success rate | 99.5% uptime
Typical strictness | Actual behavior | Tighter than SLA | More lenient than SLO

Real examples from common SaaS categories

E-commerce / transactional SaaS

Layer | SLI | SLO | SLA
Checkout availability | % of checkout requests returning 2xx | 99.95% per month | 99.5% per month
Payment processing latency | p99 latency for payment API | Under 2,000 ms | Under 5,000 ms
Order confirmation email | % of confirmation emails delivered in 5 min | 99.5% | 99%

API-first SaaS

Layer | SLI | SLO | SLA
API availability | % successful API responses | 99.9% per calendar month | 99.5% per calendar month
API latency | p95 response time across all endpoints | Under 500 ms | Under 2,000 ms
Webhook delivery | % of webhooks delivered within 30 seconds | 99.5% | 99%

Infrastructure / platform SaaS

Layer | SLI | SLO | SLA
Data plane availability | % of read/write operations succeeding | 99.99% | 99.9%
Job execution reliability | % of scheduled jobs completing without error | 99.5% | 99%
Dashboard availability | % of dashboard page loads returning 200 | 99.5% | 99%

The SLO-SLA gap: why it exists and how big it should be

The gap between your SLO (internal target) and SLA (external commitment) is your operational margin. It gives your team room to detect and fix problems before they become contractual breaches.

The gap should reflect your actual operational capability:

Operational maturity | Suggested SLO-SLA gap
No dedicated SRE team, manual incident response | 0.4 to 0.5 percentage points (e.g., SLA 99.5%, SLO 99.9%)
On-call rotation, automated alerting | 0.2 to 0.3 percentage points
Dedicated SRE team, mature incident response | 0.05 to 0.1 percentage points

Setting the gap too small (SLO ≈ SLA) means a single significant incident immediately threatens your contractual commitment. Setting it too large means your SLA is so lenient it provides no meaningful customer protection.
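A percentage-point gap is easier to reason about once it is converted into minutes. The sketch below does that conversion, assuming an average month of 43,800 minutes (365 days / 12); the function names are hypothetical and the SLO/SLA pairs are illustrative.

```python
MINUTES_PER_MONTH = 43_800  # assumed average month: 365 days * 24 h * 60 min / 12


def allowed_downtime_minutes(target_percent: float,
                             period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of failure a percentage target tolerates in one period."""
    return (100.0 - target_percent) / 100.0 * period_minutes


def operational_margin_minutes(slo_percent: float, sla_percent: float,
                               period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Extra minutes between exhausting the SLO budget and breaching the SLA."""
    if slo_percent < sla_percent:
        raise ValueError("SLO should be at least as strict as the SLA")
    return (allowed_downtime_minutes(sla_percent, period_minutes)
            - allowed_downtime_minutes(slo_percent, period_minutes))


# Illustrative SLO/SLA pairs, from a wide gap to a narrow one.
for slo, sla in [(99.9, 99.5), (99.9, 99.7), (99.9, 99.85)]:
    margin = operational_margin_minutes(slo, sla)
    print(f"SLO {slo}% / SLA {sla}%: {margin:.1f} minutes of margin per month")
```

With a 99.9% SLO and a 99.5% SLA, the margin works out to roughly 175 minutes per month: the time your team has to detect and fix a problem after the internal target is blown but before the contract is.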

Error budgets: making SLOs usable

An error budget is the amount of failure your SLO permits in a given period. It converts an abstract percentage into a concrete operational resource.

Error budget calculation:

Error budget = (1 - SLO target) × period duration

For a 99.9% monthly SLO:

  • Monthly period = 43,800 minutes (an average month: 365 days × 24 hours × 60 minutes ÷ 12)
  • Error budget = 0.001 × 43,800 = 43.8 minutes per month

For a 99.5% monthly SLO:

  • Error budget = 0.005 × 43,800 = 219 minutes per month
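The calculation above is small enough to keep as a helper. This is a minimal sketch; `error_budget_minutes` and `budget_remaining` are hypothetical names, and the 30 minutes of bad time in the last line is an illustrative value.

```python
def error_budget_minutes(slo_target: float, period_minutes: float) -> float:
    """Error budget = (1 - SLO target) x period duration."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * period_minutes


def budget_remaining(slo_target: float, period_minutes: float,
                     bad_minutes: float) -> float:
    """Fraction of the budget still unspent; negative once it is overspent."""
    budget = error_budget_minutes(slo_target, period_minutes)
    return (budget - bad_minutes) / budget


MONTH = 43_800  # minutes in an average month
print(round(error_budget_minutes(0.999, MONTH), 1))  # 43.8
print(round(error_budget_minutes(0.995, MONTH), 1))  # 219.0
print(f"{budget_remaining(0.999, MONTH, bad_minutes=30.0):.0%} of budget left")
```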

How teams use error budgets:

  • Budget > 50% remaining: Engineering can deploy freely, experiment with reliability tradeoffs
  • Budget 20–50% remaining: Deploy with caution, require a post-incident review for every incident
  • Budget < 20% remaining: Freeze non-critical deployments, prioritize reliability fixes
  • Budget exhausted: Stop new feature work until reliability is restored

Error budgets answer the question that causes the most team friction: "should we ship this risky change?" When the budget is healthy, yes. When it's nearly gone, no. The number removes the political dimension from the decision.
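The thresholds above can be encoded as a small policy function so the answer comes from the number rather than from a meeting. This is a sketch under the thresholds listed in this guide; `deployment_policy` is a hypothetical name, and treating exactly 20% and exactly 50% as "caution" is an assumption.

```python
def deployment_policy(remaining_fraction: float) -> str:
    """Map the unspent share of the error budget to a deployment stance."""
    if remaining_fraction <= 0.0:
        return "stop new feature work until reliability is restored"
    if remaining_fraction < 0.20:
        return "freeze non-critical deployments"
    if remaining_fraction <= 0.50:
        return "deploy with caution"
    return "deploy freely"


# Illustrative budget levels, from healthy to exhausted.
for remaining in (0.80, 0.23, 0.15, 0.0):
    print(f"{remaining:.0%} remaining -> {deployment_policy(remaining)}")
```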

For monitoring how your error budget tracks in real time, see SLI, SLO, SLA implementation guide and uptime SLA monitoring.

Choosing the right SLIs

The most common SLI mistake is measuring the wrong thing. The right SLI measures what users experience, not what your infrastructure reports.

User-facing SLIs (measure these):

  • Request success rate (percentage of requests returning expected response)
  • Request latency (p50, p95, p99 latency for user-initiated requests)
  • Availability (percentage of time the service responds to checks)
  • Error rate (percentage of requests resulting in error responses)

Infrastructure SLIs (measure as diagnostics, not SLOs):

  • CPU utilization
  • Memory usage
  • Database connection pool saturation
  • Queue depth

CPU at 90% is not an SLI - users don't experience CPU utilization. Users experience slow responses, which you capture with a latency SLI. Infrastructure metrics are diagnostic signals that explain why an SLI degraded, not SLIs themselves.
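A latency SLI is computed from what requests actually experienced. The sketch below uses the nearest-rank percentile method, which is one of several common definitions, so treat the choice as an assumption. The `samples` list is made-up data, and `percentile` and `latency_slo_met` are hypothetical names.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_slo_met(latencies_ms: Sequence[float], pct: int,
                    threshold_ms: float) -> bool:
    """True when the chosen percentile is strictly under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


# Made-up request latencies in milliseconds: mostly fast, a few very slow.
samples = [120.0] * 90 + [450.0] * 6 + [2500.0] * 4

print("p95 =", percentile(samples, 95), "ms")
print("p99 =", percentile(samples, 99), "ms")
print("p95 under 500 ms:", latency_slo_met(samples, 95, 500.0))
print("p99 under 2,000 ms:", latency_slo_met(samples, 99, 2000.0))
```

In this sample the p95 looks healthy while the p99 does not, which is why the percentile you pick for the SLI matters as much as the threshold.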

What good SLOs look like vs common failure modes

Good SLO characteristics:

  • Tied to a specific user journey or service function, not a generic "uptime" claim
  • Measurable automatically from existing monitoring data
  • Reviewed and updated quarterly as traffic patterns change
  • Set with input from the people who will be paged when they are breached

Common SLO mistakes:

Mistake | What it looks like | Why it fails
SLO equals SLA | "We promise 99.9%, so we target 99.9%" | No operational margin; any significant incident immediately breaches the customer commitment
SLO too ambitious | Setting 99.999% when historical availability is 99.7% | The SLO is breached constantly, teams start ignoring alerts, the metric loses credibility
Wrong SLI | Measuring "server health" instead of user request success rate | Infrastructure looks fine while user experience degrades
No review cadence | Setting SLOs and never revisiting them | SLOs drift from reality as traffic patterns, dependencies, and architecture change
Too many SLOs | Tracking 40 SLOs per service | Alert fatigue, confusion about priorities, no clear "is this a problem?" signal

The right number of SLOs per service: 2 to 4 per critical service. One availability SLO, one latency SLO, and optionally one for a business-critical operation (checkout, auth, data processing). More than that creates noise.

SLA design: what to include and what to exclude

SLAs are legal documents as much as technical ones. How you structure them determines whether they create accountability or create liability.

What to specify clearly:

  • The specific services and components covered
  • The measurement method and window (calendar month vs rolling 30 days)
  • What counts as downtime (how many probe locations must fail, what response codes count)
  • Notification requirements on your end
  • Credit calculation formula and maximum credit amount
  • Credit claim process and timeline
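"What counts as downtime" is the item most worth writing down as an executable rule. The sketch below assumes one hypothetical rule: a minute counts as down when at least two probe locations fail, where a failure is a timeout or a 5xx response. Your SLA may define it differently; the point is that the definition should be this unambiguous.

```python
from typing import Optional, Sequence

# One entry per probe location: an HTTP status code, or None for a timeout.
ProbeResults = Sequence[Optional[int]]


def probe_failed(status: Optional[int]) -> bool:
    """Assumed rule: timeouts and 5xx responses count as failures."""
    return status is None or status >= 500


def is_downtime(probe_statuses: ProbeResults, min_failing: int = 2) -> bool:
    """A minute is down when at least min_failing locations fail."""
    failing = sum(probe_failed(s) for s in probe_statuses)
    return failing >= min_failing


def uptime_percent(minutes: Sequence[ProbeResults], min_failing: int = 2) -> float:
    """Share of measured minutes that do not count as downtime."""
    if not minutes:
        raise ValueError("need at least one measured minute")
    down = sum(is_downtime(m, min_failing) for m in minutes)
    return 100.0 * (len(minutes) - down) / len(minutes)


# Made-up probe data: three locations, four one-minute samples.
window = [
    [200, 200, 200],
    [None, 503, 200],  # two locations failing: counts as downtime
    [200, 200, 503],   # one location failing: does not count
    [200, 200, 200],
]
print(uptime_percent(window))
```

Requiring more than one failing location keeps a single flaky probe from registering as an outage, at the cost of ignoring genuinely regional failures.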

Exclusions that belong in every SLA:

  • Scheduled maintenance windows (defined in advance with customer notification)
  • Incidents caused by customer actions or third-party services outside your control
  • Force majeure events
  • Incidents during a customer's free trial period
  • Incidents that occurred but were not reported within a defined window

Credit structure example:

Monthly uptime | Credit
99.5% to 99.9% (SLA threshold to SLO threshold) | None - within SLA
99.0% to 99.5% | 10% of monthly fee
95.0% to 99.0% | 25% of monthly fee
90.0% to 95.0% | 50% of monthly fee
Below 90.0% | 100% of monthly fee

Maximum annual credit is typically capped at 3 to 6 months of fees to limit liability exposure.
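A credit schedule like this one is easiest to audit when it is expressed as a lookup. The sketch below encodes the example tiers, assuming each tier's lower bound is inclusive and the lowest matching tier applies; `CREDIT_TIERS`, `credit_percent`, and `credit_amount` are hypothetical names, and the annual cap is left out for brevity.

```python
# (minimum monthly uptime %, credit as % of monthly fee), strictest tier first.
CREDIT_TIERS = [
    (99.5, 0),   # at or above the SLA threshold: no credit
    (99.0, 10),
    (95.0, 25),
    (90.0, 50),
]


def credit_percent(monthly_uptime_percent: float) -> int:
    """Return the credit owed as a percentage of the monthly fee."""
    for floor, credit in CREDIT_TIERS:
        if monthly_uptime_percent >= floor:
            return credit
    return 100  # below every tier


def credit_amount(monthly_uptime_percent: float, monthly_fee: float) -> float:
    """Convert the credit percentage into a currency amount."""
    return monthly_fee * credit_percent(monthly_uptime_percent) / 100


# Illustrative uptime figures against a made-up 500.00 monthly fee.
for uptime in (99.7, 99.2, 97.0, 92.0, 85.0):
    print(f"{uptime}% uptime -> {credit_percent(uptime)}% credit, "
          f"{credit_amount(uptime, 500.0):.2f} on a 500.00 fee")
```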

The organizational challenges nobody talks about

SLOs require someone to own them. An SLO with no named owner degrades into a number that gets updated once a year before board meetings. Someone must review the error budget weekly, triage when it drops, and advocate for reliability work when the budget is at risk.

SLOs create prioritization conflict. When an error budget hits 15%, the right answer is to slow down feature deployments and focus on reliability. Product managers focused on roadmap velocity resist this. The SLO framework only works if engineering leadership is willing to enforce the error budget decision.

SLAs create legal exposure when there is no operational infrastructure to back them. Teams sometimes sign SLA commitments before they have the monitoring to track them, the incident response process to meet them, or the operational reliability to stay within them. The result: an SLA breach happens, the customer requests credits, and the team discovers they have no reliable data to dispute or confirm the claim.

For tracking SLA compliance with reliable data, see uptime SLA monitoring and MTTR, MTTD, MTBF metrics.

Starting from scratch: a minimal viable SLO

If your team has no SLOs yet, start with one:

  1. Pick your most critical user-facing API endpoint (auth, checkout, or core data API)
  2. Define the SLI: percentage of requests returning 2xx over a calendar month
  3. Look at your last 90 days of data. What was your actual success rate?
  4. Set your SLO slightly below what you have reliably achieved, at a level you met in roughly 90% of recent periods (for example, if you've been at 99.8% most months, set the SLO at 99.5%)
  5. Calculate the error budget and post it somewhere visible
  6. Set an alert for when the budget drops below 50%

That's enough to start having meaningful reliability conversations. Expand from there once the team is comfortable with the framework.
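The steps above can be sketched with a request-based error budget, where the budget is a number of failed requests rather than minutes. Everything here is illustrative: the function names are hypothetical, the request counts are made up, and the 99.5% target simply follows the example in step 4.

```python
def success_rate(ok_requests: int, total_requests: int) -> float:
    """Step 2: the SLI, as the fraction of requests returning 2xx."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return ok_requests / total_requests


def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Step 5: how many requests may fail this month without missing the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining_fraction(slo_target: float, total_requests: int,
                              failed_requests: int) -> float:
    """Unspent share of the budget; negative once the SLO is missed."""
    allowed = error_budget_requests(slo_target, total_requests)
    return 1.0 - failed_requests / allowed


def needs_alert(remaining_fraction: float, threshold: float = 0.5) -> bool:
    """Step 6: alert when less than half of the budget is left."""
    return remaining_fraction < threshold


SLO_TARGET = 0.995   # step 4: chosen below the observed ~99.8%
total, failed = 2_000_000, 1_200  # made-up month-to-date counts

sli = success_rate(total - failed, total)
remaining = budget_remaining_fraction(SLO_TARGET, total, failed)
print(f"SLI {sli:.3%}, budget remaining {remaining:.0%}, "
      f"alert: {needs_alert(remaining)}")
```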