What is the difference between SLA, SLO, and SLI?

An SLI (Service Level Indicator) is the raw measurement - for example, the percentage of HTTP requests that return 2xx in a given period. An SLO (Service Level Objective) is your internal target for that measurement - for example, 99.9% success rate. An SLA (Service Level Agreement) is the external contract with customers - for example, 99.5% uptime with service credits if you miss it.

Which is stricter: SLA or SLO?

SLOs should always be stricter than SLAs. If your SLA promises 99.5% uptime, your internal SLO might target 99.9%. The gap between your SLO and SLA is your operational safety margin - it gives your team room to detect and fix problems before they breach the external commitment.

What is an error budget?

An error budget is the amount of downtime or failure permitted by your SLO in a given period. If your SLO is 99.9% monthly availability, your error budget is about 43.8 minutes per month (0.1% of an average 43,800-minute month). When the budget is consumed, engineering priorities shift toward reliability work.

Can a small SaaS team use SLOs without a dedicated SRE team?

Yes. You don't need a dedicated SRE team to use SLOs. Start with one SLI per critical user journey, set a realistic SLO target, and track it in a dashboard. The discipline of measuring and reviewing the number is the value, not the organizational structure around it.

SLA vs SLO vs SLI: Key Differences, Real Examples, and the Mistakes That Break Teams

SLI, SLO, and SLA are three distinct concepts that most teams use interchangeably until something breaks. The conflation is harmless until you are in an incident review trying to explain to a customer whether you breached your SLA, or until engineering and product disagree on whether a deployment should ship because the error budget is at 23%.

This guide defines each concept precisely, shows real examples of how they work together, and covers the specific mistakes that turn a good reliability framework into bureaucracy.

Definitions: the one-paragraph version

SLI (Service Level Indicator): A specific quantitative measurement of your service's behavior. SLIs are raw numbers - a percentage, a latency value, an error count. They answer the question: what did the service actually do?

SLO (Service Level Objective): A target range or threshold for an SLI, set internally by your engineering team. SLOs are commitments to yourselves. They answer: what should the service do?

SLA (Service Level Agreement): A contractual commitment made to external parties - customers, partners, regulators - about service behavior. SLAs carry consequences (refunds, credits, termination rights) when breached. They answer: what did we promise customers?

The relationship: SLIs are what you measure, SLOs are what you target, SLAs are what you commit to externally. Each one feeds into the next.
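To make the relationship concrete, here is a minimal Python sketch that computes a request success-rate SLI and compares it against an SLO and an SLA. The names (ServiceLevels, success_rate_sli, evaluate) and the 99.9% / 99.5% targets are illustrative assumptions, not part of any standard tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    slo_target: float  # internal objective as a fraction, e.g. 0.999
    sla_target: float  # contractual commitment as a fraction, e.g. 0.995


def success_rate_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully successful
    return good_requests / total_requests


def evaluate(sli: float, levels: ServiceLevels) -> str:
    """Compare the measured SLI with the internal target and the external promise."""
    if sli < levels.sla_target:
        return "SLA breached"
    if sli < levels.slo_target:
        return "SLO missed, SLA still met"
    return "SLO met"


levels = ServiceLevels(slo_target=0.999, sla_target=0.995)
sli = success_rate_sli(good_requests=999_400, total_requests=1_000_000)
print(f"SLI {sli:.4%}: {evaluate(sli, levels)}")
```

The middle outcome, "SLO missed, SLA still met", is the zone where the team should react before customers are owed anything.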

Side-by-side comparison

Dimension	SLI	SLO	SLA
What it is	Raw measurement	Internal target	External contract
Who sets it	Engineering	Engineering + Product	Business + Legal
Audience	Engineering team	Engineering + Leadership	Customers / Partners
Consequence if missed	Data point	Engineering priority shift	Refunds, credits, churn
Example	99.94% request success rate	99.9% request success rate	99.5% uptime
Typical strictness	Actual behavior	Tighter than SLA	More lenient than SLO

Real examples from common SaaS categories

E-commerce / transactional SaaS

Layer	SLI	SLO	SLA
Checkout availability	% of checkout requests returning 2xx	99.95% per month	99.5% per month
Payment processing latency	p99 latency for payment API	Under 2,000ms	Under 5,000ms
Order confirmation email	% of confirmation emails delivered in 5 min	99.5%	99%

API-first SaaS

Layer	SLI	SLO	SLA
API availability	% successful API responses	99.9% per calendar month	99.5% per calendar month
API latency	p95 response time across all endpoints	Under 500ms	Under 2,000ms
Webhook delivery	% of webhooks delivered within 30 seconds	99.5%	99%

Infrastructure / platform SaaS

Layer	SLI	SLO	SLA
Data plane availability	% of read/write operations succeeding	99.99%	99.9%
Job execution reliability	% of scheduled jobs completing without error	99.5%	99%
Dashboard availability	% of dashboard page loads returning 200	99.5%	99%

The SLO-SLA gap: why it exists and how big it should be

The gap between your SLO (internal target) and SLA (external commitment) is your operational margin. It gives your team room to detect and fix problems before they become contractual breaches.

The gap should reflect your actual operational capability:

Operational maturity	Suggested SLO-SLA gap
No dedicated SRE team, manual incident response	0.4 to 0.5 percentage points (e.g., SLA 99.5%, SLO 99.9%)
On-call rotation, automated alerting	0.2 to 0.3 percentage points
Dedicated SRE team, mature incident response	0.05 to 0.1 percentage points

Setting the gap too small (SLO ≈ SLA) means a single significant incident immediately threatens your contractual commitment. Setting it too large means your SLA is so lenient it provides no meaningful customer protection.
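A gap expressed in percentage points is easier to reason about once it is converted into time. The sketch below does that conversion for the gaps in the table above, assuming a 43,800-minute average month; gap_to_minutes is a hypothetical helper name.

```python
MINUTES_PER_MONTH = 43_800  # assumption: average month (365 days / 12)


def gap_to_minutes(gap_percentage_points: float,
                   period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Convert an SLO-SLA gap in percentage points into minutes of margin per period."""
    return gap_percentage_points / 100.0 * period_minutes


# Gaps taken from the maturity table; labels are shortened for display.
for label, gap in [("manual response", 0.4),
                   ("on-call + alerting", 0.2),
                   ("dedicated SRE", 0.05)]:
    print(f"{label}: {gap} pp gap = {gap_to_minutes(gap):.1f} minutes of margin per month")
```

A 0.4 point gap works out to roughly 175 minutes a month between missing the SLO and breaching the SLA, while a 0.05 point gap leaves about 22 minutes, which is why the smaller gaps only suit teams with fast detection and response.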

Error budgets: making SLOs usable

An error budget is the allowable failure in a given period derived from your SLO. It converts an abstract percentage into a concrete operational resource.

Error budget calculation:

Error budget = (1 - SLO target) × period duration

For a 99.9% monthly SLO:

Monthly period = 43,800 minutes (average month: 365 days ÷ 12 × 1,440 minutes; a fixed 30-day window would be 43,200)
Error budget = 0.001 × 43,800 = 43.8 minutes per month

For a 99.5% monthly SLO:

Error budget = 0.005 × 43,800 = 219 minutes per month
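The same arithmetic as a small Python sketch, plus a helper that reports how much of the budget is left after some downtime. The function names are illustrative, and the period length is passed in explicitly so you can use a calendar month or a rolling window.

```python
def error_budget_minutes(slo_target: float, period_minutes: float) -> float:
    """Error budget = (1 - SLO target) x period duration."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * period_minutes


def budget_remaining_fraction(slo_target: float, period_minutes: float,
                              downtime_minutes: float) -> float:
    """Fraction of the budget still unspent, clamped at 0.0 once it is exhausted."""
    budget = error_budget_minutes(slo_target, period_minutes)
    return max(0.0, 1.0 - downtime_minutes / budget)


print(round(error_budget_minutes(0.999, 43_800), 1))  # 99.9% SLO
print(round(error_budget_minutes(0.995, 43_800), 1))  # 99.5% SLO
print(round(budget_remaining_fraction(0.999, 43_800, downtime_minutes=21.9), 2))
```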

How teams use error budgets:

Budget > 50% remaining: Engineering can deploy freely, experiment with reliability tradeoffs
Budget 20–50% remaining: Deploy with caution, require a post-incident review for every incident
Budget < 20% remaining: Freeze non-critical deployments, prioritize reliability fixes
Budget exhausted: Stop new feature work until reliability is restored

Error budgets answer the question that causes the most team friction: "should we ship this risky change?" When the budget is healthy, yes. When it's nearly gone, no. The number removes the political dimension from the decision.
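The thresholds above translate directly into a policy function. This is a sketch under the assumption that the boundaries fall as listed (exactly 50% and exactly 20% both land in the "caution" band); deployment_policy is a hypothetical name.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining (0.0 to 1.0) to a deployment policy."""
    if budget_remaining <= 0.0:
        return "stop feature work"  # budget exhausted
    if budget_remaining < 0.20:
        return "freeze non-critical deployments"
    if budget_remaining <= 0.50:
        return "deploy with caution"
    return "deploy freely"


for remaining in (0.80, 0.23, 0.15, 0.0):
    print(f"{remaining:.0%} remaining: {deployment_policy(remaining)}")
```

Encoding the policy, even this crudely, means the decision is made before the incident rather than argued during it.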

For monitoring how your error budget tracks in real time, see SLI, SLO, SLA implementation guide and uptime SLA monitoring.

Choosing the right SLIs

The most common SLI mistake is measuring the wrong thing. The right SLI measures what users experience, not what your infrastructure reports.

User-facing SLIs (measure these):

Request success rate (percentage of requests returning expected response)
Request latency (p50, p95, p99 latency for user-initiated requests)
Availability (percentage of time the service responds to checks)
Error rate (percentage of requests resulting in error responses)

Infrastructure SLIs (measure as diagnostics, not SLOs):

CPU utilization
Memory usage
Database connection pool saturation
Queue depth

CPU at 90% is not an SLI - users don't experience CPU utilization. Users experience slow responses, which you capture with a latency SLI. Infrastructure metrics are diagnostic signals that explain why an SLI degraded, not SLIs themselves.
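As an illustration of a user-facing latency SLI, the sketch below computes a percentile from raw request latencies using the nearest-rank method, plus the share of requests under a threshold. Both helpers are illustrative; production monitoring systems usually compute percentiles from histograms, which is not shown here.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Latency SLI expressed as the share of requests at or under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    return sum(1 for latency in latencies_ms if latency <= threshold_ms) / len(latencies_ms)


# Hypothetical sample of user-initiated request latencies in milliseconds.
sample_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(percentile(sample_ms, 50), percentile(sample_ms, 95))
print(fraction_within(sample_ms, threshold_ms=500))
```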

What good SLOs look like vs common failure modes

Good SLO characteristics:

Tied to a specific user journey or service function, not a generic "uptime" claim
Measurable automatically from existing monitoring data
Reviewed and updated quarterly as traffic patterns change
Set with input from the people who will be paged when they are breached

Common SLO mistakes:

Mistake	What it looks like	Why it fails
SLO equals SLA	"We promise 99.9%, so we target 99.9%"	No operational margin; any significant incident immediately breaches the customer commitment
SLO too ambitious	Setting 99.999% when historical availability is 99.7%	The SLO is breached constantly, teams start ignoring alerts, the metric loses credibility
Wrong SLI	Measuring "server health" instead of user request success rate	Infrastructure looks fine while user experience degrades
No review cadence	Setting SLOs and never revisiting them	SLOs drift from reality as traffic patterns, dependencies, and architecture change
Too many SLOs	Tracking 40 SLOs per service	Alert fatigue, confusion about priorities, no clear "is this a problem?" signal

The right number of SLOs: 2 to 4 per critical service. One availability SLO, one latency SLO, and optionally one for a business-critical operation (checkout, auth, data processing). More than that creates noise.

SLA design: what to include and what to exclude

SLAs are legal documents as much as technical ones. How you structure them determines whether they create accountability or create liability.

What to specify clearly:

The specific services and components covered
The measurement method and window (calendar month vs rolling 30 days)
What counts as downtime (how many probe locations must fail, what response codes count)
Notification requirements on your end
Credit calculation formula and maximum credit amount
Credit claim process and timeline
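"What counts as downtime" is worth writing down as an unambiguous rule. The sketch below shows one possible definition: a check interval counts as downtime only when at least a minimum number of probe locations fail. The rule that timeouts and 5xx responses count as failures, the default quorum of two, and the location names are all assumptions for illustration; your SLA should state its own.

```python
from typing import Mapping, Optional


def probe_failed(status_code: Optional[int]) -> bool:
    """Assumed rule: a timeout (None) or any 5xx response counts as a failed probe."""
    return status_code is None or status_code >= 500


def is_downtime(probe_statuses: Mapping[str, Optional[int]],
                min_failing_locations: int = 2) -> bool:
    """A check interval is downtime when enough probe locations fail at once."""
    failing = sum(1 for status in probe_statuses.values() if probe_failed(status))
    return failing >= min_failing_locations


# Hypothetical probe results for one check interval, keyed by probe location.
print(is_downtime({"us-east": 200, "eu-west": 503, "ap-south": 200}))   # one location failing
print(is_downtime({"us-east": None, "eu-west": 503, "ap-south": 200}))  # two locations failing
```

Requiring more than one failing location keeps a single flaky probe or regional network issue from being counted against the SLA.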

Exclusions that belong in every SLA:

Scheduled maintenance windows (defined in advance with customer notification)
Incidents caused by customer actions or third-party services outside your control
Force majeure events
Incidents during customer's free trial period
Incidents that occurred but were not reported within a defined window

Credit structure example:

Monthly uptime	Credit
99.5% to 99.9% (SLA threshold to SLO threshold)	None - within SLA
99.0% to 99.5%	10% of monthly fee
95.0% to 99.0%	25% of monthly fee
90.0% to 95.0%	50% of monthly fee
Below 90.0%	100% of monthly fee

Maximum annual credit is typically capped at 3 to 6 months of fees to limit liability exposure.
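A credit formula is easiest to audit when it is written as a lookup. The sketch below mirrors the tiers in the example table; the boundary handling (each tier's lower bound is inclusive, so exactly 99.5% earns no credit) is an assumption the table itself does not settle, and the function names are hypothetical.

```python
# (uptime threshold in percent, credit as percent of monthly fee), checked lowest first.
CREDIT_TIERS = [
    (90.0, 100),
    (95.0, 50),
    (99.0, 25),
    (99.5, 10),
]


def sla_credit_percent(monthly_uptime_percent: float) -> int:
    """Return the credit owed as a percentage of the monthly fee."""
    for threshold, credit in CREDIT_TIERS:
        if monthly_uptime_percent < threshold:
            return credit
    return 0  # at or above the 99.5% SLA threshold: no credit


def sla_credit_amount(monthly_uptime_percent: float, monthly_fee: float) -> float:
    """Credit in currency units for one month, before any annual cap is applied."""
    return monthly_fee * sla_credit_percent(monthly_uptime_percent) / 100


print(sla_credit_percent(99.2), sla_credit_amount(99.2, monthly_fee=500.0))
```

An annual cap, as described above, would be applied on top of this per-month calculation.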

The organizational challenges nobody talks about

SLOs require someone to own them. An SLO with no named owner degrades into a number that gets updated once a year before board meetings. Someone must review the error budget weekly, triage when it drops, and advocate for reliability work when the budget is at risk.

SLOs create prioritization conflict. When an error budget hits 15%, the right answer is to slow down feature deployments and focus on reliability. Product managers focused on roadmap velocity resist this. The SLO framework only works if engineering leadership is willing to enforce the error budget decision.

SLAs create legal exposure without the operational infrastructure to back them. Teams sometimes sign SLA commitments before they have the monitoring to track them, the incident response process to meet them, or the operational reliability to stay within them. The result: an SLA breach happens, the customer requests credits, and the team discovers they have no reliable data to dispute or confirm the claim.

For tracking SLA compliance with reliable data, see uptime SLA monitoring and MTTR, MTTD, MTBF metrics.

Starting from scratch: a minimal viable SLO

If your team has no SLOs yet, start with one:

Pick your most critical user-facing API endpoint (auth, checkout, or core data API)
Define the SLI: percentage of requests returning 2xx over a calendar month
Look at your last 90 days of data. What was your actual success rate?
Set your SLO slightly below what you have reliably achieved, at a level you met in roughly 90% of past periods (for example, if you've been at 99.8% most months, set the SLO at 99.5%)
Calculate the error budget and post it somewhere visible
Set an alert for when the budget drops below 50%

That's enough to start having meaningful reliability conversations. Expand from there once the team is comfortable with the framework.
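The steps above can be sketched in a few lines of Python for a request-based SLI. Here the error budget is counted in failed requests rather than minutes, and the 50% alert threshold follows the last step. The names (BudgetStatus, budget_status) and the traffic numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    sli: float               # measured success rate for the window
    allowed_failures: float  # error budget expressed in failed requests
    budget_remaining: float  # fraction of budget left; negative if overspent
    alert: bool              # True when remaining budget is below the alert threshold


def budget_status(total_requests: int, failed_requests: int,
                  slo_target: float, alert_below: float = 0.5) -> BudgetStatus:
    """Summarize SLI and error budget for one window of request counts."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return BudgetStatus(sli, allowed_failures, budget_remaining,
                        budget_remaining < alert_below)


# Hypothetical month: 2,000,000 requests, 6,000 non-2xx responses, SLO of 99.5%.
status = budget_status(total_requests=2_000_000, failed_requests=6_000, slo_target=0.995)
print(f"SLI {status.sli:.3%}, budget remaining {status.budget_remaining:.0%}, "
      f"alert={status.alert}")
```

Running something like this against the last 90 days of data is usually enough to see whether the proposed target is realistic before committing to it.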