SLI, SLO, and SLA: What They Mean and How to Implement Them

SLI (Service Level Indicator) is a quantitative measurement of your service's behavior: request success rate, latency, availability percentage. An SLI answers the question "what is the service doing right now?"

SLO (Service Level Objective) is the target you set for an SLI: "99.9% of requests succeed," "p99 latency stays under 500ms." An SLO is an internal commitment to your team.

SLA (Service Level Agreement) is a contract with your customers that defines minimum service levels and the consequences (typically credits) when you fall short. An SLA is an external commitment.

Teams often confuse these three because the terms sound similar and all involve percentages. Getting them right is the foundation of any reliability program that works.

The Relationship Between SLI, SLO, and SLA

Think of them as a hierarchy:

SLI → what you measure
SLO → what you target internally
SLA → what you promise externally

Your SLO should always be stricter than your SLA. If your SLA promises 99.9% uptime, your internal SLO should target 99.95% or better. The buffer between your SLO and SLA is your margin: the space that absorbs unexpected incidents before you breach your customer commitment.

A team that sets its SLO equal to its SLA is one bad week away from a contract violation.

SLIs: Choosing What to Measure

An SLI only matters if it reflects something your users actually experience. The most common mistake is measuring what's easy to collect (CPU usage, request count) rather than what indicates user experience.

Four SLI categories that map to user experience:

Category	What it measures	Example
Availability	Is the service responding at all?	% of HTTP requests returning 2xx or 3xx
Latency	Is it fast enough?	% of requests completing under 500ms
Error rate	Is it returning correct results?	% of requests not returning 5xx errors
Throughput	Is it handling the load?	Requests processed per second vs. expected

Availability SLI formula:

Availability = (Successful requests / Total requests) × 100

Where "successful" means the response was correct: right status code, valid content, within acceptable time.
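As a minimal sketch of this formula, assume each request is logged with a status code and a duration. The `Request` record, the `is_successful` rule, and the 500 ms limit are illustrative assumptions, not a prescribed schema, and the sketch does not check response content:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def is_successful(req: Request, max_ms: float = 500.0) -> bool:
    """Assumed success rule: 2xx/3xx status and within the time limit."""
    return 200 <= req.status < 400 and req.duration_ms <= max_ms


def availability_sli(requests: list[Request]) -> float:
    """Availability = successful requests / total requests x 100."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if is_successful(r))
    return 100 * good / len(requests)


sample = [
    Request(200, 120),
    Request(200, 90),
    Request(503, 30),    # fails: server error
    Request(200, 2400),  # fails: right status, but far too slow
]
print(availability_sli(sample))  # 2 of 4 succeed -> 50.0
```

Note that the fourth request returns 200 and still counts as a failure, which is the point of defining "successful" from the user's side.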

Latency SLI formula:

Latency SLI = (Requests completing under threshold / Total requests) × 100

Most teams also track p50 (median), p95, and p99 latency. The p99 matters most for user experience because it bounds the slowest 1% of requests: the users who keep showing up in your support tickets.
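The sketch below computes both views of latency from a list of request durations: the threshold-based SLI from the formula above, and percentiles. The function names are illustrative, and `percentile` uses the simple nearest-rank method; monitoring tools often interpolate or use histograms, so their numbers may differ slightly.

```python
import math


def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Percentage of requests that completed under the threshold."""
    if not durations_ms:
        raise ValueError("no requests in the window")
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return 100 * fast / len(durations_ms)


def percentile(durations_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not durations_ms:
        raise ValueError("no requests in the window")
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


durations = [80, 95, 110, 120, 130, 150, 180, 240, 420, 1900]
print(latency_sli(durations, threshold_ms=500))  # 9 of 10 under 500 ms -> 90.0
print(percentile(durations, 50))                 # 130
print(percentile(durations, 99))                 # 1900
```

The median looks healthy here while the p99 exposes the one request that took nearly two seconds.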

Choosing SLI Windows

SLIs are measured over a window of time. Common windows:

Rolling 7 days: Most sensitive; recent incidents weigh heavily
Rolling 28–30 days: Standard for monthly SLA reporting
Calendar month: Aligns with billing cycles; simplest to communicate to customers

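Most teams use a 28-day rolling window for their SLO tracking and a calendar month for SLA reporting. A 28-day window always contains exactly four of each weekday, so weekly traffic patterns don't skew comparisons between windows.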
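To see how window length changes what an SLI reports, here is a small sketch that aggregates daily request counts over a rolling window. The `daily` mapping of date to (successful, total) counts is an assumed data shape for illustration.

```python
from datetime import date, timedelta


def rolling_availability(
    daily: dict[date, tuple[int, int]], end: date, days: int = 28
) -> float:
    """Availability over the `days` ending on `end` (inclusive).

    `daily` maps a date to (successful_requests, total_requests).
    """
    start = end - timedelta(days=days - 1)
    good = total = 0
    for day, (ok, count) in daily.items():
        if start <= day <= end:
            good += ok
            total += count
    if total == 0:
        raise ValueError("no traffic in the window")
    return 100 * good / total


# 35 days of traffic, 1000 requests per day, with two bad days.
first = date(2024, 1, 1)
daily = {first + timedelta(days=i): (1000, 1000) for i in range(35)}
daily[date(2024, 1, 3)] = (500, 1000)   # old incident, outside both windows
daily[date(2024, 1, 30)] = (972, 1000)  # recent incident
end = date(2024, 2, 4)

print(rolling_availability(daily, end, days=28))  # 99.9
print(rolling_availability(daily, end, days=7))   # 99.6
```

The same recent incident costs 0.1 points in the 28-day window and 0.4 points in the 7-day window, which is why short windows feel more sensitive.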

SLOs: Setting the Right Targets

A well-chosen SLO is ambitious enough to require real engineering effort but realistic enough to actually achieve. Too high and you spend all your time on reliability instead of features. Too low and outages become normalized.

The 99.9% Question

99.9% availability allows 43.2 minutes of downtime in a 30-day month. For most B2B SaaS applications, this is a reasonable starting point.

Uptime target	Downtime per 30-day month	Suitable for
99%	7 hours 12 min	Non-critical internal tools
99.5%	3 hours 36 min	Lower-criticality B2B features
99.9%	43.2 minutes	Standard B2B SaaS
99.95%	21.6 minutes	SLA-driven enterprise
99.99%	4.3 minutes	Payment processors, auth systems

For a full breakdown of nines with exact downtime calculations, see SLA nines explained.
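The downtime figures for any target follow from one line of arithmetic. This sketch assumes a 30-day month by default; using an average calendar month (365.25 / 12, about 30.44 days) gives slightly larger figures, such as 43.8 minutes at 99.9%. The function name is illustrative.

```python
def allowed_downtime_minutes(target_pct: float, days: float = 30) -> float:
    """Minutes of downtime a target permits over `days` days."""
    return (1 - target_pct / 100) * days * 24 * 60


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min per 30 days")

# Same target over an average calendar month instead of 30 days.
print(f"{allowed_downtime_minutes(99.9, days=365.25 / 12):.1f}")  # 43.8
```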

SLO Tiers for Different Services

Not all services in your product deserve the same SLO. A tiered approach matches engineering investment to actual user impact:

Tier 1 (99.95%+ SLO): Authentication, checkout, payment processing, primary API. These are in the critical path of user value.

Tier 2 (99.9% SLO): Core product features, data ingestion, user dashboard. Important but not immediately transactional.

Tier 3 (99.5% SLO): Admin interfaces, reporting, analytics dashboards. Degradation is noticed but not immediately business-critical.

Tier 4 (99% SLO): Internal tools, non-customer-facing services. Lower stakes.

Documenting SLOs

An SLO document should specify:

The SLI being measured (what metric, from where)
The target percentage
The measurement window (28-day rolling is standard)
The exclusions (planned maintenance, force majeure)
The owner (which team is responsible)
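One way to keep these fields from drifting is to store each SLO as structured data next to the service's code. The sketch below is one possible shape; the `SLODefinition` class, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    name: str
    sli: str                          # what metric, measured from where
    target_pct: float                 # e.g. 99.9
    owner: str                        # team responsible for the SLO
    window_days: int = 28             # rolling measurement window
    exclusions: tuple[str, ...] = ()  # e.g. planned maintenance

    def __post_init__(self) -> None:
        # Reject definitions that cannot be measured or have no owner.
        if not 0 < self.target_pct < 100:
            raise ValueError("target_pct must be between 0 and 100, exclusive")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")
        if not self.owner:
            raise ValueError("every SLO needs an owning team")


checkout_slo = SLODefinition(
    name="checkout-availability",
    sli="share of external probe checks to the checkout endpoint returning 2xx",
    target_pct=99.95,
    owner="payments-team",
    exclusions=("planned maintenance", "force majeure"),
)
print(checkout_slo.target_pct, checkout_slo.window_days)
```

Validating at construction time means an SLO with no owner or an impossible 100% target fails loudly instead of sitting unnoticed in a wiki.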

Error Budgets: Turning SLOs into Engineering Decisions

An error budget is the amount of downtime or errors your SLO permits over the measurement window. It's the practical tool that makes SLOs actionable.

Error budget formula:

Error budget = (1 - SLO target) × measurement window in minutes

For a 99.9% SLO over 30 days (43,200 minutes):

Error budget = (1 - 0.999) × 43,200 = 43.2 minutes

That gives you 43.2 minutes of downtime per 30-day window before you breach your SLO.
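The same calculation as a short sketch, extended to show how much budget is left after some downtime. The function names and the 12 minutes of downtime are illustrative assumptions.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Error budget = (1 - SLO target) x measurement window in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target_pct / 100) * window_minutes


def budget_remaining_pct(
    slo_target_pct: float, window_days: int, downtime_minutes: float
) -> float:
    """Share of the error budget still unspent, floored at zero."""
    budget = error_budget_minutes(slo_target_pct, window_days)
    remaining = max(budget - downtime_minutes, 0.0)
    return 100 * remaining / budget


budget = error_budget_minutes(99.9, window_days=30)
print(f"budget: {budget:.1f} min")  # budget: 43.2 min

# After a hypothetical 12 minutes of downtime in the window:
left = budget_remaining_pct(99.9, 30, downtime_minutes=12)
print(f"remaining: {left:.1f}%")    # remaining: 72.2%
```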

Using Error Budgets to Guide Decisions

The error budget creates a natural feedback loop between reliability and feature work:

Budget is healthy (>50% remaining): Engineering can take more risk. Deploy features aggressively, accept more experimental changes.
Budget is moderate (20–50% remaining): Moderate caution. Review deployment frequency and rollback practices.
Budget is low (<20%): Slow down. Focus engineering effort on reliability, not features. Postpone risky changes.
Budget is exhausted: Full stop on new deployments until reliability is restored. Conduct a postmortem. Run a reliability sprint.

This is a policy, not a case-by-case decision. Teams that define the policy in advance avoid the arguments about "is now a good time to deploy?" during stressful moments.
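Writing the policy down as code is one way to make it unambiguous. The sketch below maps the remaining budget to the four stances described above; the `Policy` names are hypothetical, and treating everything from 20% up to and including 50% as the cautious band is an assumption about where the boundaries fall.

```python
from enum import Enum


class Policy(Enum):
    SHIP_FREELY = "deploy aggressively, accept experimental changes"
    CAUTION = "review deployment frequency and rollback practices"
    RELIABILITY_FOCUS = "slow down, postpone risky changes"
    FREEZE = "stop deployments, run a postmortem and a reliability sprint"


def deployment_policy(budget_remaining_pct: float) -> Policy:
    """Map remaining error budget (0-100) to a deployment stance."""
    if budget_remaining_pct <= 0:
        return Policy.FREEZE
    if budget_remaining_pct < 20:
        return Policy.RELIABILITY_FOCUS
    if budget_remaining_pct <= 50:  # assumed band: 20% up to 50%
        return Policy.CAUTION
    return Policy.SHIP_FREELY


for pct in (80, 35, 10, 0):
    print(pct, deployment_policy(pct).name)
```

A CI pipeline could call a function like this before a deploy and block or warn based on the result, so the decision never depends on who is in the room.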

SLAs: Making External Commitments

An SLA is a legal agreement. Before offering one, you need to know two things: what your SLO actually is (so you know what you can promise) and what consequences your business can accept (so you know what credits to offer).

SLA Credit Structures

Most SaaS SLAs compensate customers with monthly service credits when uptime falls below the promised level. For a 99.9% commitment:

Uptime achieved	Credit
99.0% to below 99.9%	10% of monthly fee
95.0% to below 99.0%	25% of monthly fee
Below 95.0%	50% of monthly fee

This is a common structure. Your legal team and pricing model determine the specific thresholds and percentages.
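A credit schedule like the one above reduces to a small lookup. This sketch assumes a promised uptime of 99.9% (implied by where the first credit tier starts) and treats each tier's lower bound as inclusive; both are assumptions your contract would need to state explicitly. The function names are illustrative.

```python
def service_credit_pct(uptime_pct: float, promised_pct: float = 99.9) -> int:
    """Credit as a percentage of the monthly fee for the achieved uptime."""
    if uptime_pct >= promised_pct:
        return 0    # SLA met, no credit owed
    if uptime_pct >= 99.0:
        return 10
    if uptime_pct >= 95.0:
        return 25
    return 50


def credit_amount(monthly_fee: float, uptime_pct: float) -> float:
    """Credit in currency units for one billing month."""
    return monthly_fee * service_credit_pct(uptime_pct) / 100


for uptime in (99.95, 99.5, 97.0, 90.0):
    print(uptime, service_credit_pct(uptime))

print(credit_amount(2000.0, 99.5))  # 10% of 2000 -> 200.0
```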

What SLAs Should Exclude

Standard SLA exclusions:

Scheduled maintenance windows (with adequate advance notice, typically 24–72 hours)
Force majeure events
Customer-caused outages (misconfigured webhooks, API abuse)
Third-party provider outages outside your control
Failures caused by beta or preview features

Measuring SLA Compliance

You cannot report on SLA compliance without monitoring data. You need:

External monitoring that tracks availability from outside your infrastructure (synthetic probes, not internal health checks)
Per-incident duration data with accurate start and end timestamps
Planned maintenance records to exclude from availability calculations
Historical data retention long enough to cover your reporting period

Internal metrics from APM tools or application logs are insufficient for SLA reporting because they only capture what your application sees. An outage caused by DNS failure, network connectivity, or a broken load balancer may not appear in application metrics at all. External monitoring from multiple geographic locations is the only source that captures the same view your customers have.
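Given incident timestamps and maintenance records, SLA uptime for a period can be computed roughly as follows. This is a sketch under stated assumptions: incidents do not overlap each other, maintenance windows do not overlap each other, and downtime inside a maintenance window is simply not counted (the period length stays unchanged). Your SLA terms may define exclusions differently.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]  # (start, end)


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of the intersection of two intervals (zero if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def sla_uptime_pct(
    period: Interval, incidents: list[Interval], maintenance: list[Interval]
) -> float:
    """Uptime over the period, ignoring downtime inside maintenance windows."""
    total = period[1] - period[0]
    downtime = timedelta(0)
    for incident in incidents:
        counted = overlap(incident, period)  # clip to the reporting period
        if counted == timedelta(0):
            continue
        clipped = (max(incident[0], period[0]), min(incident[1], period[1]))
        excluded = sum((overlap(clipped, m) for m in maintenance), timedelta(0))
        downtime += counted - excluded
    return 100 * (total - downtime) / total


april = (datetime(2024, 4, 1), datetime(2024, 5, 1))
incidents = [
    (datetime(2024, 4, 10, 14, 0), datetime(2024, 4, 10, 14, 30)),  # 30 min outage
    (datetime(2024, 4, 20, 2, 0), datetime(2024, 4, 20, 3, 0)),     # during maintenance
]
maintenance = [(datetime(2024, 4, 20, 2, 0), datetime(2024, 4, 20, 4, 0))]

print(round(sla_uptime_pct(april, incidents, maintenance), 3))  # 99.931
```

Only the 30-minute outage counts, so April comes in above a 99.9% commitment; without the maintenance record the same month would show 90 minutes of downtime and a breach.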

SLO Monitoring in Practice

Tracking SLO Burn Rate

A burn rate tells you how fast you're consuming your error budget relative to the allowed rate. An SLO with a 28-day window burns at "1x" when it's on track to use exactly the error budget in 28 days.

If your burn rate is 2x, you'll exhaust the error budget in 14 days instead of 28: errors are accumulating twice as fast as the budget allows.

Burn rate formula:

Burn rate = Current error rate / (1 - SLO target)
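A short sketch of the formula, with both inputs expressed as fractions (0.002 for a 0.2% error rate, 0.999 for a 99.9% SLO). The function names are illustrative, and `days_to_exhaustion` assumes a full budget at the start and a constant burn rate.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Current error rate divided by the error rate the SLO allows."""
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days / rate


rate = burn_rate(error_rate=0.002, slo_target=0.999)
print(round(rate, 2))                         # 2.0
print(round(days_to_exhaustion(rate), 1))     # 14.0
```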

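A burn rate of 1x or 2x over a single hour consumes well under 1% of a 28-day budget, so short windows need a high threshold before they justify a page. A commonly used starting point, taken from Google's SRE Workbook for a 30-day window, is to page at 14.4x over 1 hour, which burns 2% of the budget.

Alerting on SLO Violations

Don't alert on every SLI measurement. Alert on budget burn rate:

Fast burn alert: Error rate 14.4x+ the budget rate over the past 60 minutes (about 2% of a 30-day budget). High priority, page the on-call engineer.
Slow burn alert: Error rate 6x+ the budget rate over the past 6 hours (about 5% of a 30-day budget). Medium priority, alert the team.
Budget exhausted: SLO has been breached this period. Highest priority.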

This approach reduces alert noise significantly compared to raw error rate alerting.
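The alert tiers can be sketched as a small classifier over two burn-rate measurements. Thresholds are passed in rather than hard-coded, because they are a policy choice; the values in the example are illustrative, and the function and label names are hypothetical. The helper `budget_consumed_pct` shows how much budget a given burn rate uses up over a window, which is a useful sanity check when picking thresholds.

```python
from typing import Optional


def budget_consumed_pct(burn: float, hours: float, window_days: int) -> float:
    """Share of the whole error budget spent by `burn` sustained for `hours`."""
    return 100 * burn * hours / (window_days * 24)


def classify_alert(
    burn_1h: float,
    burn_6h: float,
    budget_remaining_pct: float,
    fast_threshold: float,
    slow_threshold: float,
) -> Optional[str]:
    """Return the most severe alert that applies, or None."""
    if budget_remaining_pct <= 0:
        return "budget_exhausted"  # highest priority
    if burn_1h >= fast_threshold:
        return "fast_burn"         # page the on-call engineer
    if burn_6h >= slow_threshold:
        return "slow_burn"         # notify the team
    return None


# Illustrative thresholds; substitute your own policy.
FAST, SLOW = 14.4, 6.0

print(classify_alert(20.0, 4.0, 60.0, FAST, SLOW))  # fast_burn
print(classify_alert(3.0, 7.5, 60.0, FAST, SLOW))   # slow_burn
print(classify_alert(1.2, 1.1, 60.0, FAST, SLOW))   # None
print(round(budget_consumed_pct(FAST, hours=1, window_days=30), 2))  # 2.0
```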

Building an SLO Dashboard

An effective SLO dashboard shows:

Current availability percentage (rolling 28 days)
Error budget remaining (% and absolute time)
Current burn rate
Time to budget exhaustion at current rate
Incident history for the period

Vantaj's uptime percentage tracking and incident history provide the raw data for this. For the error budget calculations, most teams build a lightweight dashboard in their observability platform using the uptime data as input.
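The dashboard figures listed above can all be derived from three inputs: the SLO target, the downtime recorded so far in the window, and a recent error rate. The sketch below assumes those inputs are already available from your monitoring data; the `SLOSnapshot` class and `build_snapshot` function are hypothetical, and the sketch treats time-based downtime and the recent error fraction as interchangeable measures of budget spend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLOSnapshot:
    availability_pct: float
    budget_remaining_pct: float
    budget_remaining_minutes: float
    burn_rate: float
    hours_to_exhaustion: Optional[float]  # None when nothing is burning


def build_snapshot(
    slo_target_pct: float,
    window_days: int,
    downtime_minutes: float,
    recent_error_rate: float,  # fraction, e.g. 0.002 for 0.2%
) -> SLOSnapshot:
    window_minutes = window_days * 24 * 60
    allowed = 1 - slo_target_pct / 100          # allowed error fraction
    budget = allowed * window_minutes           # budget in minutes
    remaining = max(budget - downtime_minutes, 0.0)
    hours_left = None
    if recent_error_rate > 0:
        # Each wall-clock minute spends `recent_error_rate` minutes of budget.
        hours_left = remaining / recent_error_rate / 60
    return SLOSnapshot(
        availability_pct=100 * (window_minutes - downtime_minutes) / window_minutes,
        budget_remaining_pct=100 * remaining / budget,
        budget_remaining_minutes=remaining,
        burn_rate=recent_error_rate / allowed,
        hours_to_exhaustion=hours_left,
    )


snap = build_snapshot(99.9, window_days=28, downtime_minutes=10.08,
                      recent_error_rate=0.002)
print(f"availability {snap.availability_pct:.3f}%")        # 99.975%
print(f"budget left  {snap.budget_remaining_pct:.0f}%")    # 75%
print(f"burn rate    {snap.burn_rate:.1f}x")               # 2.0x
print(f"exhausted in {snap.hours_to_exhaustion:.0f} h")    # 252 h
```

With 75% of the budget left and a 2x burn rate, the budget runs out in 252 hours (10.5 days), which matches 0.75 × 28 days ÷ 2.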

Common Mistakes

Setting SLOs without measuring SLIs first. You can't set a meaningful target without baseline data. Run your monitoring for 30 days and look at actual availability before committing to a number.

Setting the SLA equal to the SLO. Always keep a buffer. Your SLA should be at least 0.05–0.1 percentage points below your SLO.

Measuring from internal health checks only. Internal health checks miss DNS failures, CDN issues, and network problems. External synthetic monitoring is required for accurate SLA reporting.

Ignoring planned maintenance. Every minute your service is unavailable counts against your SLO unless you have a formal maintenance window process with adequate customer notice.

Never reviewing the SLO. SLOs become meaningless if they're set once and never adjusted. Review them when you change architecture, when you hit a new reliability milestone, or when your SLA commitments change.

Frequently Asked Questions

What is the difference between SLA and SLO?

An SLA (Service Level Agreement) is a contract with customers defining minimum uptime commitments and the consequences of falling short, typically in the form of service credits. An SLO (Service Level Objective) is an internal target your team sets for itself, and it should always be stricter than the SLA. The gap between SLO and SLA is your buffer against contract breaches.

What is an error budget?

An error budget is the amount of downtime or errors your SLO permits over a given measurement window. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes. When that budget runs out, your SLO is breached. Error budgets translate abstract percentages into concrete time, making them easier to reason about in engineering decisions.

What is an SLI?

An SLI (Service Level Indicator) is the specific metric you measure to track service health: request success rate, response latency, error rate, or availability percentage. SLIs are the raw data that SLOs are defined against. A good SLI directly reflects what users experience, not just what's easy to measure from your infrastructure.

How do I calculate SLA uptime?

Uptime percentage = (total time in period - total downtime in period) / total time in period × 100. For accurate SLA reporting, use external synthetic monitoring data, not internal metrics. Internal metrics can miss outages caused by DNS, network, or load balancer issues that are invisible to your application. Exclude any planned maintenance windows that were communicated in advance per your SLA terms.

What SLO should I start with?

99.9% availability is a reasonable starting point for most B2B SaaS applications. Before committing, measure your actual current availability for 30 days using external monitoring. If you're currently achieving 99.5%, setting a 99.9% SLO requires real reliability investment. If you're already at 99.95%, you can set a more ambitious target.