What is observability?

Observability is the ability to understand the internal state of a system by examining the data it produces: metrics, logs, and traces. A system is fully observable when engineers can determine what went wrong, where it happened, and why, without writing new code or deploying new instrumentation.

What are the three pillars of observability?

The three pillars of observability are metrics (numeric measurements over time, such as error rate and latency), logs (timestamped event records from your application and infrastructure), and traces (records of requests as they travel through distributed services).

What is the difference between monitoring and observability?

Monitoring watches predefined metrics and alerts when they breach thresholds. Observability is a broader property: a system is observable when you can explore unknown failure modes using the data it emits, not just check whether known metrics cross known limits.

Where does uptime monitoring fit in observability?

Uptime monitoring is the external, user-perspective signal in your observability stack. It confirms whether your service is reachable and responding correctly from outside your infrastructure. Internal metrics, logs, and traces tell you why; uptime monitoring tells you whether.

What does OpenTelemetry do?

OpenTelemetry is an open-source framework that standardizes how applications emit metrics, logs, and traces. It provides vendor-neutral instrumentation libraries for most programming languages and the OpenTelemetry Collector, a service that receives telemetry and exports it to any compatible backend.

How mature is my observability practice?

A basic practice has centralized logs and uptime checks. An intermediate practice adds metrics dashboards and distributed tracing. An advanced practice has SLO-driven alerting, automated anomaly detection, and continuous profiling. Most SaaS teams operate at the intermediate level.

Observability Explained: Metrics, Logs, Traces, and What Most Teams Miss

Observability is the ability to understand what a system is doing from the data it produces. When an incident happens, an observable system lets your team determine what failed, where it failed, and why, using existing instrumentation, without writing new code to get the answer.

The concept comes from control theory, where an observable system is one whose internal state can be inferred from its outputs. Applied to software, it means your application emits enough signals that engineers can reconstruct what happened during any incident from the telemetry alone.

Observability matters more as systems grow more distributed. A monolithic application running on one server has limited ways to fail. A microservices architecture with 40 services, a message queue, and three external APIs can fail in thousands of combinations. Without observability, diagnosing those failures is guesswork.

The three pillars: metrics, logs, and traces

Almost every observability framework starts with these three data types. They answer different questions and complement each other. Understanding what each one does, and what it cannot do, tells you where to invest your instrumentation effort.

Metrics

Metrics are numeric measurements collected at regular intervals over time. They are the fastest data to query and store, and the best tool for detecting that something changed.

A metric has three parts:

A name (http_request_duration_seconds, database_connections_active)
A value at a point in time (0.245, 42)
Labels that describe context (method=GET, status=200, region=eu-west-1)

Common metric types:

Type	What it measures	Example
Counter	A value that only increases	Total HTTP requests served
Gauge	A value that goes up and down	Active database connections
Histogram	Distribution of values across buckets	Response time percentiles (P50, P95, P99)
Summary	Pre-computed percentiles from the application	P99 latency calculated at the app level
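
To make the first three metric types concrete, here is a minimal in-memory sketch in Python. The class names and the bucket boundaries are illustrative assumptions, not the API of Prometheus or any other client library; a real client also handles labels, concurrency, and export.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """A value that only increases (for example, total requests served)."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


@dataclass
class Gauge:
    """A value that can go up and down (for example, active connections)."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by their upper bounds."""
    upper_bounds: list[float]
    counts: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.upper_bounds = sorted(self.upper_bounds)
        # One extra slot catches observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1


# Hypothetical usage: bucket bounds are in seconds.
latency = Histogram(upper_bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(latency.counts)  # [1, 2, 0, 1]
```

Percentiles such as P95 are then estimated from the bucket counts at query time, which is why bucket boundaries should bracket the latencies you care about.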

Time-series backends such as Prometheus, Datadog, and InfluxDB store millions of data points efficiently, which is why metrics are cheap to keep and fast to query. The trade-off: metrics tell you that something changed but not why.

Your error rate jumping from 0.1% to 8% at 2:47 AM tells you an incident started. It does not tell you which code path is failing, which user is affected, or which dependency caused it.

The four golden signals (from Google's SRE book) are the minimum metric set for any production service:

Latency: How long requests take, tracked separately for successful and failed requests
Traffic: How many requests per second the service is handling
Errors: The rate of failed requests (5xx, timeouts, explicit errors)
Saturation: How close the service is to its capacity limit (CPU, memory, connection pool)

These four signals give you the most diagnostic leverage for the fewest metrics tracked.
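
As a rough illustration, the four signals can be derived from one window of request records. This is a sketch under assumptions: the `Request` shape, the choice of mean latency, and treating any 5xx status as an error are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def _mean(values):
    return sum(values) / len(values) if values else None


def golden_signals(requests, window_seconds, in_use, capacity):
    """Summarize one time window. Assumes `requests` is non-empty."""
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency is reported separately for successes and failures.
        "latency_ok_ms": _mean(ok),
        "latency_error_ms": _mean(failed),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests),
        # Saturation here is connection pool usage; CPU or memory work the same way.
        "saturation": in_use / capacity,
    }


window = [Request(100, 200), Request(200, 200), Request(300, 200), Request(900, 503)]
print(golden_signals(window, window_seconds=2, in_use=45, capacity=50))
```

In production you would usually track latency as percentiles rather than a mean; the mean keeps this example short.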

Logs

Logs are timestamped records of events. Every time your application does something worth recording, it emits a log entry: a request arrived, a database query ran, an error occurred, a background job completed.

Logs provide context that metrics cannot. Where a metric tells you error rate is 8%, logs tell you which specific requests failed, what the error messages were, and which stack trace they produced.

Structured vs unstructured logs:

Unstructured logs look like this:

2026-07-03 14:32:11 ERROR Failed to connect to database: timeout after 5000ms

Structured logs look like this:

{
  "timestamp": "2026-07-03T14:32:11Z",
  "level": "error",
  "message": "database connection failed",
  "error": "timeout",
  "timeout_ms": 5000,
  "service": "api",
  "request_id": "req_8xK2mNp"
}

Structured logs are searchable by field. You can query for all errors from the api service with timeouts over 3000ms, then correlate them with a specific deployment or traffic spike. Unstructured logs require parsing that breaks when the format changes.
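
The query described above can be sketched in a few lines of Python once each log line is JSON. The first entry below is the example from this section; the other two entries, and the function name `slow_timeouts`, are made up for contrast.

```python
import json

raw_lines = [
    '{"timestamp": "2026-07-03T14:32:11Z", "level": "error", "message": "database connection failed", "error": "timeout", "timeout_ms": 5000, "service": "api", "request_id": "req_8xK2mNp"}',
    # Hypothetical entries that should not match the query.
    '{"timestamp": "2026-07-03T14:32:12Z", "level": "error", "message": "database connection failed", "error": "timeout", "timeout_ms": 2000, "service": "api", "request_id": "req_hypothetical_1"}',
    '{"timestamp": "2026-07-03T14:32:13Z", "level": "info", "message": "job completed", "service": "worker", "request_id": "req_hypothetical_2"}',
]


def slow_timeouts(lines, service, min_timeout_ms):
    """Return error entries from `service` whose timeout exceeds the limit."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if (
            entry.get("level") == "error"
            and entry.get("service") == service
            and entry.get("error") == "timeout"
            and entry.get("timeout_ms", 0) > min_timeout_ms
        ):
            matches.append(entry)
    return matches


for entry in slow_timeouts(raw_lines, service="api", min_timeout_ms=3000):
    print(entry["request_id"], entry["timeout_ms"])  # req_8xK2mNp 5000
```

A log platform runs the same kind of field comparison at scale; the point is that no regular expression over free text is involved.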

Move to structured logging first. It is the highest-leverage improvement in most observability practices.

What to log at each level:

Level	When to use it	Example
`ERROR`	Unexpected failures that need investigation	Database query failed
`WARN`	Unexpected conditions that did not cause failure	Cache miss rate above threshold
`INFO`	Normal significant events	User signed up, payment processed
`DEBUG`	Detailed diagnostic information	Step-by-step execution detail; enable only in development

Log ERROR sparingly. Every error log should be actionable. If you cannot do anything about it, it is not an error; it is noise. Teams that log everything at ERROR level train themselves to ignore the logs, and that defeats the purpose.

Traces

Traces track a single request as it moves through your system. In a distributed architecture, one user action can trigger calls to five internal services, two external APIs, and three database queries. A trace follows that request through every hop, recording timing at each step.

A trace is made of spans. Each span represents one operation in the request path:

User request (total: 347ms)
├── Auth service (23ms)
├── Product API (198ms)
│   ├── Cache lookup (4ms) [MISS]
│   ├── Database query (187ms) [SLOW]
│   └── Cache write (7ms)
└── Response serialization (9ms)

Looking at this trace, you immediately see that the database query inside the Product API is consuming 54% of the total request time. Without tracing, you would see a slow P95 latency metric and spend time guessing which service caused it.
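
The same reading can be done programmatically. The sketch below models the trace above as a tree of spans and finds the slowest leaf operation; the `Span` class is a simplified stand-in for illustration, not the data model of any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def walk(span):
    """Yield the span and every span beneath it."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_leaf(root):
    """Return the longest span that has no children of its own."""
    leaves = [s for s in walk(root) if not s.children]
    return max(leaves, key=lambda s: s.duration_ms)


trace = Span("User request", 347, [
    Span("Auth service", 23),
    Span("Product API", 198, [
        Span("Cache lookup", 4),
        Span("Database query", 187),
        Span("Cache write", 7),
    ]),
    Span("Response serialization", 9),
])

slowest = slowest_leaf(trace)
share = slowest.duration_ms / trace.duration_ms
print(f"{slowest.name}: {share:.0%} of total")  # Database query: 54% of total
```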

When traces matter most: Traces are essential in microservices and when diagnosing latency issues. They answer "which service is slow?" in a way that metrics cannot.

OpenTelemetry is the de facto standard instrumentation framework for traces, and it covers metrics and logs as well. It provides vendor-neutral SDKs for most programming languages (Python, Node.js, Java, Go, Ruby, .NET) and the OpenTelemetry Collector, which ships traces to any compatible backend: Jaeger, Zipkin, Datadog, Grafana Tempo, AWS X-Ray.

How monitoring and observability differ

Monitoring and observability are related but not the same thing.

Monitoring is the practice of watching predefined metrics and alerting when they breach thresholds. You decide in advance what matters, set thresholds, and get paged when thresholds break.

Observability is a property of your system: it is observable if engineers can understand what happened during any failure using the data it produces, including failures you did not anticipate.

The distinction matters most during novel failures. If your service fails in a way you have never seen before, monitoring may not catch it (you did not define a metric for that failure mode). An observable system lets you explore: query logs by time window, look for anomalous traces, compare metric trends across services.

A practical way to think about it:

Monitoring answers: is something wrong?
Observability answers: why is something wrong?

You need both. Monitoring detects. Observability diagnoses.

Uptime monitoring in the observability stack

Uptime monitoring is your system's external signal. Everything discussed above, metrics, logs, and traces, comes from inside your infrastructure. Uptime monitoring comes from outside.

It answers the question your customers are asking: can I reach this service right now?

A server can look healthy from the inside while failing completely for users. Your metrics show normal CPU, memory, and database connection counts. Your logs show no errors. But your load balancer is misconfigured and routing zero traffic to the healthy app servers. Users get 502 errors. Your internal telemetry sees nothing wrong.

Uptime monitoring catches this because it checks from outside. It sends a real HTTP request, validates the response code, checks for expected content, measures latency, and records the result. If the response is wrong, the monitor knows before any internal signal fires.
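
A single external check can be sketched with the Python standard library. The function names, the 2000ms latency limit, and the example URL in the usage note are assumptions for illustration; a real monitor runs this from several regions on a schedule.

```python
import time
import urllib.error
import urllib.request


def evaluate(status, body, latency_ms, expected_text, max_latency_ms=2000):
    """Return (healthy, reason) for one check result."""
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        return False, "expected content missing"
    if latency_ms > max_latency_ms:
        return False, f"slow response: {latency_ms:.0f}ms"
    return True, "ok"


def check(url, expected_text, timeout_s=10):
    """Send one HTTP GET from this machine and evaluate the response."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses arrive as exceptions; keep the status code.
        status, body = exc.code, ""
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"
    latency_ms = (time.monotonic() - started) * 1000
    return evaluate(status, body, latency_ms, expected_text)
```

Calling `check("https://example.com/health", expected_text="ok")` (a placeholder URL) returns a pair such as `(False, "unexpected status 502")` in the load balancer scenario above, even though every internal signal looks normal.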

Think of uptime monitoring as the user-perspective pillar of observability:

Pillar	Perspective	Answers
Metrics	Internal	Is the system healthy by its own measurements?
Logs	Internal	What events occurred and in what sequence?
Traces	Internal	Which path did this request take?
Uptime monitoring	External	Is the service working from the user's perspective?

For SaaS companies, uptime monitoring is where observability starts. It establishes your ground truth: is the service up or not? Everything else in your stack explains why.

See the uptime monitoring guide for how to build the external monitoring layer.

Building an observability practice

Step 1: Establish external monitoring first

Before instrumenting your application, know whether it is up. Add uptime checks for every customer-facing endpoint. Run them from multiple regions. Configure alerts that fire when checks fail consistently, not on the first failure.
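
One simple way to alert on consistent failure rather than the first failure is to require several consecutive failed checks. This is a sketch; the class name and the threshold of three are assumptions you would tune to your check interval.

```python
from collections import deque


class FailureTracker:
    """Signals an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # Keep only the most recent `threshold` results.
        self.recent = deque(maxlen=threshold)

    def record(self, healthy):
        """Store one check result and return True if an alert should fire."""
        self.recent.append(healthy)
        return len(self.recent) == self.threshold and not any(self.recent)


tracker = FailureTracker(threshold=3)
results = [False, True, False, False, False]
print([tracker.record(r) for r in results])  # [False, False, False, False, True]
```

The single failure at the start never pages anyone; the run of three at the end does.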

This gives you the user-facing signal. Everything else you add improves your ability to explain what the uptime check already detected.

Step 2: Add structured logging

Switch from unstructured to structured logging. Add a request ID that propagates through all service calls. Log errors with enough context to reproduce the failure.

A request ID is the cheapest form of tracing. Even without a full distributed tracing system, you can filter logs by request ID and reconstruct what happened during a specific user's failed request.
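
A minimal sketch of request ID propagation with Python's standard `logging` and `contextvars` modules follows. The `X-Request-ID` header name, the `req_` prefix, and the helper names are assumptions; use whatever convention your services already share.

```python
import contextvars
import io
import json
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger = logging.getLogger("api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_headers):
    """Reuse the caller's request ID, or mint one at the edge."""
    request_id = incoming_headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:8]}"
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        logger.error("database connection failed")
        # Pass the same ID on every outgoing call to downstream services.
        return {"X-Request-ID": request_id}
    finally:
        request_id_var.reset(token)


stream = io.StringIO()
outgoing = handle_request(build_logger(stream), {"X-Request-ID": "req_8xK2mNp"})
print(stream.getvalue())
```

Both log lines carry the same `request_id`, so filtering on that one field reconstructs the request across every service that forwards the header.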

Step 3: Instrument the four golden signals

Add metrics for latency, traffic, errors, and saturation. Expose a /metrics endpoint in Prometheus format or send metrics to your APM tool. Build dashboards for each service that show these four signals for the past 24 hours.
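
For reference, the Prometheus text format that a /metrics endpoint returns is simple enough to render by hand. The function below is a sketch that omits label-value escaping and histogram series; in practice an official client library generates this output for you. The metric name and sample value are illustrative.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed to contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_text:
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests served.",
    [({"method": "GET", "status": "200"}, 1027)],
))
```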

At this point you can detect most production incidents and see whether they affect specific services or all services simultaneously.

Step 4: Add distributed tracing

Instrument your most latency-sensitive services first. Use OpenTelemetry so you can switch backend vendors without re-instrumenting. Trace every request that crosses a service boundary.

Start with your five most-called API endpoints. Once those are traced, move to your background jobs and async workers.

Step 5: Build SLO-based alerting

Shift from threshold-based alerts ("alert when P95 latency exceeds 500ms") to SLO-based alerts ("alert when error budget is burning faster than expected").

SLO-based alerting reduces alert fatigue by focusing on customer impact rather than internal thresholds. A P95 latency of 550ms at 2 AM on a Sunday affects almost no users. The same latency at 9 AM on Monday consumes error budget at 10x the normal rate and warrants a page.
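
Burn rate is commonly computed as the observed error rate divided by the error budget, where the budget is 1 minus the SLO target. A burn rate of 1 spends the budget exactly over the SLO period; 10 spends it ten times too fast. The sketch below assumes a 99.9% availability SLO and a paging threshold of 10; both numbers are illustrative.

```python
def burn_rate(failed, total, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(failed, total, slo_target, threshold=10.0):
    return burn_rate(failed, total, slo_target) >= threshold


# Quiet period: 1 failure in 1,000 requests roughly matches the budget.
print(should_page(failed=1, total=1000, slo_target=0.999))   # False
# Busy period: 12 failures in 1,000 requests burns budget about 12x too fast.
print(should_page(failed=12, total=1000, slo_target=0.999))  # True
```

Production setups usually evaluate burn rate over more than one window (a short one and a long one) so that brief blips do not page and slow leaks are still caught.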

Read how to reduce false positive alerts for the alert configuration approach that keeps your on-call rotation focused on real problems.

Observability tooling

The observability tool market covers three layers:

Collection and storage:

Prometheus: open-source metrics collection and storage; the de facto standard for Kubernetes environments
Loki: Grafana's log aggregation system; indexes metadata rather than log content, keeping costs low
Jaeger / Zipkin: open-source distributed tracing backends

Analysis and visualization:

Grafana: the dominant open-source dashboard layer; works with Prometheus, Loki, Tempo, and most commercial backends
Kibana: Elastic's visualization layer for the ELK stack (Elasticsearch, Logstash, Kibana)

Commercial platforms:

Datadog: metrics, logs, traces, and APM in one platform; high cost at scale
New Relic: similar all-in-one approach; per-user pricing can be cheaper for small teams that run a lot of infrastructure
Grafana Cloud: managed Prometheus, Loki, and Tempo; predictable pricing based on data volume

See the best observability tools comparison for a full breakdown by use case and team size.

Common observability mistakes

Logging everything at ERROR level. If every log is an error, no log is an error. Use log levels deliberately. ERROR should be actionable. INFO should be meaningful. Skip DEBUG in production.

Metrics without percentiles. Average latency hides outliers. A P99 of 4 seconds on an endpoint with a 200ms average means 1% of requests take at least 20 times longer than the average. Always track P95 and P99.
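
A small worked example shows how an average can look healthy while the tail does not. The latency sample is invented, and the nearest-rank method used here is one of several common percentile definitions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [120] * 98 + [4000, 4000]

average = sum(latencies_ms) / len(latencies_ms)
print(f"average: {average:.1f}ms")              # average: 197.6ms
print(f"P50: {percentile(latencies_ms, 50)}ms")  # P50: 120ms
print(f"P99: {percentile(latencies_ms, 99)}ms")  # P99: 4000ms
```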

Traces that stop at the first service boundary. A trace that ends at your API gateway tells you nothing about which downstream service is slow. Instrument every service that participates in a user-facing request.

Treating uptime monitoring as optional. Your internal telemetry is not your user's experience. Run external checks. They catch failures that internal monitoring misses.

Too many dashboards, too few alerts. Dashboards require someone to look at them. Alerts reach people when they are not looking. Build focused dashboards for incident triage, but rely on alerts for detection.

Alerting on causes instead of symptoms. Alert when users are affected (high error rate, high latency, service down), not when internal metrics change (CPU over 70%, cache miss rate increased). User-facing symptoms are what the on-call engineer needs to know about.

Observability and MTTR

Mean Time to Resolution (MTTR) is the metric that observability improves most directly. Much of the gap between detecting an incident and resolving it is diagnosis time: figuring out what failed and why.

Two teams can detect the same incident at the same moment. The team with good observability might diagnose it in 8 minutes because they can correlate a spike in database query duration with a deployment that ran 12 minutes earlier. The team without it might spend 90 minutes checking services one by one.

Observability investment pays off in MTTR reduction. Track your MTTR month over month as you add instrumentation. The improvement is measurable.
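
Tracking MTTR needs nothing more than a detection time and a resolution time per incident. The sketch below uses invented incident records and measures from detection to resolution; adjust the start point if your team measures from incident start instead.

```python
from datetime import datetime
from statistics import mean


def mttr_minutes(incidents):
    """Mean minutes from detection to resolution across (detected, resolved) pairs."""
    return mean(
        (resolved - detected).total_seconds() / 60
        for detected, resolved in incidents
    )


# Hypothetical incident records for two consecutive months.
june = [(datetime(2026, 6, 4, 2, 47), datetime(2026, 6, 4, 4, 17))]
july = [(datetime(2026, 7, 3, 14, 32), datetime(2026, 7, 3, 14, 40))]

print(f"June MTTR: {mttr_minutes(june):.0f} min")  # June MTTR: 90 min
print(f"July MTTR: {mttr_minutes(july):.0f} min")  # July MTTR: 8 min
```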

See API monitoring: how to monitor REST APIs for how to apply observability principles to your API layer specifically.