Observability Explained: Metrics, Logs, Traces, and What Most Teams Miss

Observability is the ability to understand what a system is doing from the data it produces. This guide breaks down the three pillars (metrics, logs, and traces), how uptime monitoring fits in, and how to build an observability practice that catches problems before users do.

Theo Cummings · July 27, 2026 · 13 min read

Observability is the ability to understand what a system is doing from the data it produces. When an incident happens, an observable system lets your team determine what failed, where it failed, and why, using existing instrumentation, without writing new code to get the answer.

The concept comes from control theory, where an observable system is one whose internal state can be inferred from its outputs. Applied to software, it means your application emits enough signals that engineers can reconstruct what happened during any incident from the telemetry alone.

Observability matters more as systems grow more distributed. A monolithic application running on one server has limited ways to fail. A microservices architecture with 40 services, a message queue, and three external APIs can fail in thousands of combinations. Without observability, diagnosing those failures is guesswork.

The three pillars: metrics, logs, and traces

Almost every observability framework starts with these three data types. They answer different questions and complement each other. Understanding what each one does, and what it cannot do, tells you where to invest your instrumentation effort.

Metrics

Metrics are numeric measurements collected at regular intervals over time. They are the fastest data to query and store, and the best tool for detecting that something changed.

A metric has three parts:

  1. A name (http_request_duration_seconds, database_connections_active)
  2. A value at a point in time (0.245, 42)
  3. Labels that describe context (method=GET, status=200, region=eu-west-1)

Common metric types:

Type | What it measures | Example
Counter | A value that only increases | Total HTTP requests served
Gauge | A value that goes up and down | Active database connections
Histogram | Distribution of values across buckets | Response time percentiles (P50, P95, P99)
Summary | Pre-computed percentiles from the application | P99 latency calculated at the app level
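
To make the first three types concrete, here is a minimal sketch in plain Python. The class names mirror the table, but this is an illustration, not the API of Prometheus or any other client library, and the bucket bounds are made-up values.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """A value that only increases (for example, total requests served)."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """A value that can go up and down (for example, active connections)."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by sorted upper bounds."""
    bounds: list[float]
    counts: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # One slot per upper bound, plus a final overflow slot.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


requests_total = Counter()
active_connections = Gauge()
request_seconds = Histogram(bounds=[0.1, 0.5, 1.0])  # hypothetical buckets

for duration in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    request_seconds.observe(duration)
active_connections.set(42)
```

The histogram here stores per-bucket counts to keep the sketch short. Prometheus-style histograms expose cumulative counts instead, and percentiles are estimated from the buckets at query time.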

Tools such as Prometheus, Datadog, and InfluxDB store millions of time-series data points efficiently, which is why metrics stay cheap even at high volume. The trade-off: metrics tell you that something changed but not why.

Your error rate jumping from 0.1% to 8% at 2:47 AM tells you an incident started. It does not tell you which code path is failing, which user is affected, or which dependency caused it.

The four golden signals (from Google's SRE book) are the minimum metric set for any production service:

  • Latency: How long requests take, measured separately for successful and failed requests
  • Traffic: How many requests per second the service is handling
  • Errors: The rate of failed requests (5xx, timeouts, explicit errors)
  • Saturation: How close the service is to its capacity limit (CPU, memory, connection pool)

These four metrics give you the most diagnostic leverage per metric tracked.
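
As a rough illustration, the four signals can be derived from a window of request records plus one capacity reading. The `Request` type, the field names, and the connection-pool numbers below are assumptions made for the example. Mean latency is used only to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def mean(values):
    return sum(values) / len(values) if values else 0.0


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Compute the four golden signals for one time window."""
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        "latency_ok_ms": mean(ok),            # latency of successful requests
        "latency_error_ms": mean(failed),     # latency of failed requests
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": pool_in_use / pool_size,
    }


sample = [Request(120, 200), Request(80, 200), Request(3000, 503), Request(100, 200)]
signals = golden_signals(sample, window_seconds=2, pool_in_use=18, pool_size=20)
```

Keeping the two latency figures apart matters here: the single failed request took 3000ms, and folding it into one average would make the successful requests look far slower than they are.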

Logs

Logs are timestamped records of events. Every time your application does something worth recording, it emits a log entry: a request arrived, a database query ran, an error occurred, a background job completed.

Logs provide context that metrics cannot. Where a metric tells you error rate is 8%, logs tell you which specific requests failed, what the error messages were, and which stack trace they produced.

Structured vs unstructured logs:

Unstructured logs look like this:

2026-07-03 14:32:11 ERROR Failed to connect to database: timeout after 5000ms

Structured logs look like this:

{
  "timestamp": "2026-07-03T14:32:11Z",
  "level": "error",
  "message": "database connection failed",
  "error": "timeout",
  "timeout_ms": 5000,
  "service": "api",
  "request_id": "req_8xK2mNp"
}

Structured logs are searchable by field. You can query for all errors from the api service with timeouts over 3000ms, then correlate them with a specific deployment or traffic spike. Unstructured logs require parsing that breaks when the format changes.

Move to structured logging first. It is the highest-leverage improvement in most observability practices.
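
One way to get there with only the Python standard library is a custom formatter that serializes each record as JSON. This is a sketch, not a recommendation of a specific library. The `fields` key passed through `extra` and the hard-coded service name are conventions invented for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "api",  # assumed service name
        }
        # Merge structured fields passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "database connection failed",
    extra={"fields": {"error": "timeout", "timeout_ms": 5000,
                      "request_id": "req_8xK2mNp"}},
)
```

Because numbers such as `timeout_ms` stay numeric in the output, a log backend can filter on them directly instead of parsing them out of a message string.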

What to log at each level:

Level | When to use it | Example
ERROR | Unexpected failures that need investigation | Database query failed
WARN | Unexpected conditions that did not cause failure | Cache miss rate above threshold
INFO | Normal significant events | User signed up, payment processed
DEBUG | Detailed diagnostic information | Skip in production; enable only in development

Log ERROR sparingly. Every error log should be actionable. If you cannot do anything about it, it is not an error; it is noise. Teams that log everything at ERROR level train themselves to ignore the logs, and that defeats the purpose.

Traces

Traces track a single request as it moves through your system. In a distributed architecture, one user action can trigger calls to five internal services, two external APIs, and three database queries. A trace follows that request through every hop, recording timing at each step.

A trace is made of spans. Each span represents one operation in the request path:

User request (total: 347ms)
├── Auth service (23ms)
├── Product API (198ms)
│   ├── Cache lookup (4ms) [MISS]
│   ├── Database query (187ms) [SLOW]
│   └── Cache write (7ms)
└── Response serialization (9ms)

Looking at this trace, you immediately see that the database query inside the Product API is consuming 54% of the total request time. Without tracing, you would see a slow P95 latency metric and spend time guessing which service caused it.
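
The same reading can be automated. The sketch below models the trace above as a tree of spans and finds the slowest leaf operation. The `Span` class is a simplified stand-in for what a tracing backend stores, not a real SDK type.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def walk(span):
    """Yield a span and all of its descendants."""
    yield span
    for child in span.children:
        yield from walk(child)


trace = Span("User request", 347, [
    Span("Auth service", 23),
    Span("Product API", 198, [
        Span("Cache lookup", 4),
        Span("Database query", 187),
        Span("Cache write", 7),
    ]),
    Span("Response serialization", 9),
])

# Leaf spans are where time is actually spent, so the slowest leaf is
# usually the first place to look.
leaves = [s for s in walk(trace) if not s.children]
slowest = max(leaves, key=lambda s: s.duration_ms)
share = slowest.duration_ms / trace.duration_ms
print(f"{slowest.name}: {share:.0%} of total")  # prints "Database query: 54% of total"
```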

When traces matter most: Traces are essential in microservices and when diagnosing latency issues. They answer "which service is slow?" in a way that metrics cannot.

OpenTelemetry is the standard instrumentation framework for traces. It provides vendor-neutral SDKs for most programming languages (Python, Node.js, Java, Go, Ruby, .NET) and a Collector that receives traces and forwards them to any compatible backend: Jaeger, Zipkin, Datadog, Grafana Tempo, AWS X-Ray.

How monitoring and observability differ

Monitoring and observability are related but not the same thing.

Monitoring is the practice of watching predefined metrics and alerting when they breach thresholds. You decide in advance what matters, set thresholds, and get paged when thresholds break.

Observability is a property of your system: it is observable if engineers can understand what happened during any failure using the data it produces, including failures you did not anticipate.

The distinction matters most during novel failures. If your service fails in a way you have never seen before, monitoring may not catch it (you did not define a metric for that failure mode). An observable system lets you explore: query logs by time window, look for anomalous traces, compare metric trends across services.

A practical way to think about it:

  • Monitoring answers: is something wrong?
  • Observability answers: why is something wrong?

You need both. Monitoring detects. Observability diagnoses.

Uptime monitoring in the observability stack

Uptime monitoring is your system's external signal. Everything discussed above (metrics, logs, and traces) comes from inside your infrastructure. Uptime monitoring comes from outside.

It answers the question your customers are asking: can I reach this service right now?

A server can look healthy from the inside while failing completely for users. Your metrics show normal CPU, memory, and database connection counts. Your logs show no errors. But your load balancer is misconfigured and routing zero traffic to the healthy app servers. Users get 502 errors. Your internal telemetry sees nothing wrong.

Uptime monitoring catches this because it checks from outside. It sends a real HTTP request, validates the response code, checks for expected content, measures latency, and records the result. If the response is wrong, the monitor knows before any internal signal fires.
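
A single external check might look like the sketch below, using only the Python standard library. The function names, the 2000ms latency limit, and the expected-text check are assumptions for illustration. A real monitor would also run from several regions and on a schedule.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    reason: str
    latency_ms: float


def evaluate(status, body, latency_ms, expected_text, max_latency_ms=2000):
    """Decide whether one response counts as healthy."""
    if status != 200:
        return CheckResult(False, f"unexpected status {status}", latency_ms)
    if expected_text not in body:
        return CheckResult(False, "expected content missing", latency_ms)
    if latency_ms > max_latency_ms:
        return CheckResult(False, "response too slow", latency_ms)
    return CheckResult(True, "ok", latency_ms)


def check(url, expected_text, timeout_s=10.0):
    """Send one real HTTP request and evaluate the response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as exceptions that carry a status.
        status, body = exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        elapsed = (time.monotonic() - start) * 1000
        return CheckResult(False, f"request failed: {exc}", elapsed)
    elapsed = (time.monotonic() - start) * 1000
    return evaluate(status, body, elapsed, expected_text)
```

Checking for expected content is what separates this from a simple ping: a 200 response that serves an error page still fails the check.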

Think of uptime monitoring as the user-perspective pillar of observability:

Pillar | Perspective | Answers
Metrics | Internal | Is the system healthy by its own measurements?
Logs | Internal | What events occurred and in what sequence?
Traces | Internal | Which path did this request take?
Uptime monitoring | External | Is the service working from the user's perspective?

For SaaS companies, uptime monitoring is where observability starts. It establishes your ground truth: is the service up or not? Everything else in your stack explains why.

See the uptime monitoring guide for how to build the external monitoring layer.

Building an observability practice

Step 1: Establish external monitoring first

Before instrumenting your application, know whether it is up. Add uptime checks for every customer-facing endpoint. Run them from multiple regions. Configure alerts that fire when checks fail consistently, not on the first failure.

This gives you the user-facing signal. Everything else you add improves your ability to explain what the uptime check already detected.
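
Firing on consistent failure rather than the first failure can be as simple as counting consecutive failed checks. The sketch below assumes a threshold of three, which is an arbitrary value chosen for the example, not a recommendation from this guide.

```python
class ConsecutiveFailureAlert:
    """Page once when a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.firing = False

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True only when a page should be sent."""
        if check_ok:
            # Any success resets the streak and clears the alert.
            self.failures = 0
            self.firing = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.firing:
            self.firing = True
            return True
        return False


alert = ConsecutiveFailureAlert(threshold=3)
results = [True, False, True, False, False, False, False, True]
pages = [alert.record(ok) for ok in results]
```

In this run the isolated failure at position two never pages. Only the third failure in a row does, and the fourth stays quiet because the alert is already firing.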

Step 2: Add structured logging

Switch from unstructured to structured logging. Add a request ID that propagates through all service calls. Log errors with enough context to reproduce the failure.

A request ID is the cheapest form of tracing. Even without a full distributed tracing system, you can filter logs by request ID and reconstruct what happened during a specific user's failed request.
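
In Python, one possible way to attach a request ID to every log line is a context variable plus a logging filter. The `X-Request-ID` header name is a common convention rather than a standard, and `handle_request` is a hypothetical handler standing in for your framework's entry point.

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(incoming_headers: dict) -> dict:
    """Adopt or create a request ID and return headers for downstream calls."""
    # Reuse the caller's ID when present so the whole chain shares one ID.
    request_id = incoming_headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:8]}"
    request_id_var.set(request_id)
    logger.info("request received")
    return {"X-Request-ID": request_id}
```

Each service that receives the returned header and follows the same pattern logs under the same ID, which is what makes filtering by request ID work across services.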

Step 3: Instrument the four golden signals

Add metrics for latency, traffic, errors, and saturation. Expose a /metrics endpoint in Prometheus format or send metrics to your APM tool. Build dashboards for each service that show these four signals for the past 24 hours.

At this point you can detect most production incidents and see whether they affect specific services or all services simultaneously.
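
For reference, the body of a /metrics response in the Prometheus text format is plain lines of name, labels, and value. The sketch below renders one counter by hand to show the shape. In practice a client library generates this for you, and the helper name and sample values here are made up.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(body)
```

This sketch does not escape backslashes, quotes, or newlines in label values, which the real format requires.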

Step 4: Add distributed tracing

Instrument your most latency-sensitive services first. Use OpenTelemetry so you can switch backend vendors without re-instrumenting. Trace every request that crosses a service boundary.

Start with your five most-called API endpoints. Once those are traced, move to your background jobs and async workers.
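
Crossing a service boundary means passing trace context along with the request. OpenTelemetry SDKs handle this for you, by default through the W3C Trace Context `traceparent` header. The hand-rolled sketch below only illustrates what that header carries. The function name is hypothetical, and this is not how you would instrument a real service.

```python
import re
import secrets
from typing import Optional

# version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_traceparent(incoming: Optional[str]) -> str:
    """Build the header for a downstream call, continuing the trace if one exists."""
    match = TRACEPARENT.match(incoming or "")
    if match:
        # Keep the caller's trace ID and sampling flags.
        trace_id, flags = match.group(1), match.group(3)
    else:
        # No valid context arrived: start a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # ID of this service's own span
    return f"00-{trace_id}-{span_id}-{flags}"
```

The trace ID stays constant across every hop while each service contributes a new span ID. A trace that "stops at the first service boundary" is usually one where this header was dropped.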

Step 5: Build SLO-based alerting

Shift from threshold-based alerts ("alert when P95 latency exceeds 500ms") to SLO-based alerts ("alert when error budget is burning faster than expected").

SLO-based alerting reduces alert fatigue by focusing on customer impact rather than internal thresholds. A P95 latency of 550ms at 2 AM on a Sunday affects almost no users. The same latency at 9 AM on Monday consumes error budget at 10x the normal rate and warrants a page.
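
A burn rate makes that comparison numeric. The sketch below assumes a 99.9% SLO over 30 days, a made-up monthly traffic figure, and a definition of "bad request" (failed or slower than the latency target) that you would choose yourself.

```python
def burn_rate(bad_requests, window_hours, period_requests,
              slo_target=0.999, period_hours=30 * 24):
    """Return how fast the error budget is burning; 1.0 means exactly on pace."""
    # Total bad requests the SLO allows over the whole period.
    allowed_bad = (1.0 - slo_target) * period_requests
    # Scale this window's bad requests up to a full period.
    projected_bad = bad_requests * (period_hours / window_hours)
    return projected_bad / allowed_bad


PERIOD_REQUESTS = 72_000_000  # hypothetical 30-day request volume

sunday_2am = burn_rate(bad_requests=20, window_hours=1,
                       period_requests=PERIOD_REQUESTS)
monday_9am = burn_rate(bad_requests=1000, window_hours=1,
                       period_requests=PERIOD_REQUESTS)
print(f"Sunday 02:00 burn rate: {sunday_2am:.1f}")
print(f"Monday 09:00 burn rate: {monday_9am:.1f}")
```

With these numbers the quiet hour burns budget at roughly 0.2 times the sustainable pace and the busy hour at roughly 10 times, even though the latency threshold breached is the same in both cases.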

Read how to reduce false positive alerts for the alert configuration approach that keeps your on-call rotation focused on real problems.

Observability tooling

The observability tool market covers three layers:

Collection and storage:

  • Prometheus: open-source metrics collection and storage; the de facto standard for Kubernetes environments
  • Loki: Grafana's log aggregation system; indexes metadata rather than log content, keeping costs low
  • Jaeger / Zipkin: open-source distributed tracing backends

Analysis and visualization:

  • Grafana: the dominant open-source dashboard layer; works with Prometheus, Loki, Tempo, and most commercial backends
  • Kibana: Elastic's visualization layer for the ELK stack (Elasticsearch, Logstash, Kibana)

Commercial platforms:

  • Datadog: metrics, logs, traces, and APM in one platform; high cost at scale
  • New Relic: similar all-in-one approach; per-user pricing can be cheaper for small teams with high data volume
  • Grafana Cloud: managed Prometheus, Loki, and Tempo; predictable pricing based on data volume

See the best observability tools comparison for a full breakdown by use case and team size.

Common observability mistakes

Logging everything at ERROR level. If every log is an error, no log is an error. Use log levels deliberately. ERROR should be actionable. INFO should be meaningful. Skip DEBUG in production.

Metrics without percentiles. Average latency hides outliers. A P99 of 4 seconds on an endpoint with a 200ms average means 1% of requests take at least 20 times longer than the average. Always track P95 and P99.
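
A quick way to see the effect is to compare the mean with the percentiles on a skewed sample. The durations below are invented for the example.

```python
import statistics

# 980 fast requests and 20 very slow ones, in milliseconds (made-up data).
durations_ms = [160] * 980 + [4000] * 20

mean = statistics.fmean(durations_ms)
# quantiles(n=100) returns the 99 cut points P1..P99.
cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.1f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

The mean lands around 237ms and looks healthy, while P99 shows the 4-second requests that the slowest users actually experience. Note that P95 also misses them here, which is why tracking both is worthwhile.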

Traces that stop at the first service boundary. A trace that ends at your API gateway tells you nothing about which downstream service is slow. Instrument every service that participates in a user-facing request.

Treating uptime monitoring as optional. Your internal telemetry is not your user's experience. Run external checks. They catch failures that internal monitoring misses.

Too many dashboards, too few alerts. Dashboards require someone to look at them. Alerts reach people when they are not looking. Build focused dashboards for incident triage, but rely on alerts for detection.

Alerting on causes instead of symptoms. Alert when users are affected (high error rate, high latency, service down), not when internal metrics change (CPU over 70%, cache miss rate increased). User-facing symptoms are what the on-call engineer needs to know about.

Observability and MTTR

Mean Time to Resolution (MTTR) is the metric that observability improves most directly. Much of the gap between detecting an incident and resolving it is diagnosis time: figuring out what failed and why.

A team with good observability detects the same incident as a team without it. But the team with good observability diagnoses it in 8 minutes because they can correlate a spike in database query duration with a deployment that ran 12 minutes earlier. The team without it spends 90 minutes checking services one by one.

Observability investment pays off in MTTR reduction. Track your MTTR month over month as you add instrumentation. The improvement is measurable.
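
Tracking it does not need special tooling. The sketch below computes MTTR per month from detection and resolution timestamps. The incident records are hypothetical, and in practice they would come from your incident tracker.

```python
from datetime import datetime
from statistics import fmean

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2026, 5, 4, 2, 47), datetime(2026, 5, 4, 4, 17)),
    (datetime(2026, 5, 19, 14, 5), datetime(2026, 5, 19, 15, 5)),
    (datetime(2026, 6, 11, 9, 30), datetime(2026, 6, 11, 9, 38)),
    (datetime(2026, 6, 27, 22, 0), datetime(2026, 6, 27, 22, 12)),
]


def mttr_by_month(records):
    """Return mean minutes from detection to resolution, keyed by YYYY-MM."""
    by_month = {}
    for detected, resolved in records:
        minutes = (resolved - detected).total_seconds() / 60
        by_month.setdefault(detected.strftime("%Y-%m"), []).append(minutes)
    return {month: fmean(values) for month, values in sorted(by_month.items())}


report = mttr_by_month(incidents)
for month, minutes in report.items():
    print(f"{month}: MTTR {minutes:.0f} min")
```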

See API monitoring: how to monitor REST APIs for how to apply observability principles to your API layer specifically.