Monitoring and Observability 101: Why It Matters
Monitoring tells you when something is broken. Observability tells you why it is broken. You need both, and every production system that serves real users needs them from day one -- not as an afterthought bolted on after the first outage.
The distinction matters because modern systems are complex. A single user request might pass through a load balancer, an API gateway, three microservices, two databases, and a cache before returning a response. When that request fails or slows down, monitoring tells you something is wrong. Observability tells you which service, which function, which database query, and why.
This guide covers the foundations: what to monitor, what tools to use, how to build dashboards that actually help, and how to set up alerting that does not wake you up at 3am for problems that do not matter. For the broader context of how monitoring fits into the DevOps toolchain, see the complete DevOps tools guide.
Monitoring vs observability
These terms are often used interchangeably, but they describe different things.
Monitoring
Monitoring is the practice of collecting predefined measurements from your systems and alerting when those measurements cross a threshold. It answers questions you have already thought of:
- Is the server running?
- Is CPU usage above 80%?
- Are more than 1% of requests returning errors?
- Is the database running out of connections?
Monitoring is reactive. You define what to watch, set thresholds, and wait for alerts. It works well for known problems -- the ones you have already experienced or anticipated.
Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. It answers questions you have not thought of yet:
- Why are requests from users in Europe 3x slower than requests from the US?
- Which specific database query is causing the latency spike?
- Why did the error rate increase after the last deployment, but only for authenticated users?
Observability is proactive. Instead of waiting for predefined alerts, you can explore your system's behaviour, follow a trail of evidence, and diagnose novel problems. You achieve observability by instrumenting your systems to produce rich, structured data -- not just "is this metric above a threshold?"
You need both
Monitoring catches known problems fast. Observability lets you diagnose unknown problems. A mature system has monitoring for automated alerting and observability for human investigation. Start with monitoring. Add observability as your system grows in complexity.
The three pillars of observability
Observability rests on three types of telemetry data. Each one provides a different lens for understanding system behaviour. Together, they give you the complete picture.
Pillar 1: Metrics
Metrics are numerical measurements collected at regular intervals. They are the most efficient type of telemetry data -- small in size, fast to query, and easy to aggregate.
Examples of metrics:
- CPU utilisation: 72%
- Request latency (p99): 245ms
- Active database connections: 18/100
- HTTP 5xx error count: 4 in the last minute
- Memory usage: 3.2GB / 8GB
Metrics are best for dashboards, alerting, and trend analysis. They tell you what is happening at a high level. When you see a latency spike on a dashboard, that is a metric telling you something changed.
Key tool: Prometheus is the industry standard for metrics collection. It uses a pull-based model where it scrapes metrics from your services at regular intervals. It stores time-series data efficiently and supports a powerful query language called PromQL.
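As a taste of PromQL, here are two common queries. They assume your services export a conventional counter named http_requests_total and a histogram named http_request_duration_seconds; the metric names are illustrative conventions, not anything built into Prometheus itself:

```
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# p99 latency, computed from the histogram's buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

The rate() function is how you turn an ever-increasing counter into a per-second rate, and histogram_quantile() is how you get percentiles out of a histogram.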
Pillar 2: Logs
Logs are timestamped text records of discrete events. Every time something happens in your application -- a user logs in, a request fails, a background job completes -- a log entry records it.
Examples of log entries:
2024-10-08T14:23:01Z INFO [api] User login successful user_id=abc123 ip=192.168.1.1
2024-10-08T14:23:02Z ERROR [payment] Payment processing failed order_id=xyz789 error="timeout"
2024-10-08T14:23:05Z WARN [database] Connection pool at 90% capacity pool_size=90/100
Logs are best for detailed investigation. When metrics tell you something is wrong, logs tell you what specifically happened. They provide the narrative context that metrics lack.
Structured vs unstructured logs: Unstructured logs are plain text. Structured logs use a consistent format (usually JSON) with defined fields. Structured logs are dramatically easier to search, filter, and analyse. Always use structured logging in production.
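As a minimal sketch of structured logging using only the Python standard library (the logger name and fields like order_id are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": ...}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"fields": {"order_id": "xyz789", "error": "timeout"}})
```

Every entry now has the same queryable shape, so a log backend can filter on fields like order_id directly instead of grepping free text.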
Key tools: The ELK stack (Elasticsearch, Logstash, Kibana) is the most widely used log management solution. Elasticsearch stores and indexes the logs. Logstash (or Fluentd/Fluent Bit) collects and transforms them. Kibana provides the search interface and visualisation. Alternatives include Loki (from the Grafana team), which integrates neatly with Prometheus and Grafana.
Pillar 3: Traces
Traces follow a single request as it moves through your system. In a microservices architecture, one user request might touch five or ten different services. A trace records every hop, showing you exactly which service handled the request, how long each step took, and where failures or slowdowns occurred.
Example of a trace:
Request: GET /api/orders/789
-> API Gateway (2ms)
-> Auth Service (15ms)
-> Order Service (120ms)
   -> Database Query (95ms) <-- bottleneck
-> Cache Lookup (3ms)
-> Notification Service (8ms)
Total: 148ms
Without tracing, you would know the request took 148ms but not which service contributed the most latency. With tracing, you can see immediately that the database query in the Order Service is the bottleneck.
Key tools: Jaeger and Zipkin are the most popular open-source distributed tracing tools. Both support the OpenTelemetry standard, which is the emerging industry standard for telemetry collection. If you are starting fresh, use OpenTelemetry for instrumentation and Jaeger for the tracing backend.
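To make the span concept concrete, here is a toy, single-process sketch using only the Python standard library. A real system would use the OpenTelemetry SDK, which also propagates trace context across service boundaries; this sketch only shows the core idea of timing named steps:

```python
import time
from contextlib import contextmanager

spans = []  # (name, duration_ms), appended as each step finishes

@contextmanager
def span(name):
    """Time one named step of a request -- a stripped-down trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("order-service"):
    with span("database-query"):
        time.sleep(0.01)          # stand-in for a slow query
    with span("cache-lookup"):
        pass

for name, ms in spans:
    print(f"{name}: {ms:.1f}ms")  # database-query dominates order-service's time
```

Even in this toy version, the nested timings show which inner step consumed the parent's time; distributed tracing does the same across process and network boundaries.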
The four golden signals
Google's Site Reliability Engineering book defines four metrics that every service should track. If you monitor nothing else, monitor these.
1. Latency
How long requests take to process. Track successful and failed requests separately -- a fast error is different from a slow success.
What to track:
- p50 (median): What most users experience
- p95: What the slowest 5% of users experience
- p99: What the slowest 1% of users experience
- Error latency: How quickly failed requests return
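Percentiles are simple to compute: sort the samples and pick the value at the right rank. A minimal nearest-rank sketch (the sample latencies are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p% of all samples at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Ten request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 900]
print(percentile(latencies_ms, 50))  # -> 25  (the typical experience)
print(percentile(latencies_ms, 99))  # -> 900 (the tail an average would blur)
```

This is also why percentiles beat averages for latency: the mean of this sample is 124.7ms, a number that describes nobody's actual experience.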
Why it matters: Users notice latency before they notice anything else. A 100ms increase in page load time can reduce conversion rates by 7%. Latency is your most user-facing metric.
2. Traffic
How much demand is being placed on your system. For a web service, this is typically HTTP requests per second. For a message queue, it is messages per second. For a database, it is queries per second.
What to track:
- Requests per second (overall and per endpoint)
- Requests by status code (2xx, 4xx, 5xx)
- Requests by method (GET, POST, PUT, DELETE)
Why it matters: Traffic patterns reveal usage patterns, help with capacity planning, and provide context for other signals. A latency spike during a traffic spike means something different from a latency spike during normal traffic.
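Monitoring systems derive requests per second from an ever-increasing counter rather than from raw totals. The helper below (a hypothetical name, not a library function) is a miniature of what PromQL's rate() computes:

```python
def per_second_rate(samples):
    """Turn two (timestamp, cumulative_count) counter samples into a rate.

    Counters only ever go up, so the per-second rate is simply the
    delta in the count divided by the width of the time window.
    """
    (t1, c1), (t2, c2) = samples
    return (c2 - c1) / (t2 - t1)

# Two scrapes 15 seconds apart: the counter went from 1200 to 1650 requests.
print(per_second_rate([(100, 1200), (115, 1650)]))  # -> 30.0 requests/second
```

Rates are what you graph and alert on; the raw cumulative total is only the bookkeeping underneath.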
3. Errors
The rate of requests that fail. This includes explicit failures (HTTP 5xx responses), implicit failures (HTTP 200 with wrong content), and policy failures (responses slower than a committed SLA).
What to track:
- Error rate (errors / total requests)
- Errors by type (5xx, timeouts, application-specific errors)
- Error rate by endpoint (some endpoints may be failing while others are fine)
Why it matters: Error rate is your most direct signal of user impact. A 5% error rate means 1 in 20 users is having a bad experience. Detect errors fast and you can fix them before most users are affected.
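A sketch of computing error rate per endpoint from raw (endpoint, status code) observations; in practice your metrics system aggregates this for you, but the arithmetic is this simple:

```python
from collections import defaultdict

def error_rates(requests):
    """Compute the 5xx error rate per endpoint from (endpoint, status) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, status in requests:
        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}

observed = [("/api/orders", 200), ("/api/orders", 500),
            ("/api/orders", 200), ("/api/orders", 200),
            ("/api/users", 200), ("/api/users", 200)]
print(error_rates(observed))  # /api/orders is failing 25% of the time
```

Breaking the rate down per endpoint is what reveals that one endpoint is on fire while the overall error rate still looks tolerable.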
4. Saturation
How full your resources are. CPU, memory, disk, network bandwidth, database connections -- every resource has a limit, and performance degrades as you approach that limit.
What to track:
- CPU utilisation (per instance and aggregate)
- Memory usage (used, cached, available)
- Disk usage and I/O
- Network bandwidth
- Database connection pool utilisation
- Queue depth (for message queues)
Why it matters: Saturation is your early warning system. If your database connection pool is at 85%, you know you need to act before it hits 100% and starts rejecting requests. Saturation metrics let you address problems before they become outages.
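The same threshold logic applies to every saturation metric. A hypothetical classifier using common 80%/95% thresholds (tune these per resource):

```python
def saturation_status(used, limit, warn_at=0.80, critical_at=0.95):
    """Classify a resource by how close it is to its hard limit."""
    utilisation = used / limit
    if utilisation >= critical_at:
        return "critical"
    if utilisation >= warn_at:
        return "warning"
    return "ok"

# The connection pool example from above: 85 of 100 connections in use.
print(saturation_status(85, 100))  # -> warning: act before the pool is exhausted
```

The point of the two tiers is lead time: "warning" buys you hours to add capacity calmly, while "critical" means degradation is imminent.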
The tool stack
Here is a standard open-source observability stack, widely used in production by teams of every size.
Prometheus + Grafana (metrics)
Prometheus collects and stores time-series metrics. You define what to scrape, how often, and what alert rules to apply. It is the de facto standard for Kubernetes and cloud-native monitoring.
Grafana visualises the data Prometheus collects. It provides dashboards, graphs, and alerting through a web interface. Grafana also supports dozens of other data sources, so it can serve as a unified dashboard for metrics, logs, and traces.
This combination is free, open-source, and used by companies of every size from startups to enterprises.
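A minimal prometheus.yml sketch, assuming a hypothetical service named api that exposes metrics on port 8000 at the conventional /metrics path:

```yaml
global:
  scrape_interval: 15s          # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "api"             # hypothetical service name
    static_configs:
      - targets: ["api:8000"]   # host:port exposing /metrics
```

In Kubernetes you would typically use service discovery instead of static targets, but the pull model is the same: Prometheus comes to your services, not the other way around.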
ELK Stack (logs)
Elasticsearch stores and indexes logs for fast searching. Logstash (or Fluent Bit / Fluentd) collects logs from your services and forwards them to Elasticsearch. Kibana provides the search interface and dashboards.
The ELK stack handles billions of log entries per day in production. It is powerful but resource-intensive -- Elasticsearch needs significant memory and CPU to operate efficiently.
Lighter alternative: Grafana Loki is a log aggregation system designed to be cost-effective and simple. It indexes only the metadata (labels) rather than the full log content, making it much cheaper to run. It integrates seamlessly with Grafana, giving you metrics and logs in the same interface.
Jaeger (traces)
Jaeger provides distributed tracing for microservices. Instrument your services with OpenTelemetry, send trace data to Jaeger, and use its web interface to visualise request flows, identify bottlenecks, and debug latency issues.
Jaeger is particularly valuable in microservices architectures where a single request crosses multiple service boundaries. Without tracing, debugging latency in these systems is guesswork.
Alerting best practices
Monitoring is only useful if it tells you about problems. But bad alerting is worse than no alerting -- it creates alert fatigue, where teams ignore alerts because most of them do not require action.
Alert on symptoms, not causes
Bad: Alert when CPU exceeds 80%. CPU can spike to 90% during normal traffic peaks without any user impact.
Good: Alert when p99 latency exceeds 500ms or error rate exceeds 1%. These directly indicate that users are having a bad experience.
Symptom-based alerting reduces noise because you only get alerted when users are actually affected.
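Translated into a Prometheus alerting rule, the "good" alert above might look like this (the metric name and runbook URL are illustrative placeholders):

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of requests returning 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m                 # must persist for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```

The for: clause is doing quiet but important work: a single bad scrape will not page anyone, only a sustained symptom will.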
Use severity levels
Not every alert needs the same response:
- Critical (page): Service is down or users are significantly impacted. Requires immediate response. Example: error rate above 5% for more than 2 minutes.
- Warning (notify): Something is degraded but not critical. Requires attention during business hours. Example: disk usage above 80%.
- Informational (log): Worth noting but does not require action. Example: deployment completed successfully.
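In Alertmanager, a severity label like the ones above is what drives routing. A sketch of the idea, with receiver names as placeholders:

```yaml
route:
  receiver: slack-warnings        # default: non-critical alerts notify Slack
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # critical alerts page the on-call engineer

receivers:
  - name: slack-warnings
    # ... Slack webhook configuration ...
  - name: pagerduty-oncall
    # ... PagerDuty integration configuration ...
```

This split is how "warning" alerts reach a channel someone reads in the morning while only "critical" alerts interrupt sleep.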
Require alert runbooks
Every alert should link to a runbook that describes: what the alert means, what to check first, and how to resolve common causes. When someone gets paged at 3am, they should not need to figure out what to do from scratch.
Review alerts regularly
Once a month, review all alerts that fired. For each alert, ask: Did this require human action? If not, the alert should be removed or tuned. Alert fatigue is real and dangerous -- teams that receive too many false alarms stop responding to real ones.
Building dashboards that help
A dashboard should answer a question at a glance. If you need to think about what a graph means, the dashboard needs work.
The four-dashboard pattern
Most teams need four dashboards:
- Service overview: Golden signals for all services on one screen. Green means healthy. Red means investigate. This is the dashboard on the office TV.
- Service detail: Deep dive into a single service. Request rate, latency percentiles, error breakdown, resource utilisation. This is the dashboard you open when the overview shows a problem.
- Infrastructure: Node-level metrics. CPU, memory, disk, network for every server or container. Useful for capacity planning and debugging resource-level issues.
- Business metrics: User signups, orders, revenue, or whatever your application's key business metrics are. This connects technical health to business outcomes.
Dashboard design principles
- Put the most important information at the top left. That is where eyes go first.
- Use consistent colours. Green for healthy, yellow for warning, red for critical. Do not make people learn a colour scheme.
- Show rates, not totals. "5 errors per second" is more useful than "18,000 errors today."
- Include time context. Show the last hour by default with the ability to zoom out. A latency graph without time context is meaningless.
- Fewer panels, more clarity. A dashboard with 50 graphs helps nobody. Start with 6-8 panels per dashboard and add only when needed.
Understanding monitoring and observability is essential for operating production systems reliably. It sits alongside infrastructure management, CI/CD pipelines, and container orchestration as a core DevOps competency. For the complete picture of the DevOps toolchain, see the DevOps tools guide. If you are building your DevOps career from scratch, the DevOps career path roadmap maps out where monitoring skills fit in the broader journey.
Ola
Founder, CloudPros
Building the most hands-on DevOps bootcamp for the AI era. 16 weeks of real infrastructure, real projects, real career outcomes.
