Monitoring and Observability 101: Why It Matters
Monitoring tells you when something is broken. Observability tells you why it is broken. You need both, and every production system that serves real users needs them from day one -- not as an afterthought bolted on after the first outage.
The distinction matters because modern systems are complex. A single user request might pass through a load balancer, an API gateway, three microservices, two databases, and a cache before returning a response. When that request fails or slows down, monitoring tells you something is wrong. Observability tells you which service, which function, which database query, and why.
This guide covers the foundations: what to monitor, what tools to use, how to build dashboards that actually help, and how to set up alerting that does not wake you up at 3am for problems that do not matter. For the broader context of how monitoring fits into the DevOps toolchain, see the complete DevOps tools guide.
Monitoring vs observability
These terms are often used interchangeably, but they describe different things.
Monitoring
Monitoring is the practice of collecting predefined measurements from your systems and alerting when those measurements cross a threshold. It answers questions you have already thought of:
- Is the server running?
- Is CPU usage above 80%?
- Are more than 1% of requests returning errors?
- Is the database running out of connections?
Monitoring is reactive. You define what to watch, set thresholds, and wait for alerts. It works well for known problems -- the ones you have already experienced or anticipated.
Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. It answers questions you have not thought of yet:
- Why are requests from users in Europe 3x slower than requests from the US?
- Which specific database query is causing the latency spike?
- Why did the error rate increase after the last deployment, but only for authenticated users?
Observability is proactive. Instead of waiting for predefined alerts, you can explore your system's behaviour, follow a trail of evidence, and diagnose novel problems. You achieve observability by instrumenting your systems to produce rich, structured data -- not just "is this metric above a threshold?"
You need both
Monitoring catches known problems fast. Observability lets you diagnose unknown problems. A mature system has monitoring for automated alerting and observability for human investigation. Start with monitoring. Add observability as your system grows in complexity.
The three pillars of observability
Observability rests on three types of telemetry data. Each one provides a different lens for understanding system behaviour. Together, they give you the complete picture.
Pillar 1: Metrics
Metrics are numerical measurements collected at regular intervals. They are the most efficient type of telemetry data -- small in size, fast to query, and easy to aggregate.
Examples of metrics:
- CPU utilisation: 72%
- Request latency (p99): 245ms
- Active database connections: 18/100
- HTTP 5xx error count: 4 in the last minute
- Memory usage: 3.2GB / 8GB
Metrics are best for dashboards, alerting, and trend analysis. They tell you what is happening at a high level. When you see a latency spike on a dashboard, that is a metric telling you something changed.
Key tool: Prometheus is the industry standard for metrics collection. It uses a pull-based model where it scrapes metrics from your services at regular intervals. It stores time-series data efficiently and supports a powerful query language called PromQL.
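As a taste of PromQL, here are two common queries. They assume your services export a conventional counter named http_requests_total and a histogram named http_request_duration_seconds; the metric names are illustrative conventions, not anything built into Prometheus itself:

```
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# p99 latency, computed from the histogram's buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

The rate() function is how you turn an ever-increasing counter into a per-second rate, and histogram_quantile() is how you get percentiles out of a histogram.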
Pillar 2: Logs
Logs are timestamped text records of discrete events. Every time something happens in your application -- a user logs in, a request fails, a background job completes -- a log entry records it.
Examples of log entries:
2024-10-08T14:23:01Z INFO [api] User login successful user_id=abc123 ip=192.168.1.1
2024-10-08T14:23:02Z ERROR [payment] Payment processing failed order_id=xyz789 error="timeout"
2024-10-08T14:23:05Z WARN [database] Connection pool at 90% capacity pool_size=90/100
Logs are best for detailed investigation. When metrics tell you something is wrong, logs tell you what specifically happened. They provide the narrative context that metrics lack.
Structured vs unstructured logs: Unstructured logs are plain text. Structured logs use a consistent format (usually JSON) with defined fields. Structured logs are dramatically easier to search, filter, and analyse. Always use structured logging in production.
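As a minimal sketch of structured logging using only the Python standard library (the logger name and fields like order_id are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": ...}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"fields": {"order_id": "xyz789", "error": "timeout"}})
```

Every entry now has the same queryable shape, so a log backend can filter on fields like order_id directly instead of grepping free text.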
Key tools: The ELK stack (Elasticsearch, Logstash, Kibana) is the most widely used log management solution. Elasticsearch stores and indexes the logs. Logstash (or Fluentd/Fluent Bit) collects and transforms them. Kibana provides the search interface and visualisation. Alternatives include Loki (from the Grafana team), which integrates neatly with Prometheus and Grafana.
Pillar 3: Traces
Traces follow a single request as it moves through your system. In a microservices architecture, one user request might touch five or ten different services. A trace records every hop, showing you exactly which service handled the request, how long each step took, and where failures or slowdowns occurred.
Example of a trace:
Request: GET /api/orders/789
-> API Gateway (2ms)
-> Auth Service (15ms)
-> Order Service (120ms)
   -> Database Query (95ms) <-- bottleneck
-> Cache Lookup (3ms)
-> Notification Service (8ms)
Total: 148ms
Without tracing, you would know the request took 148ms but not which service contributed the most latency. With tracing, you can see immediately that the database query in the Order Service is the bottleneck.
Key tools: Jaeger and Zipkin are the most popular open-source distributed tracing tools. Both support the OpenTelemetry standard, which is the emerging industry standard for telemetry collection. If you are starting fresh, use OpenTelemetry for instrumentation and Jaeger for the tracing backend.
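To make the span concept concrete, here is a toy, single-process sketch using only the Python standard library. A real system would use the OpenTelemetry SDK, which also propagates trace context across service boundaries; this sketch only shows the core idea of timing named steps:

```python
import time
from contextlib import contextmanager

spans = []  # (name, duration_ms), appended as each step finishes

@contextmanager
def span(name):
    """Time one named step of a request -- a stripped-down trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("order-service"):
    with span("database-query"):
        time.sleep(0.01)          # stand-in for a slow query
    with span("cache-lookup"):
        pass

for name, ms in spans:
    print(f"{name}: {ms:.1f}ms")  # database-query dominates order-service's time
```

Even in this toy version, the nested timings show which inner step consumed the parent's time; distributed tracing does the same across process and network boundaries.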
The four golden signals
Google's Site Reliability Engineering book defines four metrics that every service should track. If you monitor nothing else, monitor these.
1. Latency
How long requests take to process. Track successful and failed requests separately -- a fast error is different from a slow success.
What to track:
- p50 (median): What most users experience
- p95: What the slowest 5% of users experience
- p99: What the slowest 1% of users experience
- Error latency: How quickly failed requests return
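Percentiles are simple to compute: sort the samples and pick the value at the right rank. A minimal nearest-rank sketch (the sample latencies are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p% of all samples at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Ten request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 900]
print(percentile(latencies_ms, 50))  # -> 25  (the typical experience)
print(percentile(latencies_ms, 99))  # -> 900 (the tail an average would blur)
```

This is also why percentiles beat averages for latency: the mean of this sample is 124.7ms, a number that describes nobody's actual experience.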
Why it matters: Users notice latency before they notice anything else. A 100ms increase in page load time can reduce conversion rates by 7%. Latency is your most user-facing metric.
2. Traffic
How much demand is being placed on your system. For a web service, this is typically HTTP requests per second. For a message queue, it is messages per second. For a database, it is queries per second.
What to track:
- Requests per second (overall and per endpoint)
- Requests by status code (2xx, 4xx, 5xx)
- Requests by method (GET, POST, PUT, DELETE)
Why it matters: Traffic patterns reveal usage patterns, help with capacity planning, and provide context for other signals. A latency spike during a traffic spike means something different from a latency spike during normal traffic.
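Monitoring systems derive requests per second from an ever-increasing counter rather than from raw totals. The helper below (a hypothetical name, not a library function) is a miniature of what PromQL's rate() computes:

```python
def per_second_rate(samples):
    """Turn two (timestamp, cumulative_count) counter samples into a rate.

    Counters only ever go up, so the per-second rate is simply the
    delta in the count divided by the width of the time window.
    """
    (t1, c1), (t2, c2) = samples
    return (c2 - c1) / (t2 - t1)

# Two scrapes 15 seconds apart: the counter went from 1200 to 1650 requests.
print(per_second_rate([(100, 1200), (115, 1650)]))  # -> 30.0 requests/second
```

Rates are what you graph and alert on; the raw cumulative total is only the bookkeeping underneath.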
3. Errors
The rate of requests that fail. This includes explicit failures (HTTP 5xx responses), implicit failures (HTTP 200 with wrong content), and policy failures (responses slower than a committed SLA).
What to track:
- Error rate (errors / total requests)
- Errors by type (5xx, timeouts, application-specific errors)
- Error rate by endpoint (some endpoints may be failing while others are fine)
Why it matters: Error rate is your most direct signal of user impact. A 5% error rate means 1 in 20 users is having a bad experience. Detect errors fast and you can fix them before most users are affected.
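A sketch of computing error rate per endpoint from raw (endpoint, status code) observations; in practice your metrics system aggregates this for you, but the arithmetic is this simple:

```python
from collections import defaultdict

def error_rates(requests):
    """Compute the 5xx error rate per endpoint from (endpoint, status) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, status in requests:
        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}

observed = [("/api/orders", 200), ("/api/orders", 500),
            ("/api/orders", 200), ("/api/orders", 200),
            ("/api/users", 200), ("/api/users", 200)]
print(error_rates(observed))  # /api/orders is failing 25% of the time
```

Breaking the rate down per endpoint is what reveals that one endpoint is on fire while the overall error rate still looks tolerable.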
4. Saturation
How full your resources are. CPU, memory, disk, network bandwidth, database connections -- every resource has a limit, and performance degrades as you approach that limit.
What to track:
- CPU utilisation (per instance and aggregate)
- Memory usage (used, cached, available)
- Disk usage and I/O
- Network bandwidth
- Database connection pool utilisation
- Queue depth (for message queues)
Why it matters: Saturation is your early warning system. If your database connection pool is at 85%, you know you need to act before it hits 100% and starts rejecting requests. Saturation metrics let you address problems before they become outages.
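The same threshold logic applies to every saturation metric. A hypothetical classifier using common 80%/95% thresholds (tune these per resource):

```python
def saturation_status(used, limit, warn_at=0.80, critical_at=0.95):
    """Classify a resource by how close it is to its hard limit."""
    utilisation = used / limit
    if utilisation >= critical_at:
        return "critical"
    if utilisation >= warn_at:
        return "warning"
    return "ok"

# The connection pool example from above: 85 of 100 connections in use.
print(saturation_status(85, 100))  # -> warning: act before the pool is exhausted
```

The point of the two tiers is lead time: "warning" buys you hours to add capacity calmly, while "critical" means degradation is imminent.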
The tool stack
Here is a standard open-source observability stack, widely used in production by teams of every size.
Prometheus + Grafana (metrics)
Prometheus collects and stores time-series metrics. You define what to scrape, how often, and what alert rules to apply. It is the de facto standard for Kubernetes and cloud-native monitoring.
Grafana visualises the data Prometheus collects. It provides dashboards, graphs, and alerting through a web interface. Grafana also supports dozens of other data sources, so it can serve as a unified dashboard for metrics, logs, and traces.
This combination is free, open-source, and used by companies of every size from startups to enterprises.
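A minimal prometheus.yml sketch, assuming a hypothetical service named api that exposes metrics on port 8000 at the conventional /metrics path:

```yaml
global:
  scrape_interval: 15s          # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "api"             # hypothetical service name
    static_configs:
      - targets: ["api:8000"]   # host:port exposing /metrics
```

In Kubernetes you would typically use service discovery instead of static targets, but the pull model is the same: Prometheus comes to your services, not the other way around.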
ELK Stack (logs)
Elasticsearch stores and indexes logs for fast searching. Logstash (or Fluent Bit / Fluentd) collects logs from your services and forwards them to Elasticsearch. Kibana provides the search interface and dashboards.
The ELK stack handles billions of log entries per day in production. It is powerful but resource-intensive -- Elasticsearch needs significant memory and CPU to operate efficiently.
Lighter alternative: Grafana Loki is a log aggregation system designed to be cost-effective and simple. It indexes only the metadata (labels) rather than the full log content, making it much cheaper to run. It integrates seamlessly with Grafana, giving you metrics and logs in the same interface.
Jaeger (traces)
Jaeger provides distributed tracing for microservices. Instrument your services with OpenTelemetry, send trace data to Jaeger, and use its web interface to visualise request flows, identify bottlenecks, and debug latency issues.
Jaeger is particularly valuable in microservices architectures where a single request crosses multiple service boundaries. Without tracing, debugging latency in these systems is guesswork.
Alerting best practices
Monitoring is only useful if it tells you about problems. But bad alerting is worse than no alerting -- it creates alert fatigue, where teams ignore alerts because most of them do not require action.
Alert on symptoms, not causes
Bad: Alert when CPU exceeds 80%. CPU can spike to 90% during normal traffic peaks without any user impact.
Good: Alert when p99 latency exceeds 500ms or error rate exceeds 1%. These directly indicate that users are having a bad experience.
Symptom-based alerting reduces noise because you only get alerted when users are actually affected.
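Translated into a Prometheus alerting rule, the "good" alert above might look like this (the metric name and runbook URL are illustrative placeholders):

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of requests returning 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m                 # must persist for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```

The for: clause is doing quiet but important work: a single bad scrape will not page anyone, only a sustained symptom will.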
Use severity levels
Not every alert needs the same response:
- Critical (page): Service is down or users are significantly impacted. Requires immediate response. Example: error rate above 5% for more than 2 minutes.
- Warning (notify): Something is degraded but not critical. Requires attention during business hours. Example: disk usage above 80%.
- Informational (log): Worth noting but does not require action. Example: deployment completed successfully.
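In Alertmanager, a severity label like the ones above is what drives routing. A sketch of the idea, with receiver names as placeholders:

```yaml
route:
  receiver: slack-warnings        # default: non-critical alerts notify Slack
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # critical alerts page the on-call engineer

receivers:
  - name: slack-warnings
    # ... Slack webhook configuration ...
  - name: pagerduty-oncall
    # ... PagerDuty integration configuration ...
```

This split is how "warning" alerts reach a channel someone reads in the morning while only "critical" alerts interrupt sleep.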
Require alert runbooks
Every alert should link to a runbook that describes: what the alert means, what to check first, and how to resolve common causes. When someone gets paged at 3am, they should not need to figure out what to do from scratch.
Review alerts regularly
Once a month, review all alerts that fired. For each alert, ask: Did this require human action? If not, the alert should be removed or tuned. Alert fatigue is real and dangerous -- teams that receive too many false alarms stop responding to real ones.
Building dashboards that help
A dashboard should answer a question at a glance. If you need to think about what a graph means, the dashboard needs work.
The four-dashboard pattern
Most teams need four dashboards:
- Service overview: Golden signals for all services on one screen. Green means healthy. Red means investigate. This is the dashboard on the office TV.
- Service detail: Deep dive into a single service. Request rate, latency percentiles, error breakdown, resource utilisation. This is the dashboard you open when the overview shows a problem.
- Infrastructure: Node-level metrics. CPU, memory, disk, network for every server or container. Useful for capacity planning and debugging resource-level issues.
- Business metrics: User signups, orders, revenue, or whatever your application's key business metrics are. This connects technical health to business outcomes.
Dashboard design principles
- Put the most important information at the top left. That is where eyes go first.
- Use consistent colours. Green for healthy, yellow for warning, red for critical. Do not make people learn a colour scheme.
- Show rates, not totals. "5 errors per second" is more useful than "18,000 errors today."
- Include time context. Show the last hour by default with the ability to zoom out. A latency graph without time context is meaningless.
- Fewer panels, more clarity. A dashboard with 50 graphs helps nobody. Start with 6-8 panels per dashboard and add only when needed.
Understanding monitoring and observability is essential for operating production systems reliably. It sits alongside infrastructure management, CI/CD pipelines, and container orchestration as a core DevOps competency. For the complete picture of the DevOps toolchain, see the DevOps tools guide. If you are building your DevOps career from scratch, the DevOps career path roadmap maps out where monitoring skills fit in the broader journey.
Ola
Founder, CloudPros
Building the most hands-on DevOps bootcamp for the AI era. 16 weeks of real infrastructure, real projects, real career outcomes.
