Observability and Monitoring
At 2 AM, payments start failing. The dashboard is green, the on-call engineer is asleep, and the first sign anything is wrong is an angry customer tweet. That gap, between the moment a system breaks and the moment a human understands why, is exactly what observability and monitoring exist to close. Every minute a problem goes undetected costs money, trust, and sometimes the entire weekend of the person paged to fix it.
This category teaches you how to make a system tell you what it is doing. You will work through the three pillars that carry every observability stack: logs (what happened), metrics (how much and how often), and traces (where a request went across services). Around those pillars sit the practices that turn raw signal into action: choosing what to alert on, defining what "healthy" means with SLIs and SLOs, measuring how fast you detect and recover, and wiring it all together with tools like Prometheus, Grafana, OpenTelemetry, Jaeger, and Datadog.
Monitoring Versus Observability
Monitoring answers questions you already know to ask. You decide in advance that CPU above 90 percent is bad, you set a threshold, and you get paged when it crosses. It works well for known failure modes. The problem is that distributed systems fail in ways nobody predicted, and a dashboard built for yesterday's outage tells you nothing about today's.
Observability is the broader goal: building a system whose internal state you can understand from the outside, even for failures you never anticipated. The Observability Overview lesson frames the difference, and it comes down to one test. When something breaks in a way you have never seen, can you ask new questions of your data and get answers without shipping new code? If yes, your system is observable. If you have to add logging and redeploy to understand an incident, you only have monitoring.
The practical bridge between the two is the three pillars. Logs give you detailed, timestamped records of discrete events. Metrics give you cheap, aggregatable numbers you can graph and alert on. Traces stitch a single request together as it hops across services. None of the three alone is enough. Logs without metrics drown you in detail; metrics without traces tell you that latency rose but not which service caused it.
The Three Pillars in Detail
Logs are the most familiar pillar and the easiest to do badly. The lessons on Log Levels, Structured Logging, Log Aggregation, Centralized Logging, and Log Management cover the progression from plain text scattered across servers to structured JSON shipped into one searchable store. The single biggest upgrade most teams can make is structured logging: emit logs as key-value data, not prose, so you can filter on user ID or status code instead of grepping.
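As a rough illustration of the structured-logging idea, the sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The logger name `checkout`, the `JsonFormatter` and `build_logger` helpers, and the `user_id` and `status_code` fields are hypothetical names chosen for this example; a real service would more likely rely on a logging library or its platform's conventions.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        # Copy over the (hypothetical) fields this sketch cares about.
        for key in ("user_id", "status_code"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment declined", extra={"user_id": "u-123", "status_code": 402})
```

Because every line is a JSON object, a log store can filter on `status_code` or `user_id` as fields instead of pattern-matching free text.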
Metrics are numbers measured over time, and the type matters. A Counter only goes up (total requests served). A Gauge moves both ways (current memory in use). Histograms and Summaries capture distributions so you can report Percentiles, which is what actually matters for user experience. Average latency lies, because a handful of very slow requests vanish into the mean; the p99 tells you what the slowest one percent of requests look like. Time-Series Metrics, Metrics Collection, and Metrics Aggregation cover how these are stored and rolled up.
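To make the point about averages concrete, here is a small sketch that compares the mean with the p50 and p99 of a made-up latency sample. The `percentile` helper uses the simple nearest-rank definition, which is one of several in common use; metrics systems typically estimate percentiles from histogram buckets instead. The sample data is invented for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [50.0] * 98 + [4000.0] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))            # 129.0
    print("p50:", percentile(latencies_ms, 50))   # 50.0
    print("p99:", percentile(latencies_ms, 99))   # 4000.0
```

The mean of 129 ms describes no request that actually happened: most requests took 50 ms, and the unlucky ones took four seconds, which only the p99 reveals.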
Traces are the pillar that makes microservices debuggable. A single user action might touch a dozen services, and Distributed Tracing follows it through all of them. The mechanics are Trace IDs, Span IDs, Distributed Context Propagation, and Baggage, plus the closely related Request IDs and Correlation IDs that let you join logs from different services back to one request. Without trace context, a microservices outage is a guessing game across a dozen log streams.
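The propagation idea can be sketched in a few lines. The example below borrows the `traceparent` header layout from the W3C Trace Context recommendation (version, 32-hex-character trace ID, 16-hex-character parent span ID, flags) and shows a downstream service keeping the caller's trace ID while minting its own span ID. The function names are hypothetical, and in practice a tracing SDK such as OpenTelemetry would handle this for you.

```python
import secrets


def new_trace_id() -> str:
    """16 random bytes, rendered as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """8 random bytes, rendered as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a header value shaped like W3C traceparent, version 00."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> tuple[str, str, bool]:
    """Split a traceparent value into (trace_id, parent_span_id, sampled)."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = parts
    if version != "00" or len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return trace_id, parent_id, sampled


def outgoing_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Continue the caller's trace: same trace ID, fresh span ID."""
    trace_id, _parent_span_id, sampled = parse_traceparent(incoming["traceparent"])
    return {"traceparent": make_traceparent(trace_id, new_span_id(), sampled)}


if __name__ == "__main__":
    first_hop = {"traceparent": make_traceparent(new_trace_id(), new_span_id())}
    second_hop = outgoing_headers(first_hop)
    print(first_hop["traceparent"])
    print(second_hop["traceparent"])  # same trace ID, different span ID
```

If each service also writes the trace ID into its log lines, logs from every hop can be joined back to the one request that produced them.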
Defining and Measuring Reliability
You cannot improve what you have not defined. SLIs, SLOs, and SLAs are how teams put numbers on reliability. A Service Level Indicator is the raw measurement, like the percentage of requests served under 300 milliseconds. A Service Level Objective is the target you commit to internally, like 99.9 percent of requests under that bound over 30 days. A Service Level Agreement is the external, contractual version, usually with financial penalties attached. The SLA/SLO/SLI Overview lesson ties the three together so you stop using the terms interchangeably.
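The relationship between an SLI and an SLO fits in a few lines of code. This sketch computes the latency SLI from the example above (the share of requests served under 300 milliseconds) and compares it with a 99.9 percent objective. The function names and the sample request counts are made up for illustration.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """SLI: fraction of requests served strictly under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO check: is the measured SLI at or above the committed target?"""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical window: 10,000 requests, 8 of them slower than 300 ms.
    window = [120.0] * 9992 + [900.0] * 8
    sli = latency_sli(window)
    print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```

The SLI is what you measured, the SLO is the line you drew, and an SLA would be a separate, usually looser, line written into a contract.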
The most useful idea that falls out of SLOs is the Error Budget. If your SLO is 99.9 percent, you are allowed to fail 0.1 percent of the time, and that allowance is a budget you can spend on risk. Plenty of budget left means ship faster. Budget exhausted means freeze features and fix reliability. It turns reliability from an argument into a number both engineers and product managers can read.
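The arithmetic behind an error budget is short enough to sketch. Over a 30-day window, a 99.9 percent SLO leaves 0.1 percent of 43,200 minutes, or about 43 minutes, of allowed failure. The helpers below express the budget both in time and in failed requests; the function names and the traffic figures are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.
    A negative result means the budget is already exhausted."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")
    # Hypothetical month: 1,000,000 requests, 250 of them failed.
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

With a million requests in the window, the budget is 1,000 failures, so 250 failures leaves roughly three quarters of it to spend on releases and experiments.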
Four time-based metrics quantify how quickly you respond to incidents and how often they occur. MTTD (mean time to detect) measures how long until you notice, MTTA (mean time to acknowledge) how long until someone acknowledges the page, MTTR (mean time to resolve, also expanded as repair or recovery) how long until it is resolved, and MTBF (mean time between failures) how long the system runs between failures. Alongside Availability Metrics, Uptime Monitoring, and Downtime Tracking, these are the numbers leadership asks about after every outage.
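These averages can be computed directly from incident timestamps. The sketch below assumes each incident records four moments (start, detection, acknowledgement, resolution) and uses one reasonable set of definitions: MTTA is measured from detection, MTTR from the start of the failure, and MTBF from one incident's resolution to the next one's start. Teams define these intervals differently, so treat the `Incident` class and the choices here as illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime       # when the failure began
    detected: datetime      # when an alert fired or someone noticed
    acknowledged: datetime  # when a human acknowledged the page
    resolved: datetime      # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def response_metrics(incidents: list[Incident]) -> dict[str, timedelta]:
    """Average detection, acknowledgement, and resolution times,
    plus mean uptime between consecutive incidents when there are two or more."""
    if not incidents:
        raise ValueError("need at least one incident")
    ordered = sorted(incidents, key=lambda i: i.started)
    metrics = {
        "MTTD": _mean([i.detected - i.started for i in ordered]),
        "MTTA": _mean([i.acknowledged - i.detected for i in ordered]),
        "MTTR": _mean([i.resolved - i.started for i in ordered]),
    }
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    if gaps:
        metrics["MTBF"] = _mean(gaps)
    return metrics


if __name__ == "__main__":
    incidents = [
        Incident(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 10),
                 datetime(2024, 1, 1, 2, 15), datetime(2024, 1, 1, 3, 0)),
        Incident(datetime(2024, 1, 3, 3, 0), datetime(2024, 1, 3, 3, 20),
                 datetime(2024, 1, 3, 3, 25), datetime(2024, 1, 3, 3, 30)),
    ]
    for name, value in response_metrics(incidents).items():
        print(name, value)
```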
Knowing What to Watch and What to Alert On
Instrument everything and you will alert on nothing useful, because the noise buries the signal. The methodologies in this category exist to focus your attention. The Four Golden Signals (also taught as Golden Signals) are latency, traffic, errors, and saturation, the four numbers Google's SRE practice watches first on any service. The RED Method (Rate, Errors, Duration) is tuned for request-driven services, while the USE Method (Utilization, Saturation, Errors) is tuned for resources like CPU, disk, and memory.
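As a minimal sketch of the RED Method, the function below reduces a window of request records to the three numbers it names. The `Request` record, the choice to count only 5xx responses as errors, and the use of the median for duration are assumptions made for this example; real dashboards usually track several duration percentiles.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Rate, Errors, Duration for one service over one time window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "duration_median_ms": median(r.duration_ms for r in requests),
    }


if __name__ == "__main__":
    # Hypothetical one-minute window: 57 healthy requests and 3 server errors.
    window = [Request(100.0, 200)] * 57 + [Request(900.0, 500)] * 3
    print(red_summary(window, window_seconds=60))
```

The USE Method follows the same shape but is computed per resource (a CPU, a disk, a connection pool) instead of per request stream.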
User-facing measurement gets its own tools. APM (application performance monitoring) watches your application code end to end, RUM (real user monitoring) captures what real users actually experience in their browsers, and Synthetic Monitoring runs scripted checks against your endpoints around the clock so you catch breakage before a real user does. The Apdex Score rolls user-perceived performance into a single 0-to-1 number that non-engineers can track.
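The Apdex calculation itself is simple. Given a target response time T, responses at or under T count as satisfied, responses between T and 4T count as tolerating and are worth half, and anything slower counts as frustrated and is worth nothing. The sketch below assumes a threshold of 0.5 seconds and invented sample data; the right threshold depends on the application.

```python
def apdex(response_times_s: list[float], threshold_s: float = 0.5) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("no samples")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


if __name__ == "__main__":
    # Hypothetical sample: 60 fast, 30 tolerable, 10 frustrating responses.
    samples = [0.2] * 60 + [1.0] * 30 + [3.0] * 10
    print(apdex(samples))  # 0.75
```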
Good Alerting is the discipline that ties this together. Alert on symptoms users feel, not on every internal metric. Page a human only for things that need a human right now; everything else goes to a dashboard or a ticket. The lessons on Infrastructure Monitoring, Network Monitoring, and Database Monitoring round out coverage of the layers beneath your application.
The Tooling Landscape
Theory becomes practice through tools, and this category covers the ones you will meet in real jobs. For metrics, Prometheus is the de facto open-source standard for collection and storage, almost always paired with Grafana for dashboards. StatsD and Telegraf are common metric collectors that feed into these systems.
For tracing, Jaeger and Zipkin are the leading open-source backends, and OpenTelemetry is the vendor-neutral standard that increasingly feeds all of them. OpenTelemetry matters because it lets you instrument your code once and send the data to whichever backend you choose, so you are not locked into one vendor's agent. For logs, the ELK Stack (Elasticsearch, Logstash, Kibana) plus shippers like Fluentd and Vector form the classic pipeline, while Splunk is the heavyweight enterprise option.
On the commercial side, Datadog, New Relic, AppDynamics, and Dynatrace bundle metrics, logs, and traces into one platform with less setup, at higher cost. The real-world pattern at most companies is a mix: open-source for high-volume signals where cost matters, a commercial platform where speed of setup and integrated dashboards matter more. The skill is not memorizing tools but knowing which signal each one is good at and how they fit together.