Observability

A system must expose its internal state to the outside

Observability is the ability to understand the internal state of a complex, distributed system solely by evaluating its external outputs (telemetry data). It goes beyond classic monitoring: the goal is not just to know THAT something is broken, but WHY.

In modern cloud architectures, Observability is a central means of understanding failures that occur across system boundaries.

Anti-Patterns: Flying Blind in the Cluster

Silo Monitoring: Each team has its own dashboard, but no one sees the complete path of a user request through the system.
Log Spam: Vast amounts of data are collected, but they are not searchable when something breaks or contain no relevant information.
Reactive Alerting: Alerts only fire once the system has already gone down, instead of detecting subtle degradations early (e.g. rising latency).

The Three Pillars of Observability

Metrics: Quantitative data over time (CPU load, request rates, error rates). Well suited for dashboards and alerts.
Logging: Detailed text events. Indispensable for the forensic analysis of individual incidents.
Distributed Tracing: Tracking a single user request across all involved microservices. Shows exactly where time is lost or an error originates.
OpenTelemetry (OTel): Using a vendor-neutral standard for capturing and transmitting telemetry data, avoiding lock-in with monitoring providers.
Service Level Indicators and Objectives (SLI/SLO): Focusing on the metrics that directly reflect the user experience (such as successful checkouts per minute). An SLO sets the target value for such a metric, and alerting is aligned to it rather than reacting to raw infrastructure signals.

The Benefit: Drastically Reduced MTTR

Mean Time To Recovery drops significantly because the system already provides engineers with the facts they need, rather than forcing them to go on a laborious search for clues.

FAQ

Doesn't collecting all this data create too much overhead?

Modern telemetry frameworks like OpenTelemetry are designed for low overhead. Actual cost depends on language, sampling rate, and cardinality. In most cases, the overhead of an unresolved outage far outweighs the cost of telemetry.

Do we now have to write tracing code for every function?

No. Modern frameworks offer Auto-Instrumentation. The base data flows automatically; manual detail is added only where it matters for business logic.

References

OpenTelemetry OpenTelemetry Documentation. Vendor-neutral standard for telemetry data (traces, metrics, logs). (2024). opentelemetry.io
Majors, Fong-Jones, Miranda Observability Engineering. The new era of monitoring for distributed systems. (2022). www.oreilly.com/library/view/observability-engineering/9781492076438/
Google SRE The Golden Signals. The four central metrics of any system: Latency, Traffic, Errors, Saturation. (2016). sre.google/sre-book/monitoring-distributed-systems/

Ask AI

These links open external AI services, the conversation and its content are sent to their providers.