SRE (Site Reliability Engineering)

Reliability is measured and budgeted, not hoped for

SRE (Site Reliability Engineering) is a discipline developed at Google that applies software engineering principles to IT operations and reliability problems. Google looks for engineers with rare systems and networking skills, but in practice SREs come from diverse backgrounds. SRE balances reliability against the speed of change and innovation by making the tradeoff explicit through SLOs and error budgets.

SRE addresses a fundamental tension: development teams want to deploy quickly, while operations teams want stability. The Error Budget gives both sides a shared framework.

Core concepts

SLI (Service Level Indicator): A measurable reliability metric (e.g. request success rate, latency).
SLO (Service Level Objective): A concrete target for an SLI (e.g. "99.9% of all requests under 200 ms").
Error Budget: The permitted margin of failure implied by the SLO (100% minus the target). For a 99.9% time-based SLO over a 30-day month, that is about 43 minutes of downtime (0.1% of 43,200 minutes); for a request-based SLO, the budget is the share of requests allowed to fail. When the budget is exhausted, the agreed policy applies: release slowdown, reliability sprint, or management exception review.
Toil: Manual, repetitive, automatable work in operations. SREs actively measure and reduce toil. Target: below 50% of working time.
Postmortem: Blameless, structured analysis after significant incidents above the team's severity threshold (see Post-Mortem).
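
The Error Budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function names, the 30-day default window and the rounding choice are assumptions made for the example.

```python
import math


def time_based_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def request_based_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail for a request-based SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    # Round first to absorb floating-point noise, then round down so the
    # budget is never overstated (an assumption of this sketch).
    return math.floor(round((1.0 - slo) * total_requests, 6))


if __name__ == "__main__":
    print(time_based_budget_minutes(0.999))        # time-based, 30-day window
    print(request_based_budget(0.999, 1_000_000))  # request-based
```

For a 99.9% SLO the first call gives the roughly 43 minutes mentioned above; the second shows how the same target translates into a count of failed requests instead of downtime.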

SRE vs. DevOps

SRE is a specific implementation of DevOps principles. DevOps is a culture and philosophy; SRE is a concrete role and toolset with defined practices.

Focus: Reliability as a Feature

Reliability is not a technical side topic but a business objective. SRE makes the cost of unreliability visible and creates incentives for sustainable systems.

Error Budget as Moderation Tool

When the Error Budget is exhausted, the agreed policy rules apply. Typical options:

Release slowdown or freeze of new features.
Priority shifts to reliability work.
Management exception review for critical releases.
Return to normal development pace once the budget has recovered.

The budget is a negotiation tool, not an accusation. It makes the business's risk tolerance explicit.
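
How such a policy might be encoded can be sketched as follows. The action names and the 25% slowdown threshold are hypothetical; a real policy is whatever the teams and the business have agreed on.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal development pace"
    SLOWDOWN = "release slowdown, priority shifts to reliability work"
    FREEZE = "feature freeze, critical releases via management exception review"


def budget_remaining(allowed_failures: float, observed_failures: float) -> float:
    """Fraction of the Error Budget still unspent (negative if overspent)."""
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    return 1.0 - observed_failures / allowed_failures


def policy_action(remaining: float, slowdown_threshold: float = 0.25) -> Action:
    """Map the remaining budget to an action.

    slowdown_threshold is a hypothetical value chosen for illustration.
    """
    if remaining <= 0.0:
        return Action.FREEZE
    if remaining < slowdown_threshold:
        return Action.SLOWDOWN
    return Action.NORMAL


if __name__ == "__main__":
    # 1,000 failures allowed, 900 observed: little budget left.
    remaining = budget_remaining(allowed_failures=1000, observed_failures=900)
    print(policy_action(remaining).value)
```

Keeping the thresholds as explicit parameters reflects the point that the budget is negotiated: changing the business's risk tolerance means changing a number, not arguing about blame.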

FAQ

Do we need a dedicated SRE team?

Not necessarily. In smaller organisations, development teams can adopt SRE practices themselves (embedded SRE). Dedicated SRE teams make sense once system scale and reliability demands require full-time specialisation.

How do we start with SRE without Google's scale?

Start with one SLO for the most important service. Measure it. Discuss the Error Budget. That is enough for a first step.
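
A first measurement can be as simple as the sketch below. It assumes request records with a success flag and a latency are already available (for example from access logs); the Request type, the field names and the 200 ms threshold are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    ok: bool           # did the request succeed?
    latency_ms: float  # how long did it take?


def sli_good_ratio(requests: Sequence[Request],
                   latency_threshold_ms: float = 200.0) -> float:
    """SLI: share of requests that succeeded and stayed under the threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests
               if r.ok and r.latency_ms < latency_threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, slo: float = 0.999) -> bool:
    """Compare the measured SLI against the SLO target."""
    return sli >= slo


if __name__ == "__main__":
    window = [
        Request(ok=True, latency_ms=120.0),
        Request(ok=True, latency_ms=250.0),   # too slow, counts as bad
        Request(ok=False, latency_ms=90.0),   # failed, counts as bad
        Request(ok=True, latency_ms=80.0),
    ]
    sli = sli_good_ratio(window)
    print(sli, meets_slo(sli))
```

With one such number per measurement window, the team has something concrete to compare against the target and to discuss in terms of the Error Budget.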

References

SLODLC. SLO Adoption Guide. Guide for incremental SLO adoption. (2023). www.slodlc.com
Google. The SRE Workbook. Practical examples and implementation guides. (2018). sre.google/workbook/table-of-contents/
Google. Site Reliability Engineering Book. Foundational text, available for free online. (2016). sre.google/books/
