
What is reliability (the fifth DORA metric)?

Reliability is a more recent addition to DORA's framework, sometimes called the fifth DORA metric. It measures, roughly, the share of recent deployments that were unplanned and made specifically to address a user-facing bug, anchoring the other four metrics to actual user experience. Elite performers score very low on this metric; their deploys are intentional, not corrective.

The four original DORA metrics — deployment frequency, lead time for changes, change failure rate, and mean time to recovery — measure the delivery system. They tell you how the pipeline behaves. They do not, on their own, tell you whether the system the pipeline delivers is actually reliable for its users. A team could ship constantly, ship quickly, rarely roll back, and recover fast, and still be running a product that users perceive as flaky.

The reliability metric (sometimes called "operational performance" in the State of DevOps reports) was added in part to close that gap. The DORA survey asks engineers and engineering leaders to estimate, of their recent deploys, the share that were not planned features but were performed specifically to address a user-facing bug. A low number means the deploy pipeline is mostly delivering intended work. A high number means a meaningful share of the team's delivery capacity is going to firefighting. The metric is downstream of all the others — it shows where the rest of the system's behavior actually lands for users.
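In code terms the calculation is simple once each deploy is labeled. The sketch below is illustrative only: the Deploy record and its unplanned_bug_fix flag are hypothetical, and in the survey context the labeling is an engineer's estimate rather than a field in a database.

```python
from dataclasses import dataclass

@dataclass
class Deploy:
    id: str
    unplanned_bug_fix: bool  # deployed specifically to fix a user-facing bug

def reliability_share(deploys: list[Deploy]) -> float:
    """Share of recent deploys that were unplanned fixes for user-facing bugs (lower is better)."""
    if not deploys:
        return 0.0
    corrective = sum(1 for d in deploys if d.unplanned_bug_fix)
    return corrective / len(deploys)

# Example: 2 of 10 recent deploys were corrective, so 20% of delivery capacity went to firefighting
recent = [Deploy(f"d{i}", unplanned_bug_fix=(i in {3, 7})) for i in range(10)]
print(f"{reliability_share(recent):.0%}")  # 20%
```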

Why a fifth metric was needed

The fundamental problem with the original four is that none of them measures user experience directly. Deployment frequency, lead time, and change failure rate are pipeline metrics. Mean time to recovery is an incident-response metric. All four assume the team has correctly identified "failure" — but as engineering leaders consistently report, what teams actually care about is customer-impacting incident minutes and SLO attainment, not the count of acknowledged incidents.

Reliability complements the other four by surfacing the failures that did affect users, however they were classified internally. A team with strong DORA scores on the original four and a high reliability number is delivering a system that is, in user terms, less reliable than the pipeline metrics suggest. The combination prompts a useful conversation.

The relationship to SLOs

Service Level Objectives (SLOs) and the DORA reliability metric are pointed at the same target from different angles. SLOs define a numerical reliability budget per service: "99.9% of requests should return successfully," or "99.5% of homepage loads should complete in under two seconds." A team that meets its SLOs is, by definition, delivering reliably according to its own published bar. A team that is burning through its error budget is delivering unreliably and should slow deploys, harden the system, or both.
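To make the budget concrete, here is the arithmetic for a 30-day window (a minimal sketch; the function name is ours, not part of any SLO tooling):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return round((1.0 - slo_target) * total_minutes, 1)

print(error_budget_minutes(0.999))  # 43.2 minutes of budget over 30 days
print(error_budget_minutes(0.995))  # 216.0 minutes over 30 days
```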

DORA's reliability metric is a coarser version of the same idea. Where SLOs measure each service against a per-service target, the DORA metric measures the team's deploys against the criterion "was this deploy a fix for a user-facing bug?" The two coexist comfortably: SLOs are the tool for operating the system day-to-day; the DORA metric is the tool for evaluating the delivery system in aggregate.

For deeper treatment of how SLOs work, see What are SLOs, SLIs, and SLAs?.

How most tools measure reliability today

The State of DevOps survey collects this metric by asking practitioners to estimate it. That works for research but does not work for ongoing measurement inside a single team. Vendors that ship reliability-style metrics typically derive them from incident counts: how many PagerDuty incidents in the last 30 days, weighted by severity, normalized by deploy count.
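Concretely, such a score usually has the shape sketched below. The severity weights and normalization here are illustrative, not any particular vendor's formula.

```python
# Illustrative only: weights and normalization are made up to show the shape of
# incident-count scoring, not taken from any specific vendor.
SEVERITY_WEIGHT = {"sev1": 5.0, "sev2": 2.0, "sev3": 1.0}

def incident_derived_score(incident_severities: list[str], deploy_count: int) -> float:
    """Severity-weighted incident count per deploy over a 30-day window (lower is better)."""
    weighted = sum(SEVERITY_WEIGHT.get(sev, 1.0) for sev in incident_severities)
    return weighted / max(deploy_count, 1)

# Two sev2 incidents and one sev3 across 25 deploys
print(incident_derived_score(["sev2", "sev2", "sev3"], deploy_count=25))  # 0.2
```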

This approach has two well-known weaknesses. First, it depends on incident tickets being filed promptly and severities being assigned accurately, which is widely understood to be inconsistent in practice — engineering leaders commonly augment quantitative DORA metrics with engineer-perception surveys precisely because the quantitative numbers do not capture how the team experiences reliability. Second, it counts incidents that crossed the team's threshold for filing a ticket but misses the longer tail of small-but-real degradations that users notice and silently work around.

A telemetry-grounded approach measures reliability from the same signals SLOs already use: error rates, latency percentiles, and successful-transaction counts on user-facing paths. Customer-impacting incident minutes — the total time during which a service was outside its SLO — is a more honest reliability proxy than the count of tickets filed. It is also harder to game, because the signal is the system's behavior rather than the team's ticket hygiene.
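A minimal sketch of that calculation, assuming you already have a per-minute success-rate SLI for a user-facing path (the sample values below are invented):

```python
def impact_minutes(per_minute_success_rate: list[float], slo_threshold: float) -> int:
    """Customer-impacting minutes: minutes in which the SLI fell below the SLO threshold."""
    return sum(1 for rate in per_minute_success_rate if rate < slo_threshold)

# 99.9% success-rate SLO; the service dips below it for three minutes during a bad deploy
samples = [0.9995, 0.9991, 0.9942, 0.9875, 0.9968, 0.9993]
print(impact_minutes(samples, slo_threshold=0.999))  # 3 customer-impacting minutes
```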

How Firetiger measures reliability

Firetiger reads each PR diff, generates a deployment-specific monitoring plan, watches the deployment across staging, canary, and production, detects regressions, and investigates root cause. The same Indicators that drive Change Monitor verdicts also serve as the inputs for reliability: when an Indicator is bound to an Objective (an SLO target), Firetiger can derive both the SLO attainment percentage and the per-deploy "was this an unplanned fix for a user-facing issue" classification.

A deploy that triggers a Change Monitor regression on a user-facing Indicator counts toward the reliability denominator as a failure. A subsequent deploy that restores the Indicator and was authored specifically to address that regression counts as an unplanned fix. Both classifications come from telemetry — the Indicator's behavior — rather than from human judgment about whether to file an incident.
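A rough sketch of that classification logic follows. The record shape and field names are hypothetical, not Firetiger's actual data model; the point is only that both numbers can be derived from Indicator behavior rather than from tickets.

```python
from dataclasses import dataclass

@dataclass
class DeployVerdict:
    # Hypothetical fields standing in for telemetry-derived verdicts
    deploy_id: str
    regressed_user_facing_indicator: bool  # a Change Monitor style regression was detected
    restores_prior_regression: bool        # authored specifically to fix an earlier regression

def classify(verdicts: list[DeployVerdict]) -> dict[str, float]:
    """Derive change-failure and unplanned-fix shares from telemetry verdicts alone."""
    total = len(verdicts) or 1
    failures = sum(v.regressed_user_facing_indicator for v in verdicts)
    unplanned_fixes = sum(v.restores_prior_regression for v in verdicts)
    return {
        "change_failure_rate": failures / total,
        "unplanned_fix_share": unplanned_fixes / total,  # the reliability number
    }
```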

This sidesteps the most common complaint about reliability scoring: that the number reflects ticket discipline more than system behavior. It also matches the framing engineering leaders use in practice — customer-impacting incident minutes and SLO attainment — rather than a derived score with weights nobody trusts.

Where to start

  • Define what user-facing reliability means. For most teams this is one or two Indicators per critical service (request success rate, p95 latency, end-to-end transaction completion). Write them down before measuring anything.
  • Set Objectives (SLOs) before scoring reliability. Without a target, "reliability" is just a number; with one, it is a budget you can spend and a question you can answer.
  • Prefer customer-impact minutes to incident counts. Minutes-outside-SLO is harder to game and more directly reflects user experience than the number of incidents filed.
  • Read reliability alongside change failure rate. The two are closely related but not identical. Reliability captures the user-experienced consequence; CFR captures the delivery system's behavior. Both matter.

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.