What is change failure rate?
Change failure rate (CFR) is the percentage of deployments to production that result in a failure requiring remediation, such as a rollback, a hotfix, or a patch. It is one of the four key metrics identified by Google's DORA (DevOps Research and Assessment) research program as a reliable indicator of software delivery performance, alongside deployment frequency, lead time for changes, and mean time to recovery.
The formula is straightforward: divide the number of failed deployments by the total number of deployments over a given period. If a team deploys 100 times in a month and 8 of those deployments cause incidents that require rollback or hotfix, the change failure rate is 8%. Elite-performing teams, according to DORA research, maintain change failure rates of 0-15%, while low-performing teams can see rates of 46-60%.
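The arithmetic can be expressed directly. This is a minimal sketch; the function and variable names are illustrative, not taken from any particular tool:

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return CFR as a percentage of deployments that required remediation."""
    if total_deployments == 0:
        return 0.0  # no deployments means nothing to attribute a failure to
    return 100.0 * failed_deployments / total_deployments

# The example from the text: 8 failures out of 100 deployments in a month.
print(change_failure_rate(8, 100))  # → 8.0
```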
What makes change failure rate particularly valuable as a metric is that it captures both reliability and velocity in a single number. A team that deploys infrequently and never breaks anything might appear to have a 0% CFR, but it is likely batching changes into large, risky releases. A team that deploys constantly and breaks things half the time looks fast, but its apparent velocity masks a serious reliability problem. CFR works best when read alongside deployment frequency: a low CFR combined with high deployment frequency is the goal.
How do you measure change failure rate accurately?
Measuring change failure rate seems simple, but doing it accurately is harder than most teams expect. The core challenge is reliably correlating deployment events with production incidents, which requires two things: a clear record of when deployments happened, and a clear definition of what constitutes a "failure."
Defining what counts as a failure. This is where teams most commonly struggle. The obvious failures are easy: a deployment causes a full outage, error rates spike to 50%, and the team rolls back. But what about a deployment that causes a 1% increase in 422 errors on a single endpoint? What about a deployment that degrades performance for a specific customer segment but not others? What about a deployment that works perfectly for 48 hours and then causes a memory leak that triggers an incident on day three?
One team experienced a deployment that introduced a subtle change to HTTP response handling. The change caused a small percentage of requests to return 422 errors instead of the expected 200 responses. The overall error rate stayed within the normal range that their alerting thresholds were configured for, because the absolute number of affected requests was small. The issue was not detected until customers began complaining through support channels. Was this a deployment failure? By any reasonable definition, yes. But their automated systems did not flag it as one.
This illustrates a critical point: "support blows up" is a lagging indicator, not a measurement. If your primary mechanism for detecting deployment failures is customer complaints, your measured change failure rate is an undercount. The actual CFR is higher than what you are tracking because you are only counting the failures that are severe enough or prolonged enough for users to report them.
Correlating deployments with incidents. Accurate CFR measurement requires linking each incident to the specific deployment that caused it. This is straightforward when an incident occurs within minutes of a deployment, but many failures are delayed. A deployment might introduce a race condition that only manifests under high concurrency, which might not occur until the next traffic peak. A database query regression might not become apparent until the table grows past a size threshold days later.
Mature organizations use deployment event records (timestamps, commit hashes, service names) and correlate them with incident records (start time, affected services, root cause). This correlation is often done manually during postmortems, which is fine for calculating historical CFR but does not help with real-time detection.
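A first-pass version of that correlation can be sketched as a temporal match on service and time window. The record shapes and field names below are assumptions for illustration, not any particular tool's schema, and the output should be treated as candidate suspects for postmortem review, not confirmed root causes:

```python
from datetime import datetime, timedelta

# Hypothetical record shapes; a real pipeline would pull these from a
# deployment log and an incident tracker.
deployments = [
    {"service": "checkout", "commit": "a1b2c3", "at": datetime(2024, 5, 1, 14, 0)},
    {"service": "search",   "commit": "d4e5f6", "at": datetime(2024, 5, 1, 14, 0)},
]
incidents = [
    {"service": "checkout", "started": datetime(2024, 5, 1, 14, 15)},
]

def correlate(incidents, deployments, window=timedelta(hours=24)):
    """Link each incident to the most recent prior deploy of the same service."""
    links = []
    for inc in incidents:
        candidates = [
            d for d in deployments
            if d["service"] == inc["service"]
            and d["at"] <= inc["started"] <= d["at"] + window
        ]
        if candidates:
            suspect = max(candidates, key=lambda d: d["at"])
            links.append((inc, suspect))
    return links

for inc, dep in correlate(incidents, deployments):
    print(f'{inc["service"]} incident <- deploy {dep["commit"]}')
```

Filtering on the affected service is what lets this sketch disambiguate the Service A / Service B case: only deploys of the incident's own service are considered candidates.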
Distinguishing deployment failures from environmental failures. Not every incident that occurs after a deployment is caused by the deployment. An upstream provider might have an outage. A cloud region might experience degraded network performance. A scheduled batch job might collide with the deployment window. Accurate CFR measurement requires investigation into root cause, not just temporal correlation. This means that CFR is inherently a lagging metric: you cannot calculate it accurately until postmortems are complete and root causes are established.
Tracking across different services and pipelines. In microservice architectures, a single "deployment" might involve changes to multiple services. If Service A and Service B are both deployed at 2:00 PM and an incident occurs at 2:15 PM, which deployment caused it? Accurate CFR requires granular tracking at the service level, not just a count of "deployments that happened today."
Firetiger measures change failure rate by correlating GitHub deployment events with production anomalies detected by its agents, giving teams an automated and continuous view of their CFR. For teams just starting to measure CFR, the practical advice is: start with a simple definition (any deployment followed by a rollback, hotfix, or incident within 24 hours) and refine from there. An imprecise measurement that you track consistently is more valuable than a perfect definition that you never implement.
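That simple starting definition can be implemented in a few lines. This sketch deliberately uses timestamps only, with no root-cause check, matching the "imprecise but consistent" advice above; all names are illustrative:

```python
from datetime import datetime, timedelta

def failed_deploys(deploy_times, remediation_times, window=timedelta(hours=24)):
    """Count deploys followed by a rollback, hotfix, or incident within the window.

    A deliberately naive starting rule: no root-cause analysis, just timestamps.
    """
    return sum(
        any(d <= r <= d + window for r in remediation_times)
        for d in deploy_times
    )

deploys = [datetime(2024, 5, day, 10) for day in range(1, 11)]  # 10 deploys
remediations = [datetime(2024, 5, 3, 15), datetime(2024, 5, 7, 2)]
failed = failed_deploys(deploys, remediations)
print(f"CFR: {100 * failed / len(deploys):.0f}%")  # → CFR: 20%
```

Note the known imprecision: a single remediation event can flag more than one deploy if their 24-hour windows overlap. That is acceptable for a baseline you refine later.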
How can you reduce change failure rate without slowing down?
The naive approach to reducing change failure rate is to deploy less often or to add more gates before deployment: more manual review, more testing stages, more approval steps. This works in the narrow sense that fewer deployments mean fewer failures, but it fails as a strategy because it sacrifices velocity, increases batch size (which raises the risk of each deployment), and creates organizational bottlenecks.
The better approach is to reduce CFR by improving the speed and accuracy of detection, not by reducing the number of changes. Here are the strategies that work.
Automated deployment verification. When every deployment is automatically checked against customer-facing health signals, failures are caught faster and with higher accuracy. Instead of relying on a human to notice a problem or waiting for a customer complaint, the system detects regressions in sign-up rates, API success rates, latency percentiles, and database performance within minutes of deployment. Faster detection does not prevent the failure from happening, but it dramatically reduces the impact, and reduced impact is what ultimately matters. A deployment that causes a 1% error increase for three minutes before being automatically rolled back is a very different kind of "failure" than one that runs for eight hours before someone notices.
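The core of such a check is a comparison of post-deploy signals against a pre-deploy baseline, with a per-deploy threshold tighter than any global alert. This is a minimal sketch under assumed thresholds, not a description of any specific product's verification logic:

```python
def verify_deployment(baseline_error_rate: float,
                      post_deploy_error_rate: float,
                      max_increase: float = 0.005) -> bool:
    """Pass the deployment only if the error rate did not rise by more than
    max_increase (absolute). The 0.5% threshold here is illustrative."""
    return (post_deploy_error_rate - baseline_error_rate) <= max_increase

# A ~1% jump in errors on one endpoint: too small to trip a global alert,
# but caught when compared against that endpoint's pre-deploy baseline.
print(verify_deployment(baseline_error_rate=0.002,
                        post_deploy_error_rate=0.012))  # → False
```

In practice the same comparison would run for each tracked signal (sign-up rate, API success rate, latency percentiles), with a rollback triggered on any failing check.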
Progressive rollouts limiting blast radius. When changes are rolled out to a small percentage of traffic first, failures affect fewer users. A canary deployment that sends 5% of traffic to the new version means that even a severe bug in the new code only affects 5% of users. If verification catches the problem during the canary phase, the overall impact is minimal. Progressive rollouts do not change the CFR numerator (the deployment still "failed"), but they change the severity of failure so dramatically that the practical effect is similar to preventing the failure entirely.
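The rollout-with-verification loop can be sketched as a sequence of traffic stages gated by a health check. The stage percentages and the `healthy` callback are placeholders for real canary configuration and verification:

```python
def rollout(stages=(5, 25, 50, 100), healthy=lambda pct: True) -> str:
    """Advance traffic through canary stages, halting if verification fails.

    `healthy` stands in for real post-deploy verification run at each stage.
    """
    for pct in stages:
        if not healthy(pct):
            return f"rolled back at {pct}% traffic"
    return "fully rolled out"

# A bug caught during the 5% canary never reaches the other 95% of users.
print(rollout(healthy=lambda pct: False))  # → rolled back at 5% traffic
print(rollout())                           # → fully rolled out
```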
Agent-driven monitoring tailored to each deployment. Static monitoring checks the same signals for every deployment, regardless of what changed. Agent-driven monitoring reads the code change, understands what it is supposed to do, and generates targeted verification checks for that specific deployment. This catches the category of subtle failures that generic monitoring misses: the 422 error on a specific endpoint, the performance regression in a particular query path, the feature that was supposed to improve latency but did not. For example, Firetiger generates monitoring plans by reading pull request descriptions and code diffs, then crafting specific expectations for what should change and what should not change after deployment. This kind of targeted verification catches failures that would otherwise slip through until customers report them.
Smaller, more frequent changes. This is counterintuitive but well-established: deploying more often with smaller changes typically reduces CFR rather than increasing it. Smaller changes are easier to review, easier to test, easier to understand, and easier to roll back. They also make root cause analysis trivial: if a deployment contains a single small change and something breaks, you know exactly what caused it. Large, batched deployments with dozens of changes make it nearly impossible to identify which change caused a regression without extensive investigation.
Faster detection equals smaller blast radius equals lower effective failure rate. This is the fundamental insight that connects detection speed to CFR. If your mean time to detect a deployment failure is 60 minutes, the average bad deployment affects users for 60 minutes. If you reduce detection time to 5 minutes, the same failure affects users for 5 minutes. The deployment still failed in the technical sense, but the impact, the thing you actually care about, is reduced by 92%. Some teams argue that a failure detected and rolled back within five minutes should not count as a CFR event at all, because the user impact is negligible. Whether you count it or not, the material effect is the same: investing in detection speed is one of the highest-leverage ways to improve deployment reliability.
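The 92% figure follows directly from the detection times; the arithmetic is just the relative reduction in exposure:

```python
def impact_reduction(old_minutes: float, new_minutes: float) -> float:
    """Percentage reduction in user-facing exposure from faster detection."""
    return 100.0 * (old_minutes - new_minutes) / old_minutes

# Cutting mean time to detect from 60 minutes to 5 minutes:
print(f"{impact_reduction(60, 5):.0f}%")  # → 92%
```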
Post-incident learning loops. Every deployment failure is an opportunity to improve the detection system. When a failure slips through verification, the postmortem should ask not just "what went wrong with the code" but "why did our verification not catch this?" Each answer becomes a new verification check or a refinement of an existing one. Over time, the verification system becomes increasingly comprehensive, catching categories of failures that it previously missed. Teams that consistently close this loop see their CFR decrease steadily over quarters even as their deployment frequency increases.
Where to start
- Define what counts as a "failure": Agree as a team whether you're counting outages only, or also including degradations, rollbacks, and hotfixes.
- Correlate deploy events with incidents: Connect your deployment pipeline to your incident tracker so you can automatically associate incidents with the deploys that caused them.
- Baseline your current CFR: Measure your change failure rate over the last 90 days before trying to improve it -- you need a starting point.
- Deploy outcome-oriented monitoring: Use a platform like Firetiger to continuously measure CFR by correlating deploys with production anomalies detected by AI agents.