What is change failure rate?
Change failure rate (CFR) is the percentage of deployments to production that result in a failure requiring remediation, such as a rollback, a hotfix, or a patch. It is one of the four key metrics identified by Google's DORA (DevOps Research and Assessment) research program as a reliable indicator of software delivery performance, alongside deployment frequency, lead time for changes, and mean time to recovery.
The formula is straightforward: divide the number of failed deployments by the total number of deployments over a given period. If a team deploys 100 times in a month and 8 of those deployments cause incidents that require rollback or hotfix, the change failure rate is 8%. Elite-performing teams, according to DORA research, maintain change failure rates of 0-15%, while low-performing teams can see rates of 46-60%.
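The arithmetic can be expressed directly. This is a minimal sketch; the function and variable names are illustrative, not taken from any particular tool:

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return CFR as a percentage of deployments that required remediation."""
    if total_deployments == 0:
        return 0.0  # no deployments means nothing to attribute a failure to
    return 100.0 * failed_deployments / total_deployments

# The example from the text: 8 failures out of 100 deployments in a month.
print(change_failure_rate(8, 100))  # → 8.0
```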
What makes change failure rate particularly valuable as a metric is that it captures both reliability and velocity in a single number. A team that deploys infrequently and never breaks anything might appear to have a 0% CFR, but it is likely batching changes into large, risky releases. A team that deploys constantly and breaks things half the time looks fast, but its apparent velocity masks a serious reliability problem. CFR works best when read alongside deployment frequency: a low CFR combined with high deployment frequency is the goal.
How do you measure change failure rate accurately?
Measuring change failure rate seems simple, but doing it accurately is harder than most teams expect. The core challenge is reliably correlating deployment events with production incidents, which requires two things: a clear record of when deployments happened, and a clear definition of what constitutes a "failure."
Defining what counts as a failure. This is where teams most commonly struggle. The obvious failures are easy: a deployment causes a full outage, error rates spike to 50%, and the team rolls back. But what about a deployment that causes a 1% increase in 422 errors on a single endpoint? What about a deployment that degrades performance for a specific customer segment but not others? What about a deployment that works perfectly for 48 hours and then causes a memory leak that triggers an incident on day three?
One team experienced a deployment that introduced a subtle change to HTTP response handling. The change caused a small percentage of requests to return 422 errors instead of the expected 200 responses. The overall error rate stayed within the normal range that their alerting thresholds were configured for, because the absolute number of affected requests was small. The issue was not detected until customers began complaining through support channels. Was this a deployment failure? By any reasonable definition, yes. But their automated systems did not flag it as one.
This illustrates a critical point: "support blows up" is a lagging indicator, not a measurement. If your primary mechanism for detecting deployment failures is customer complaints, your measured change failure rate is an undercount. The actual CFR is higher than what you are tracking because you are only counting the failures that are severe enough or prolonged enough for users to report them.
Correlating deployments with incidents. Accurate CFR measurement requires linking each incident to the specific deployment that caused it. This is straightforward when an incident occurs within minutes of a deployment, but many failures are delayed. A deployment might introduce a race condition that only manifests under high concurrency, which might not occur until the next traffic peak. A database query regression might not become apparent until the table grows past a size threshold days later.
Mature organizations use deployment event records (timestamps, commit hashes, service names) and correlate them with incident records (start time, affected services, root cause). This correlation is often done manually during postmortems, which is fine for calculating historical CFR but does not help with real-time detection.
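A first-pass version of that correlation can be sketched as a temporal match on service and time window. The record shapes and field names below are assumptions for illustration, not any particular tool's schema, and the output should be treated as candidate suspects for postmortem review, not confirmed root causes:

```python
from datetime import datetime, timedelta

# Hypothetical record shapes; a real pipeline would pull these from a
# deployment log and an incident tracker.
deployments = [
    {"service": "checkout", "commit": "a1b2c3", "at": datetime(2024, 5, 1, 14, 0)},
    {"service": "search",   "commit": "d4e5f6", "at": datetime(2024, 5, 1, 14, 0)},
]
incidents = [
    {"service": "checkout", "started": datetime(2024, 5, 1, 14, 15)},
]

def correlate(incidents, deployments, window=timedelta(hours=24)):
    """Link each incident to the most recent prior deploy of the same service."""
    links = []
    for inc in incidents:
        candidates = [
            d for d in deployments
            if d["service"] == inc["service"]
            and d["at"] <= inc["started"] <= d["at"] + window
        ]
        if candidates:
            suspect = max(candidates, key=lambda d: d["at"])
            links.append((inc, suspect))
    return links

for inc, dep in correlate(incidents, deployments):
    print(f'{inc["service"]} incident <- deploy {dep["commit"]}')
```

Filtering on the affected service is what lets this sketch disambiguate the Service A / Service B case: only deploys of the incident's own service are considered candidates.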
Distinguishing deployment failures from environmental failures. Not every incident that occurs after a deployment is caused by the deployment. An upstream provider might have an outage. A cloud region might experience degraded network performance. A scheduled batch job might collide with the deployment window. Accurate CFR measurement requires investigation into root cause, not just temporal correlation. This means that CFR is inherently a lagging metric: you cannot calculate it accurately until postmortems are complete and root causes are established.
Tracking across different services and pipelines. In microservice architectures, a single "deployment" might involve changes to multiple services. If Service A and Service B are both deployed at 2:00 PM and an incident occurs at 2:15 PM, which deployment caused it? Accurate CFR requires granular tracking at the service level, not just a count of "deployments that happened today."
Firetiger measures change failure rate by correlating GitHub deployment events with production anomalies detected by its agents, giving teams an automated and continuous view of their CFR. For teams just starting to measure CFR, the practical advice is: start with a simple definition (any deployment followed by a rollback, hotfix, or incident within 24 hours) and refine from there. An imprecise measurement that you track consistently is more valuable than a perfect definition that you never implement.
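That simple starting definition can be implemented in a few lines. This sketch deliberately uses timestamps only, with no root-cause check, matching the "imprecise but consistent" advice above; all names are illustrative:

```python
from datetime import datetime, timedelta

def failed_deploys(deploy_times, remediation_times, window=timedelta(hours=24)):
    """Count deploys followed by a rollback, hotfix, or incident within the window.

    A deliberately naive starting rule: no root-cause analysis, just timestamps.
    """
    return sum(
        any(d <= r <= d + window for r in remediation_times)
        for d in deploy_times
    )

deploys = [datetime(2024, 5, day, 10) for day in range(1, 11)]  # 10 deploys
remediations = [datetime(2024, 5, 3, 15), datetime(2024, 5, 7, 2)]
failed = failed_deploys(deploys, remediations)
print(f"CFR: {100 * failed / len(deploys):.0f}%")  # → CFR: 20%
```

Note the known imprecision: a single remediation event can flag more than one deploy if their 24-hour windows overlap. That is acceptable for a baseline you refine later.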
How can you reduce change failure rate without slowing down?
The naive approach to reducing change failure rate is to deploy less often or to add more gates before deployment: more manual review, more testing stages, more approval steps. This works in the narrow sense that fewer deployments mean fewer failures, but it fails as a strategy because it sacrifices velocity, increases batch size (which raises the risk of each deployment), and creates organizational bottlenecks.
The better approach is to reduce CFR by improving the speed and accuracy of detection, not by reducing the number of changes. Here are the strategies that work.
Automated deployment verification. When every deployment is automatically checked against customer-facing health signals, failures are caught faster and with higher accuracy. Instead of relying on a human to notice a problem or waiting for a customer complaint, the system detects regressions in sign-up rates, API success rates, latency percentiles, and database performance within minutes of deployment. Faster detection does not prevent the failure from happening, but it dramatically reduces the impact, and reduced impact is what ultimately matters. A deployment that causes a 1% error increase for three minutes before being automatically rolled back is a very different kind of "failure" than one that runs for eight hours before someone notices.
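The core of such a check is a comparison of post-deploy signals against a pre-deploy baseline, with a per-deploy threshold tighter than any global alert. This is a minimal sketch under assumed thresholds, not a description of any specific product's verification logic:

```python
def verify_deployment(baseline_error_rate: float,
                      post_deploy_error_rate: float,
                      max_increase: float = 0.005) -> bool:
    """Pass the deployment only if the error rate did not rise by more than
    max_increase (absolute). The 0.5% threshold here is illustrative."""
    return (post_deploy_error_rate - baseline_error_rate) <= max_increase

# A ~1% jump in errors on one endpoint: too small to trip a global alert,
# but caught when compared against that endpoint's pre-deploy baseline.
print(verify_deployment(baseline_error_rate=0.002,
                        post_deploy_error_rate=0.012))  # → False
```

In practice the same comparison would run for each tracked signal (sign-up rate, API success rate, latency percentiles), with a rollback triggered on any failing check.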
Progressive rollouts limiting blast radius. When changes are rolled out to a small percentage of traffic first, failures affect fewer users. A canary deployment that sends 5% of traffic to the new version means that even a severe bug in the new code only affects 5% of users. If verification catches the problem during the canary phase, the overall impact is minimal. Progressive rollouts do not change the CFR numerator (the deployment still "failed"), but they change the severity of failure so dramatically that the practical effect is similar to preventing the failure entirely.
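The rollout-with-verification loop can be sketched as a sequence of traffic stages gated by a health check. The stage percentages and the `healthy` callback are placeholders for real canary configuration and verification:

```python
def rollout(stages=(5, 25, 50, 100), healthy=lambda pct: True) -> str:
    """Advance traffic through canary stages, halting if verification fails.

    `healthy` stands in for real post-deploy verification run at each stage.
    """
    for pct in stages:
        if not healthy(pct):
            return f"rolled back at {pct}% traffic"
    return "fully rolled out"

# A bug caught during the 5% canary never reaches the other 95% of users.
print(rollout(healthy=lambda pct: False))  # → rolled back at 5% traffic
print(rollout())                           # → fully rolled out
```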
Agent-driven monitoring tailored to each deployment. Static monitoring checks the same signals for every deployment, regardless of what changed. Agent-driven monitoring reads the code change, understands what it is supposed to do, and generates targeted verification checks for that specific deployment. This catches the category of subtle failures that generic monitoring misses: the 422 error on a specific endpoint, the performance regression in a particular query path, the feature that was supposed to improve latency but did not. For example, Firetiger generates monitoring plans by reading pull request descriptions and code diffs, then crafting specific expectations for what should change and what should not change after deployment. This kind of targeted verification catches failures that would otherwise slip through until customers report them.
Smaller, more frequent changes. This is counterintuitive but well-established: deploying more often with smaller changes typically reduces CFR rather than increasing it. Smaller changes are easier to review, easier to test, easier to understand, and easier to roll back. They also make root cause analysis trivial: if a deployment contains a single small change and something breaks, you know exactly what caused it. Large, batched deployments with dozens of changes make it nearly impossible to identify which change caused a regression without extensive investigation.
Faster detection equals smaller blast radius equals lower effective failure rate. This is the fundamental insight that connects detection speed to CFR. If your mean time to detect a deployment failure is 60 minutes, the average bad deployment affects users for 60 minutes. If you reduce detection time to 5 minutes, the same failure affects users for 5 minutes. The deployment still failed in the technical sense, but the impact, the thing you actually care about, is reduced by 92%. Some teams argue that a failure detected and rolled back within five minutes should not count as a CFR event at all, because the user impact is negligible. Whether you count it or not, the material effect is the same: investing in detection speed is one of the highest-leverage ways to improve deployment reliability.
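The 92% figure follows directly from the detection times; the arithmetic is just the relative reduction in exposure:

```python
def impact_reduction(old_minutes: float, new_minutes: float) -> float:
    """Percentage reduction in user-facing exposure from faster detection."""
    return 100.0 * (old_minutes - new_minutes) / old_minutes

# Cutting mean time to detect from 60 minutes to 5 minutes:
print(f"{impact_reduction(60, 5):.0f}%")  # → 92%
```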
Post-incident learning loops. Every deployment failure is an opportunity to improve the detection system. When a failure slips through verification, the postmortem should ask not just "what went wrong with the code" but "why did our verification not catch this?" Each answer becomes a new verification check or a refinement of an existing one. Over time, the verification system becomes increasingly comprehensive, catching categories of failures that it previously missed. Teams that consistently close this loop see their CFR decrease steadily over quarters even as their deployment frequency increases.
Where to start
- Define what counts as a "failure": Agree as a team whether you're counting outages only, or also including degradations, rollbacks, and hotfixes.
- Correlate deploy events with incidents: Connect your deployment pipeline to your incident tracker so you can automatically associate incidents with the deploys that caused them.
- Baseline your current CFR: Measure your change failure rate over the last 90 days before trying to improve it -- you need a starting point.
- Deploy outcome-oriented monitoring: Use a platform like Firetiger to continuously measure CFR by correlating deploys with production anomalies detected by AI agents.