Why ticket-based DORA metrics fall short
Most DORA tools compute change failure rate from incident tickets, revert detection, or PM-tool labels, and they compute mean time to recovery as the lifecycle of an incident ticket. These approaches are the well-established industry default, but each one depends on human discipline and misses categories of real failures. Telemetry-grounded measurement reads the system's behavior directly and is harder to distort.
Two of the four DORA metrics — change failure rate and mean time to recovery — are about failure. To compute them, a tool needs to know which deployments caused problems and when those problems were resolved. The data sources used by most vendors today are not the systems where the problems happened. They are the systems where humans recorded that the problems happened. The difference matters.
This article is intended as a factual look at how the major approaches work, what each captures well, and where each one breaks down. It is not a claim that telemetry-grounded measurement is universally correct, or that ticket-based measurement is useless. Both are reasonable defaults under different conditions. But the tradeoffs are worth understanding before a team commits to one approach for years.
The three common change-failure-rate methods
There are three approaches in production use today, listed roughly in order of how common they are.
Method 1: revert and rollback pattern matching
The simplest mechanism is to scan git history and treat any commit whose message contains "revert," "rollback," or "hotfix" as evidence that the previous deploy caused a failure. The deploy preceding the revert is counted as a failed deploy; the change failure rate is the ratio of failed deploys to total deploys. Several major DORA tools ship this as a default detection rule.
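As a rough illustration, the entire detection rule fits in a few lines. The sketch below assumes each commit on the default branch corresponds to one production deploy, which is itself a simplification many tools make; the function name and the word-boundary pattern are illustrative only.

```python
import re
import subprocess

# Minimal sketch of the revert-pattern heuristic. Assumes one commit on the
# default branch per production deploy, which is itself a simplification.
REVERT_PATTERN = re.compile(r"\b(revert|rollback|hotfix)\b", re.IGNORECASE)

def change_failure_rate_from_git(repo_path: str) -> float:
    # One subject line per commit, oldest first.
    subjects = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--format=%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    deploys = len(subjects)
    # A revert-style commit is taken as evidence that the previous deploy failed.
    failures = sum(1 for subject in subjects if REVERT_PATTERN.search(subject))
    return failures / deploys if deploys else 0.0
```

The metric is only as good as the regular expression and the commit-message habits it depends on, which is where the gaps described next come from.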
What it captures: This method picks up the obvious failures — the kind where the on-call engineer or the deploy pipeline rolled back to the prior version. It requires no integration beyond git access. It is the cheapest mechanism to ship.
What it misses: Fix-forwards. A team that responds to a regression by deploying a new fix rather than reverting will see no "revert" commit at all, and the failure goes uncounted. Teams that practice continuous deployment fix-forward routinely, which means this method systematically under-counts failures for the teams whose delivery practices are most modern. The method is also trivially gameable: omit the word "revert" from the commit message and the failure disappears.
Method 2: PagerDuty and incident-tool correlation
The second approach is to take the commit SHA of each production deployment and look for incidents opened in PagerDuty (or another incident-management tool) within a time window — typically 24 to 72 hours. A deployment with at least one incident in its window is counted as a failed deploy.
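The correlation logic is equally simple to express. The sketch below takes already-fetched deploy and incident records as plain Python objects rather than calling any particular incident tool's API; the field names, the 24-hour default, and the severity set are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deploy:
    sha: str
    deployed_at: datetime

@dataclass
class Incident:
    opened_at: datetime
    severity: str

def change_failure_rate(deploys: list[Deploy],
                        incidents: list[Incident],
                        window: timedelta = timedelta(hours=24),
                        counted_severities: frozenset[str] = frozenset({"sev1", "sev2"})) -> float:
    def failed(deploy: Deploy) -> bool:
        # A deploy "fails" if any qualifying incident opens within its window.
        return any(
            deploy.deployed_at <= inc.opened_at <= deploy.deployed_at + window
            and inc.severity in counted_severities
            for inc in incidents
        )
    return sum(failed(d) for d in deploys) / len(deploys) if deploys else 0.0
```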
What it captures: Failures that crossed the team's threshold for paging the on-call engineer. Severity-tagged incidents can be filtered to count only high-severity failures, which is closer to the spirit of the DORA metric than the raw revert count.
What it misses: Anything that did not page someone. This is the larger problem with the method, and it splits into two cases. First, silent regressions — degradations that affected users but did not breach alerting thresholds — are not counted at all. Engineering leaders have described production deployments that caused subtle changes in HTTP response handling (extra 422 errors on specific endpoints, for instance) and that ran for hours or days before customers complained, because the absolute error rate stayed below the alert threshold. None of these show up. Second, the method is sensitive to ticket-filing discipline. A team that pages itself aggressively will look like it has a high change failure rate; a team that handles small issues without paging will look like it has a low one. Same underlying reality, different measured numbers.
Method 3: PM-tool labels
Some tools rely on tickets labeled with a specific string in Jira or Linear — "production incident," "post-deploy bug," or similar — as the failure signal. One vendor's documentation describes label discipline as "essential" for the metric to work.
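Reduced to code, the method is little more than a query and a division. The label name, issue structure, and query fragment in the comment below are hypothetical; the point is that the entire signal is whatever got labeled.

```python
# Sketch of the label-based method. In practice the failure count comes from a
# Jira or Linear query such as `label = "production-incident" AND created >= -30d`;
# the label name and issue shape here are assumptions for illustration.
FAILURE_LABEL = "production-incident"

def change_failure_rate_from_labels(issues: list[dict], deploy_count: int) -> float:
    failures = sum(1 for issue in issues if FAILURE_LABEL in issue.get("labels", []))
    return failures / deploy_count if deploy_count else 0.0
```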
What it captures: Whatever the team has agreed to label. For teams with strong labeling practices, this can be the most curated signal of the three.
What it misses: Whatever the team has not labeled. This is the most discretionary of the three methods, and the one most directly dependent on human follow-through.
Vendors that ship these mechanisms have been increasingly open about the limitations. At least one DORA vendor states in its own documentation that incident-based approaches "miss failures that don't trigger tickets, and rely on accurate incident categorization that teams often struggle with at scale." The candor is welcome. It also points to where the room for improvement lies.
Mean time to recovery as ticket lifecycle
MTTR is computed almost universally as the time between when an incident ticket is opened and when it is closed. Vendor documentation typically defines it explicitly: "the total duration an incident spends in the active state." No major DORA vendor today computes recovery time from the system's actual return to baseline.
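Concretely, the computation reduces to something like the sketch below, which takes ticket open and close timestamps and nothing else. The function name and input shape are illustrative.

```python
from datetime import datetime, timedelta

# Sketch of MTTR as most tools compute it: the mean of (closed - opened)
# across incident tickets. Nothing here consults the system that failed.
def mttr_from_tickets(tickets: list[tuple[datetime, datetime]]) -> timedelta:
    if not tickets:
        return timedelta()
    durations = [closed - opened for opened, closed in tickets]
    return sum(durations, timedelta()) / len(durations)
```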
This produces two predictable distortions. Engineers who fix the system quickly but update the ticket slowly look worse than engineers who fix slowly and close the ticket fast — the metric measures Jira hygiene more than system behavior. And the metric assumes that the moment a ticket is closed, the system is healthy, which is not always true. Tickets get closed when the on-call engineer is confident enough to log off; the system can still be slightly degraded.
There is also a more fundamental observation worth surfacing: improving MTTR on incidents you already detect has diminishing returns. Engineering leaders consistently note that shaving a major incident from 15 minutes to 10 minutes does not change much for the business. The much larger lever is detecting incidents you would not have otherwise detected at all: the silent regressions that the ticket-based denominator never counted. A measurement system that surfaces those raises the CFR numerator and expands the actionable list of failures to investigate. A measurement system that does not surface them produces flattering numbers that hide real problems.
The configuration burden
Beyond the measurement accuracy problem, ticket-based DORA tools share a configuration-burden problem that is independently expensive.
The customer is typically required to do all of the following before useful numbers appear (a sketch of what this configuration amounts to follows the list):
- Map each repo to one or more services. This is most painful in monorepos, where regexes or path globs on file paths stand in for a real service taxonomy.
- Define what "production" means per service: which branch, which environment, which deploy job.
- Classify incident severities and decide which severities count as "real" failures.
- Tag CI workflows as deploy workflows so the tool can distinguish a deploy from a non-deploy job.
- Map PagerDuty (or equivalent) severities to DORA failure categories.
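To make the scope of that mapping concrete, the block below shows a hypothetical configuration rendered as a Python dictionary. Every name in it is invented for illustration, but each key corresponds to one of the items above.

```python
# Hypothetical illustration of the mapping a ticket-based DORA tool needs
# before it can produce numbers. All names are invented; each key mirrors
# one bullet from the list above.
DORA_CONFIG = {
    "services": {
        "checkout": {
            "repo": "org/monorepo",
            # Path globs standing in for a real service taxonomy (monorepo case).
            "paths": ["services/checkout/**"],
            # What "production" means for this service.
            "production_branch": "main",
            "production_environment": "prod-us-east",
            # Which CI workflows count as deploys.
            "deploy_workflows": ["deploy-checkout"],
        },
    },
    # Which incident severities count as "real" failures for CFR.
    "failure_severities": ["sev1", "sev2"],
    # How the incident tool's severities map onto those categories.
    "incident_severity_map": {"P1": "sev1", "P2": "sev2", "P3": "ignore"},
}
```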
Each of these is reasonable in isolation. Together they represent weeks-to-months of onboarding effort, and they are where most DORA implementations stall and produce unreliable numbers until someone goes back and fixes the configuration. Engineering leaders routinely describe building homebrew ETL pipelines (GitHub APIs to a data warehouse, custom scripts to join deploys with incidents) just to get clean inputs. Many of these projects don't reach the point where the team trusts the dashboard.
What production-reality-derived measurement looks like
The alternative is to measure failure and recovery from the system's behavior directly.
For change failure rate, that means: "did the services touched by this deploy show a meaningful behavioral change — error-rate spike, latency regression, saturation breach — in a window after the deploy?" The signal is the system's telemetry, not a ticket. This catches the silent regressions that none of the three ticket-based methods see, and it does not depend on revert detection or label discipline.
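A minimal sketch of that check, assuming per-minute error rates for the touched service are already available, might look like the following. A real implementation would compare against a learned baseline and cover latency and saturation as well; only the shape of the decision is shown, and the threshold values are illustrative.

```python
from statistics import mean

# Sketch of a telemetry-grounded failure verdict for one deploy. Inputs are
# assumed to be non-empty per-minute error rates before and after the deploy.
def deploy_failed(pre_deploy_error_rates: list[float],
                  post_deploy_error_rates: list[float],
                  spike_factor: float = 2.0,
                  floor: float = 0.001) -> bool:
    # The floor keeps a near-zero baseline from flagging trivial noise.
    baseline = max(mean(pre_deploy_error_rates), floor)
    observed = mean(post_deploy_error_rates)
    return observed >= spike_factor * baseline
```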
For mean time to recovery, that means: "from the first telemetry deviation to the metric returning to baseline and staying there." Recovery is observed, not declared. An engineer who fixes the system in twelve minutes is credited with twelve minutes, regardless of when they update the ticket.
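Observed recovery can be sketched the same way, assuming a time-ordered series of metric samples and a known baseline. The ten-minute stability window is an illustrative choice, not a standard.

```python
from datetime import datetime, timedelta

# Sketch of recovery observed from telemetry: from the first sample that
# breaches the baseline to the start of the first stretch that stays at or
# below baseline for `stable_for`. Samples are assumed to be time-ordered.
def observed_recovery(samples: list[tuple[datetime, float]],
                      baseline: float,
                      stable_for: timedelta = timedelta(minutes=10)) -> timedelta | None:
    breach_start = None
    healthy_since = None
    for ts, value in samples:
        if value > baseline:
            breach_start = breach_start or ts  # first deviation
            healthy_since = None               # any new breach resets recovery
        elif breach_start is not None:
            healthy_since = healthy_since or ts
            if ts - healthy_since >= stable_for:
                return healthy_since - breach_start
    return None  # never deviated, or never stably recovered
```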
For service mapping, that means: services are identified from trace tags (service.name, service.version, commit SHA) rather than from customer-supplied YAML. The mapping is in the data, not in a separate configuration document.
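A sketch of that lookup, assuming spans arrive as flat attribute dictionaries: service.name and service.version follow OpenTelemetry semantic conventions, while the commit-SHA attribute name varies by team and is a placeholder here.

```python
# Sketch of deriving service identity from span attributes rather than from
# customer-supplied configuration. "deploy.commit_sha" is a placeholder name.
def services_for_deploy(spans: list[dict], commit_sha: str) -> set[str]:
    return {
        span["service.name"]
        for span in spans
        if span.get("service.version") == commit_sha
        or span.get("deploy.commit_sha") == commit_sha
    }
```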
There is a useful historical analogy here. Before APM (application performance monitoring) became standard, application performance was measured by user complaints and load-test extrapolation. After APM, performance was measured from actual traces and metrics. The shift was not "we built a better dashboard"; it was "we changed the source of truth from human reports to system observation." The DORA stability metrics — CFR and MTTR — are in a similar position today. They are widely measured from ticket archaeology. The same shift is available.
How Firetiger fits
Firetiger reads each PR diff, generates a deployment-specific monitoring plan, watches the deployment across staging, canary, and production, detects regressions, and investigates root cause. The Change Monitor verdict — issued for each deploy based on observed telemetry — is the same signal that drives DORA's change failure rate and the same signal that anchors the MTTR window. Service mapping comes from trace tags, not from customer YAML. There is no separate DORA build inside Firetiger; the numbers fall out of the data already being collected to do change-aware production monitoring.
This is not a claim that telemetry-grounded measurement is always right or that ticket-based tools are always wrong. Tickets remain valuable for the things they were designed for — coordination, customer communication, postmortem records. The argument is narrower: when measuring whether a deploy caused a failure and when the failure recovered, the system's telemetry is a more direct source of truth than the ticket archaeology that derives from it.
Where to start
- Look at your current CFR and ask what method produced it. Revert detection, PagerDuty correlation, and Jira labels each have different blind spots; knowing which one you are using tells you what your number is and is not capturing.
- Sample a week's deploys manually. For each one, ask: did this deploy cause measurable user impact? Compare the answer against what the dashboard says. Most teams find a meaningful gap.
- Decide what counts as recovered. If you measure MTTR as ticket-close time, ask what the corresponding telemetry would say. The difference is the gaming surface.
- Treat configuration burden as a signal. A DORA tool that requires weeks of mapping work to produce numbers is taxing the same engineering hours it is meant to measure.