What is root cause analysis?
Root cause analysis (RCA) is the systematic process of identifying the fundamental cause, or causes, of an incident in a software system. Rather than stopping at the immediate symptom ("the API returned 503 errors"), RCA digs deeper to find the chain of events and conditions that allowed the failure to occur in the first place.
The distinction between symptoms, contributing factors, and root causes is central to the practice. A symptom is the observable impact: users cannot log in, dashboards are blank, data is not being ingested. A contributing factor is a condition that made the failure more likely or more severe but did not directly cause it: a missing alert configuration that delayed human awareness, a team member being out of office, or a test environment that did not replicate production conditions. The root cause is the specific failure that initiated the chain: a race condition in the CI pipeline, a misconfigured service definition, a code change that introduced a deadlock.
Effective RCA matters because it determines whether an organization actually prevents recurrence or merely patches over the immediate problem. Fixing a symptom without understanding the root cause guarantees that the same class of failure will happen again, often at a worse time and with greater impact. RCA transforms an incident from a painful disruption into a learning opportunity that makes the system more resilient.
Why is root cause analysis so time-consuming?
RCA is one of the most labor-intensive activities in incident response, routinely consuming more engineering hours than the incident itself. Several factors contribute to this.
Cross-system investigation. Modern production systems span dozens of services, databases, message queues, CDNs, and cloud provider APIs. An incident that manifests as a user-facing error may have originated in a CI pipeline, propagated through an infrastructure-as-code deployment, and become visible only when a container orchestrator attempted a routine task restart. Tracing this path requires querying logs in one system, metrics in another, deployment records in a third, and infrastructure state in a fourth. Each system has its own query language, retention policy, and access controls.
Specialized knowledge concentration. In most organizations, deep understanding of specific subsystems is concentrated in a small number of people. The engineer who understands the CI pipeline's concurrency rules is not the same person who knows the intricacies of ECS task scheduling. RCA often requires assembling knowledge from multiple specialists, coordinating their schedules, and synthesizing their individual findings into a coherent narrative. When a key expert is unavailable, investigation stalls.
Log volume overwhelms manual review. A busy production system can generate millions of log lines per hour. Searching through this volume for the specific entries that explain an incident is like finding a needle in a haystack, except the haystack is actively growing. Engineers must craft precise queries, iterate on them as they refine their understanding, and cross-reference results across multiple log sources. This process is inherently iterative and slow.
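The iterative narrowing described above can be sketched in a few lines. This is a toy illustration, not a real log backend: the log lines, the time window, and the `billing-svc` service name are all hypothetical, and a production query would run against a log store rather than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical log lines in "timestamp level message" form.
LOG_LINES = [
    "2024-05-01T10:00:01Z INFO  request served in 12ms",
    "2024-05-01T10:03:12Z ERROR upstream timeout calling billing-svc",
    "2024-05-01T10:03:13Z ERROR upstream timeout calling billing-svc",
    "2024-05-01T10:03:40Z ERROR cache miss storm in session-store",
    "2024-05-01T10:04:00Z INFO  request served in 9ms",
]

def search_logs(lines, start, end, must_contain):
    """Return the lines inside [start, end] that match every term.

    Each refinement pass narrows the window or adds terms, mirroring
    how an engineer iterates on a log query.
    """
    results = []
    for line in lines:
        ts = datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
        if start <= ts <= end and all(term in line for term in must_contain):
            results.append(line)
    return results

# First pass: every ERROR in the incident window.
window = (datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc),
          datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc))
errors = search_logs(LOG_LINES, *window, ["ERROR"])

# Second pass: narrow to the suspect downstream service.
billing_errors = search_logs(LOG_LINES, *window, ["ERROR", "billing-svc"])
```

Every pass over millions of real log lines takes time, which is why even this simple narrow-then-refine loop becomes slow at production volume.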
Pressure to fix before understanding. During an active incident, the priority is restoring service. Engineers are rightly focused on mitigation: rolling back the bad deploy, restarting failed services, rerouting traffic. This means that the deep investigation required for RCA is typically deferred until after the incident is resolved, at which point context has faded, adrenaline has worn off, and other work has piled up. The result is that many RCAs are written days or weeks after the incident, with degraded recall of the details.
Overlapping causes obscure the picture. The most challenging incidents involve multiple contributing factors that interact in unexpected ways. A real-world example illustrates this well: one platform experienced an eight-hour outage caused by three overlapping issues. First, a race condition in the CI system caused a build job to be erroneously canceled. Second, the deployment pipeline's artifact attribution logic falsely believed the build had completed, so it proceeded to update service definitions to reference a container image that did not exist. Third, the infrastructure tool partially applied the change, updating one service's configuration while rejecting another, leaving the system in an inconsistent state where existing processes continued running but could not be replaced when they naturally recycled.
The outage did not begin immediately. The existing processes kept running for roughly 25 hours on the old container image. When they eventually restarted, the system could not launch new processes because the service definition pointed to a nonexistent image. To make matters worse, a notification misconfiguration, caused by an engineer working on a separate feature, prevented alerts from reaching the on-call channel. It took eight hours for a human to become aware of the problem.
Untangling these overlapping causes, separating the root cause from the contributing factors, and identifying what would have prevented the incident entirely is precisely the work that makes RCA take so long.
How can agents automate root cause analysis?
Autonomous agents are increasingly capable of performing the investigative work that makes RCA time-consuming for humans. These agents combine large language models with tool access, allowing them to form hypotheses, query multiple data sources, and iteratively refine their understanding of an incident.
Hypothesis formation and testing. An agent begins with observable symptoms, such as a spike in error rates or a drop in traffic, and generates hypotheses about potential causes. It then queries logs, metrics, and infrastructure state to validate or eliminate each hypothesis. This is fundamentally the same process a human engineer follows, but the agent can execute it faster because it does not need to context-switch, look up query syntax, or wait for colleagues to answer questions.
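The loop an agent runs can be sketched as follows. Everything here is a hypothetical stand-in: the evidence dict represents answers an agent would fetch from deployment records, metrics, and the container registry, and the hypothesis checks are placeholders for real queries.

```python
def investigate(hypotheses, evidence):
    """Test each hypothesis against the gathered evidence; keep the survivors."""
    return [name for name, check in hypotheses.items() if check(evidence)]

# Hypothetical evidence an agent might gather from its tools.
evidence = {
    "recent_deploy": True,       # from deployment records
    "error_spike": True,         # from metrics
    "image_in_registry": False,  # from the container registry
}

# Each hypothesis is a predicate over the evidence.
hypotheses = {
    "bad deploy": lambda e: e["recent_deploy"] and e["error_spike"],
    "missing image": lambda e: e["recent_deploy"] and not e["image_in_registry"],
    "traffic surge": lambda e: e["error_spike"] and not e["recent_deploy"],
}

confirmed = investigate(hypotheses, evidence)
```

The real work lies in the tool calls behind each check; the loop itself is simple, which is exactly why an agent can run it tirelessly.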
Multi-source intelligence. The key advantage agents have over static correlation rules is their ability to synthesize information from disparate sources. An agent investigating a service outage might check deployment records to see if a recent release coincided with the problem, examine container orchestrator state to identify which processes failed to start, query the CI system to determine whether the build completed successfully, and inspect infrastructure-as-code state to find partial applies. A human performing the same investigation would need to open four or five different tools and manually connect the dots.
Coalescing related signals. In complex incidents, multiple monitoring systems may independently detect different facets of the same problem. One agent might notice that webhook ingestion is failing. Another might flag a drop in telemetry data volume. A third might detect that API health checks are returning errors. An automated triage system can coalesce these individual findings into a single incident, providing the investigating agent with a richer starting point than any individual signal would offer.
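A minimal version of this coalescing logic groups signals that fire close together in time. The five-minute window and the signal payloads are illustrative assumptions; a real triage system would also weigh shared services, resources, and topology.

```python
def coalesce(signals, window_seconds=300):
    """Group signals that fire within window_seconds of the previous
    signal into one candidate incident (a deliberately simple rule)."""
    incidents = []
    for sig in sorted(signals, key=lambda s: s["t"]):
        if incidents and sig["t"] - incidents[-1][-1]["t"] <= window_seconds:
            incidents[-1].append(sig)   # same burst: merge into the incident
        else:
            incidents.append([sig])     # gap too large: start a new incident
    return incidents

# Hypothetical signals, as seconds since the first detection.
signals = [
    {"t": 0,    "source": "webhook ingestion failing"},
    {"t": 40,   "source": "telemetry volume drop"},
    {"t": 90,   "source": "API health check errors"},
    {"t": 7200, "source": "unrelated disk alert"},
]

incidents = coalesce(signals)
```

The first three signals merge into one incident, giving the investigating agent three facets of the same problem as its starting point, while the unrelated alert stays separate.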
Persistence and thoroughness. Human investigators are subject to fatigue, time pressure, and cognitive bias. They may stop investigating once they find a plausible explanation, even if it is not the complete picture. Agents can be instructed to keep looking, to check for additional contributing factors, and to validate their findings against multiple data sources. One AI inference platform had its monitoring agents produce an RCA that was, in the assessment of a senior engineer, almost identical to what they would have written manually. This suggests that the quality of automated RCA is approaching human expert levels, at least for incidents with clearly observable evidence in logs and metrics.
Structured output. Agents can be instructed to produce RCAs in a consistent format, complete with timeline, impact assessment, root cause chain, and recommended preventive measures. This consistency is valuable because human-written RCAs vary widely in quality and completeness, depending on who writes them and how much time they have. Firetiger's agents automate much of this investigative work: forming hypotheses, querying across logs, metrics, and traces, and producing structured RCA reports that match the quality of senior engineer analysis.
It is worth noting the current limitations. Agents are most effective when the evidence of the root cause exists in queryable systems: logs, metrics, traces, deployment records, infrastructure state. Incidents caused by subtle concurrency bugs, design-level architectural issues, or problems that leave no observable trace in telemetry are still beyond the reach of automated RCA. The agent's investigation is only as good as the observability data available to it.
What makes a good root cause analysis?
Whether performed by a human or an agent, a high-quality RCA shares certain structural properties that distinguish it from a superficial incident summary.
A detailed timeline. The RCA should reconstruct the sequence of events from the initiating cause through to resolution, with timestamps. This timeline is the backbone of the analysis, because it reveals the causal chain: which event preceded which, how long each phase lasted, and where delays occurred. A good timeline includes not just the failure events but also the detection and response events, since delays in detection or response are often contributing factors in their own right.
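As a sketch, a timeline for the overlapping-cause incident described earlier might be represented as structured records (the exact timestamps here are hypothetical). Keeping detection and response events alongside failure events makes delays directly computable.

```python
from datetime import datetime

# Hypothetical RCA timeline: (timestamp, kind, description).
timeline = [
    ("2024-05-01T09:10Z", "failure",   "CI race condition cancels build job"),
    ("2024-05-01T09:12Z", "failure",   "deploy points service definition at missing image"),
    ("2024-05-02T10:03Z", "failure",   "tasks recycle and cannot launch; outage begins"),
    ("2024-05-02T18:05Z", "detection", "engineer notices the outage (alerts misrouted)"),
    ("2024-05-02T18:40Z", "response",  "service definition rolled back; tasks recover"),
]

def parse(ts):
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

# With detection events in the timeline, the detection delay falls out directly.
outage_start = parse(timeline[2][0])
detected = parse(timeline[3][0])
detection_delay_h = (detected - outage_start).total_seconds() / 3600
```

Here the roughly eight-hour gap between outage start and detection is visible at a glance, which is exactly the kind of contributing factor a timeline should surface.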
Impact assessment. The RCA should quantify the incident's impact in terms that matter to the business: number of affected users, duration of service degradation, volume of lost or delayed data, and any financial or contractual implications. Impact assessment is important because it calibrates the urgency of the preventive measures. An incident that affected three users for five minutes warrants different follow-up than one that degraded service for all customers for eight hours.
Root cause chain, not a single root cause. Serious incidents rarely have a single root cause. More commonly, they involve a chain of causes in which each link made the next one possible or worse. A good RCA identifies this chain explicitly. For the overlapping-cause incident described earlier, the chain included the CI race condition (which caused the build to be canceled), the artifact attribution bug (which caused the deploy to proceed with a bad reference), and the partial infrastructure apply (which left the system in an inconsistent state). Each link in the chain represents an opportunity for prevention.
Distinction between proximate and systemic causes. The proximate cause is the specific event that triggered the incident: a bad deploy, a misconfigured setting, an expired certificate. The systemic cause is the organizational or architectural condition that allowed the proximate cause to have the impact it did: lack of deploy validation, absence of automated rollback, insufficient monitoring coverage. Good RCAs address both, because fixing only the proximate cause leaves the systemic vulnerability in place for the next incident.
Contributing factors. Not every condition that influenced the incident's severity is a root cause. The notification misconfiguration that delayed awareness in the overlapping-cause incident was a contributing factor, not a root cause. It did not cause the outage, but it significantly extended its duration. Contributing factors deserve their own section in the RCA because they represent independent improvement opportunities.
Preventive measures that are specific and actionable. A good RCA concludes with concrete recommendations, not vague aspirations. "Improve our CI pipeline" is not actionable. "Add a validation step in the deploy pipeline that verifies the referenced container image exists in the registry before updating service definitions" is. Each preventive measure should map to a specific link in the root cause chain or a specific contributing factor.
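The image-validation recommendation above can be sketched as a deploy gate. This is a toy illustration: the image names are invented, and `registry_has` stands in for a real existence check, which might shell out to `docker manifest inspect` or call the registry's API.

```python
def safe_deploy(image_ref, registry_has, update_service):
    """Gate the deploy: refuse to update service definitions unless the
    referenced container image actually exists in the registry."""
    if not registry_has(image_ref):
        raise RuntimeError(f"refusing deploy: image not found: {image_ref}")
    update_service(image_ref)

# Hypothetical registry contents and a recording stub for the update step.
KNOWN_IMAGES = {"registry.example.com/app:v41"}
updated = []

# The good reference passes the gate; the bad one is rejected before
# any service definition is touched.
safe_deploy("registry.example.com/app:v41", KNOWN_IMAGES.__contains__, updated.append)
```

Note how the gate maps to a specific link in the root cause chain: it would have stopped the deploy that referenced the nonexistent image, regardless of why the build was canceled.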
Blameless tone. The purpose of RCA is to improve the system, not to assign blame. Effective RCAs describe what happened in terms of systems and processes, not individual failures. "The deploy pipeline did not validate the container reference" rather than "the engineer who configured the pipeline made a mistake." This framing encourages honest reporting, because people are more likely to surface information about incidents when they trust that the analysis will focus on systemic improvement rather than individual punishment.
The best RCAs are documents that any engineer on the team can read six months later and understand exactly what happened, why it happened, and what was done to prevent it from happening again. They serve as institutional memory, turning painful incidents into durable knowledge.
Where to start
- Centralize access to logs, metrics, and traces: Ensure on-call engineers can query all three telemetry types from a single interface without switching tools.
- Document tribal knowledge: Identify the systems where only one or two people know how things work, and write operational runbooks for those areas.
- Set up cross-service correlation: Tag requests with trace IDs that flow across service boundaries so you can follow a single request through your entire stack.
- Implement proactive agent-driven investigation: Deploy a system like Firetiger so root cause analysis begins the moment an anomaly is detected, rather than waiting for a human to start manual log analysis.
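The cross-service correlation step above hinges on propagating one trace ID through every hop. A minimal sketch, loosely modeled on the W3C Trace Context `traceparent` header (the span-ID handling here is simplified for illustration):

```python
import uuid

TRACE_HEADER = "traceparent"  # W3C Trace Context header name

def inbound_trace_id(headers):
    """Reuse the caller's trace ID if one arrived; otherwise start a new trace."""
    raw = headers.get(TRACE_HEADER)
    if raw:
        # Header shape: version-traceid-spanid-flags; keep the trace-id field.
        return raw.split("-")[1]
    return uuid.uuid4().hex

def outbound_headers(trace_id):
    """Attach the same trace ID (with a fresh span ID) to downstream calls,
    so one request can be followed across every service boundary."""
    return {TRACE_HEADER: f"00-{trace_id}-{uuid.uuid4().hex[:16]}-01"}
```

Each service reads the inbound ID, logs it with every line, and forwards it on its outgoing requests; in practice an OpenTelemetry SDK handles this propagation for you.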