What is observability and how is it different from monitoring?
Observability is the ability to understand what is happening inside a system by examining the data it produces, without needing to ship new code or add new instrumentation. Monitoring, by contrast, is the practice of watching predefined metrics and conditions to detect known problems. The simplest way to understand the distinction: monitoring tells you when something is broken, while observability helps you figure out why.
Monitoring is built around the question "Is this specific thing working?" You define checks in advance: Is the server responding? Is CPU usage below 90%? Is the error rate under 2%? When a check fails, you get an alert. This works well for anticipated failure modes, the ones you have already experienced or can predict. But production systems fail in ways that nobody anticipated, and monitoring cannot help you investigate problems you did not think to check for.
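The check-in-advance model is simple enough to sketch. Here is a minimal illustration in Python, using the example thresholds above; `check_health` and its inputs are invented for illustration, and a real system would pull these values from a metrics backend rather than take them as arguments:

```python
# A minimal sketch of threshold-based monitoring: every failure mode
# must be anticipated and encoded as a predefined check.

def check_health(cpu_percent: float, error_rate: float, server_up: bool) -> list[str]:
    """Evaluate predefined checks and return the alerts that fired."""
    alerts = []
    if not server_up:
        alerts.append("server not responding")
    if cpu_percent >= 90:
        alerts.append(f"CPU usage at {cpu_percent}% (threshold: 90%)")
    if error_rate >= 0.02:
        alerts.append(f"error rate at {error_rate:.1%} (threshold: 2%)")
    return alerts

print(check_health(cpu_percent=95.0, error_rate=0.01, server_up=True))
```

Anything not written into `check_health` is invisible to it, which is exactly the limitation described above: the check can only catch failures someone predicted.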
Observability flips the model. Instead of predefining what to check, you instrument your systems to emit rich telemetry data, and then you ask arbitrary questions after the fact. When a user reports that their dashboard is loading slowly but only on Tuesdays, observability lets you slice and filter your telemetry to explore that specific scenario, even if nobody ever wrote a monitor for "slow dashboard loads on Tuesdays."
Why are dashboards not enough for modern systems?
Dashboards are the most visible artifact of monitoring culture. Teams build them to visualize key metrics: request rates, error rates, latency percentiles, resource utilization. A well-designed dashboard gives you a quick health check of your system. But dashboards have structural limitations that become painful as systems grow in complexity.
The first problem is dashboard sprawl. When an incident occurs, the responder often needs to look at a combination of metrics that no existing dashboard shows. So they build a new one. One engineering team described their experience with a popular dashboard tool as "horrible for ad-hoc investigations." Engineers would create custom dashboards during an incident to correlate specific metrics, use them for thirty minutes, and then never look at them again. Over time, the organization accumulates hundreds of abandoned dashboards, and nobody knows which ones are still relevant.
The second problem is that static dashboards cannot anticipate every failure mode. A dashboard shows you what someone decided was important at the time they built it. But production failures are creative. A memory leak that only manifests under a specific traffic pattern, a database query plan regression triggered by a statistics update, a race condition in a CI pipeline that causes a build artifact to be missing: none of these are likely to have a pre-built dashboard waiting for them. When you are in the middle of an incident, you need to ask questions that nobody has asked before.
The third problem is what some teams call the "DDoS your own observability" effect. During an incident, when you most need your observability tools, you are also placing the highest load on them: multiple engineers running ad-hoc queries against your metrics backend, pulling up traces, and searching logs all at once. If your observability infrastructure is not sized for this burst of investigative traffic, it slows down precisely when speed matters most. Teams have reported situations where their dashboards became unresponsive during major incidents because too many people were querying simultaneously.
None of this means dashboards are useless. They are excellent for routine health checks and for building shared situational awareness across a team. But they are the starting point of an investigation, not the destination. Modern systems need the ability to go beyond what any pre-built view shows, to drill into specific time windows, filter by arbitrary dimensions, and correlate signals across different subsystems.
An emerging approach to this problem is using AI agents to handle the investigative querying. Rather than requiring a human to know which dashboard to check or which query to write, an agent can autonomously search across all available telemetry, correlate anomalies with deployment events, and present findings to the engineer. For example, Firetiger uses agents that read pull request code, identify which metrics and logs are relevant to a specific change, and then build targeted monitoring plans on the fly. This sidesteps the dashboard problem entirely: instead of maintaining a library of pre-built views, the system generates the right queries dynamically for each situation.
What is the difference between logs, metrics, and traces?
The most widely adopted observability platforms today include Datadog, New Relic, Grafana (often paired with Prometheus and Loki), Splunk, and Elastic. Each takes a different approach -- Datadog and New Relic offer fully managed SaaS platforms with per-host or per-GB pricing, while Grafana provides an open-source visualization layer that teams can self-host. Newer platforms like Firetiger take a different approach entirely, using AI agents to query across logs, metrics, and traces dynamically rather than relying on pre-built dashboards.
The three pillars of observability are logs, metrics, and traces. Each captures a different dimension of system behavior, and effective observability requires all three working together.
Logs are timestamped records of discrete events. When a user signs in, a log entry records that event along with relevant context: user ID, IP address, timestamp, result (success or failure), and any error messages. Logs are the most granular form of telemetry. They capture exactly what happened, in the system's own words.
The strength of logs is their richness. You can search for a specific user's activity, find the exact error message that preceded a crash, or trace the sequence of events leading up to a failure. The weakness is volume. A busy production system can generate millions of log lines per hour. Searching unstructured logs at that scale is slow and expensive. Structured logging (where events are emitted in a machine-parseable format with consistent field names) helps significantly, but log management at scale remains one of the harder infrastructure problems.
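To make the structured-logging point concrete, here is a minimal sketch using only the Python standard library. The `JsonFormatter` class and the field names (`user_id`, `ip`, `result`) are invented for illustration; production teams typically use a dedicated structured-logging library instead:

```python
import json
import logging
import sys

# Minimal structured logging: each event is one JSON object with
# consistent field names, so any field can be filtered on later.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via `extra=` on the log call.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("auth")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The sign-in event from the text, with machine-parseable context.
logger.info("user.sign_in", extra={"fields": {
    "user_id": "u_123",
    "ip": "203.0.113.7",
    "result": "failure",
    "error": "invalid_password",
}})
```

Because every event shares the same shape, a query like "all failed sign-ins for user u_123" becomes a field filter rather than a regex over free-form text.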
Metrics are numeric measurements collected over time. Examples include request count per second, average response latency, CPU utilization percentage, and database connection pool usage. Metrics are typically aggregated into time series: a value recorded at regular intervals (every 10 seconds, every minute) that can be graphed and compared over time.
The strength of metrics is efficiency. A single time series showing "requests per second" compresses millions of individual events into a lightweight, queryable stream. You can quickly see trends, spot anomalies, and set thresholds for alerts. The weakness is loss of detail. A metric showing "average latency is 200ms" hides the fact that 99% of requests complete in 50ms while 1% take 15 seconds. Percentile metrics (p50, p95, p99) help, but metrics by their nature aggregate away the individual events that explain why a number changed.
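The averaging problem is easy to demonstrate. The latencies below are hypothetical, constructed to match the 99%/1% example above:

```python
import statistics

# 100 hypothetical request latencies: 99% fast, 1% pathological.
latencies_ms = [50] * 99 + [15_000]

print(f"mean:   {statistics.mean(latencies_ms):.1f} ms")    # ~200 ms
print(f"median: {statistics.median(latencies_ms):.0f} ms")  # 50 ms
print(f"max:    {max(latencies_ms)} ms")                    # 15000 ms
```

The mean reports roughly 200 ms, yet no request actually took 200 ms: most users saw 50 ms and one saw 15 seconds. Even a p99 over this sample would read 50 ms, which is why percentiles help but still cannot replace the underlying events.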
Traces follow a single request as it travels through a distributed system. In a modern architecture, a single user action (like loading a page) might touch a load balancer, an API gateway, three microservices, two databases, and a cache. A trace records the entire journey: which services were called, in what order, how long each step took, and where errors occurred.
The strength of traces is that they show causality and flow. When a page load is slow, a trace can pinpoint whether the bottleneck was in the authentication service, the database query, or the third-party API call. Without traces, debugging latency in distributed systems often involves guesswork and log correlation across multiple services. The weakness is that tracing infrastructure adds overhead and complexity. Instrumenting every service to propagate trace context, collecting and storing traces at scale, and building tooling to visualize them is a significant investment.
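The core mechanics of a trace can be sketched in a few lines. The code below is a toy recorder invented for illustration (real systems use OpenTelemetry or a similar framework, and propagate context across process boundaries, which this sketch does not):

```python
import time
import uuid
from contextlib import contextmanager

# Toy trace recorder: every span shares one trace_id, records its
# parent span, and times its own work.
spans = []                    # collected spans for one request
_stack = []                   # current span stack, for parent links
TRACE_ID = uuid.uuid4().hex

@contextmanager
def span(name):
    span_id = uuid.uuid4().hex[:8]
    parent_id = _stack[-1] if _stack else None
    _stack.append(span_id)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "trace_id": TRACE_ID, "span_id": span_id,
            "parent_id": parent_id, "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

# One page load touching several (simulated) components.
with span("GET /dashboard"):
    with span("auth.verify_token"):
        time.sleep(0.01)
    with span("db.load_widgets"):
        time.sleep(0.03)      # the bottleneck shows up in its duration

for s in spans:
    print(f"{s['name']}: {s['duration_ms']:.1f} ms (parent={s['parent_id']})")
```

Even in this toy version, the value is visible: the durations immediately identify `db.load_widgets` as the slow step, and the parent links reconstruct the call tree without grepping logs across services.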
In practice, investigations often start with one pillar and jump to another. A metric alert tells you that error rates spiked. You search logs to find the specific error messages. You pull traces to understand which service in the request path is failing. The three pillars are not alternatives to each other; they are complementary perspectives on the same system.
The next frontier in observability is making these three pillars queryable through a single interface rather than requiring engineers to manually correlate across separate tools. Agent-driven systems can traverse all three data types in a single investigation, asking a metrics query to identify the time window, then searching logs within that window for error patterns, then pulling traces to identify the failing component. This kind of cross-pillar investigation is exactly what humans do during incidents, but agents can do it faster and without the cognitive overhead of switching between three different tools.
Where to start
- Audit your current dashboards: Identify which dashboards are actively used during incidents vs. which are stale. Archive or delete the stale ones.
- Ensure you have all three pillars: Verify you're collecting logs, metrics, and traces -- and that they're correlated via shared identifiers (trace IDs, request IDs).
- Set up ad-hoc query access: Give your on-call engineers the ability to run arbitrary queries against your telemetry, not just view pre-built dashboards.
- Evaluate agent-driven observability: Consider platforms like Firetiger that use AI agents to query across all telemetry types dynamically, eliminating the need to pre-build dashboards for every failure mode.
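The second item above, correlation via shared identifiers, is what makes cross-pillar pivots possible. A minimal sketch, with hypothetical in-memory data standing in for real log and trace backends:

```python
# Sketch of cross-pillar correlation via a shared trace_id.
# The data below is invented for illustration.

logs = [
    {"trace_id": "abc123", "level": "ERROR", "msg": "upstream timeout"},
    {"trace_id": "def456", "level": "INFO",  "msg": "request ok"},
]
traces = {
    "abc123": ["gateway", "auth-service", "billing-service"],
    "def456": ["gateway", "auth-service"],
}

# Step 1: logs narrow the search to failing requests...
error_ids = {entry["trace_id"] for entry in logs if entry["level"] == "ERROR"}

# Step 2: ...and the shared trace_id pulls up each failing
# request's path through the system.
for trace_id in sorted(error_ids):
    print(trace_id, "->", " -> ".join(traces[trace_id]))
```

Without the shared `trace_id`, step 2 is impossible: you would know that errors occurred, but not which path through the system produced them.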