What is observability cost optimization?
Observability cost optimization is the practice of reducing what an organization spends on observability tooling while maintaining -- or even improving -- visibility into its systems. It involves systematically auditing metrics, managing cardinality, tiering data by value, and in some cases rethinking the underlying architecture of the observability stack.
As cloud-native systems become more complex and generate more telemetry data, observability bills have become one of the largest line items in many engineering budgets. For some organizations, observability spend rivals or exceeds the cost of the infrastructure being monitored.
Cost optimization is not about cutting corners on visibility. It is about recognizing that most organizations are paying for enormous amounts of data that nobody looks at, while the data they actually need is buried under the noise.
Why do observability costs spiral out of control?
Observability cost problems rarely start on day one. They follow a predictable lifecycle: early excitement, quiet accumulation, and eventual crisis.
The honeymoon phase
When a team first adopts an observability platform, every metric and tag delivers genuine value. Engineers add instrumentation liberally because each new data point brings clarity. Dashboards multiply. Custom metrics proliferate. The return on investment feels obvious.
The accumulation phase
Over time, the relationship shifts. Teams move on to new projects, but the metrics they created stay behind. Dashboards go stale. Monitors fire on thresholds nobody remembers setting. Nobody removes old instrumentation because nobody is sure what is still needed -- and the fear of losing visibility during an incident makes deletion feel risky. The telemetry volume grows quietly in the background, and with it, the bill.
The crisis phase
Eventually, someone notices that the observability bill has doubled or tripled. A cost review reveals that the team is paying for thousands of custom metrics, many of which are not used in any dashboard, monitor, or alert. But by now, the tangle of instrumentation is so complex that untangling it feels like a project in itself.
The incentive misalignment at the root
The fundamental reason observability costs spiral is a misalignment between the vendor's business model and the customer's interests. Most observability vendors use pricing models based on data volume: the more metrics you send, the more you pay. For example, Datadog charges per-host ($15-31/host/month for infrastructure monitoring) with additional per-feature pricing for APM, logs, and custom metrics. New Relic charges per-GB of ingested data ($0.30/GB after a free tier) plus per-user fees ($549/user/month for full platform access). This means the vendor profits when you ingest more data, regardless of whether that data delivers any value to your engineering team.
This creates a perverse dynamic. The same action that once made engineers love the product -- adding a tag to get visibility into a new dimension -- eventually becomes something they dread, because last time someone did that, it caused an unexpected spike in the bill. The product punishes you for using it the way it was designed to be used.
The Kubernetes cardinality trap
One of the most common sources of hidden cost is tag cardinality, and Kubernetes environments are particularly prone to this problem.
Every unique combination of tag values creates a new time series. If a metric has tags for pod_name, node, namespace, deployment, and container, the number of time series can explode combinatorially. One platform engineering team at a mid-size SaaS company discovered that Kubernetes orchestration tags alone were responsible for a significant portion of their observability bill. Tags like pod_name -- which changes with every deployment -- were generating thousands of ephemeral time series that existed for minutes before the pods were replaced, but the data points were already ingested and billed.
This is not an unusual story. In containerized environments, tag cardinality is the single largest driver of unexpected cost growth.
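The combinatorial explosion is easy to see with back-of-envelope arithmetic. The sketch below estimates worst-case time-series counts for a metric from the unique values of each tag; the tag cardinalities are illustrative, not measurements from a real cluster.

```python
# Rough cardinality estimate: a metric's worst-case time-series count is
# the product of the unique values of its tags (all combinations observed).
# These tag value counts are illustrative assumptions.
from math import prod

tag_cardinality = {
    "namespace": 20,
    "deployment": 150,
    "pod_name": 5000,  # churns on every rollout
}

def series_count(tags):
    """Worst-case number of unique time series for one metric."""
    return prod(tag_cardinality[t] for t in tags)

with_pod = series_count(["namespace", "deployment", "pod_name"])
without_pod = series_count(["namespace", "deployment"])
print(with_pod, without_pod)  # 15000000 vs 3000: pod_name alone multiplies cost 5000x
```

In practice real clusters rarely hit the full cross-product, but ephemeral tags like pod_name push the count toward it with every deployment.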
Unused metrics and zombie dashboards
The other major cost driver is simpler: metrics that nobody uses. Studies of real-world observability deployments consistently find that a large percentage of custom metrics are not referenced by any dashboard, monitor, SLO, or notebook. They are ingested, stored, and billed for, but never queried.
Similarly, dashboards accumulate. Teams create dashboards for specific investigations, then never return to them. The dashboards persist, and more importantly, the metrics feeding them persist.
How can teams optimize observability costs?
Organizations that have approached this problem systematically have achieved dramatic results. Reductions of 60% or more in observability spending are not uncommon, and some teams have achieved this without any loss of meaningful visibility.
The key strategies fall into four categories.
1. Metric auditing
The first and highest-impact step is simply finding out what you are paying for and whether anyone is using it. A metric audit involves:
- Inventorying all custom metrics and their associated costs. Most observability platforms provide APIs or usage dashboards that can surface this data.
- Cross-referencing metrics against usage. For each metric, determine whether it appears in any active dashboard, monitor, SLO, or alert rule. Metrics that appear nowhere are candidates for removal.
- Identifying stale dashboards and monitors. A dashboard that has not been viewed in six months, or a monitor that has been muted indefinitely, is a signal that the metrics feeding it may also be unnecessary.
- Tracking cost by team or service. Attributing observability cost to the teams generating the data helps create accountability and awareness.
The challenge is that metric auditing requires ongoing effort. A one-time audit yields significant savings, but without continuous attention, the same accumulation pattern repeats within months. This is why some organizations use automated auditing tools that continuously scan for unused metrics and surface optimization opportunities.
2. Cardinality management
Cardinality -- the number of unique time series created by combinations of tag values -- is the single most impactful lever for cost reduction in most environments.
Effective cardinality management involves:
- Identifying high-cardinality tags. Tags with many unique values (like pod_name, request_id, or user_id) generate proportionally more time series. Some were added for a one-time debugging session and never removed.
- Removing tags that are not queried. If a tag is never used in a group-by, filter, or aggregation, it is inflating cost without providing value.
- Aggregating where possible. Instead of tracking per-pod metrics, consider whether per-deployment or per-service aggregations provide sufficient visibility.
- Setting cardinality limits. Some platforms allow you to cap the number of unique tag values for a given metric, preventing unexpected explosions.
A practical example: a metric tracking HTTP request latency with tags for service, endpoint, method, status_code, pod_name, and availability_zone might generate tens of thousands of time series. If pod_name is never used in any query, removing that single tag could reduce the time series count -- and cost -- by an order of magnitude.
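One mechanical way to realize that reduction is to strip unqueried tags from each series key before export, so many per-pod series collapse into one. This is a minimal sketch with a hypothetical series-key representation, not any vendor's API.

```python
# Minimal sketch of tag stripping before export: drop tags no query uses,
# collapsing many per-pod series into one per-service series.
DROP_TAGS = {"pod_name", "container_id"}  # assumed to be unqueried

def strip_tags(series_key):
    """series_key: frozenset of (tag, value) pairs identifying one series."""
    return frozenset((t, v) for t, v in series_key if t not in DROP_TAGS)

raw_series = [
    frozenset({("service", "api"), ("pod_name", "api-6d9f-abc12")}),
    frozenset({("service", "api"), ("pod_name", "api-6d9f-def34")}),
    frozenset({("service", "api"), ("pod_name", "api-6d9f-ghi56")}),
]

collapsed = {strip_tags(s) for s in raw_series}
print(len(raw_series), "->", len(collapsed))  # 3 -> 1
```

Most collection pipelines (for example, an OpenTelemetry collector with attribute-dropping processors) can apply this kind of transformation before data ever reaches the billed platform.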
3. Data tiering
Not all observability data has the same value over time. Data from the last hour is critical for incident response. Data from last week is useful for trend analysis. Data from six months ago is rarely queried and primarily kept for compliance or capacity planning.
Data tiering strategies include:
- High-resolution recent data. Keep full-fidelity, unaggregated data for the most recent window (e.g., the last 24-48 hours) where it is most likely to be needed for active debugging.
- Aggregated historical data. Roll up older data into lower-resolution aggregations (e.g., 1-minute or 5-minute averages) that are sufficient for trend analysis but dramatically cheaper to store.
- Cold storage for compliance. For data that must be retained for regulatory or audit purposes but is rarely queried, move it to the cheapest available storage tier.
Keeping every data point at full resolution indefinitely is rarely justified by how often historical data is actually queried.
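The roll-up step in a tiering pipeline is conceptually simple: bucket raw points into fixed windows and keep only an aggregate per window. The sketch below uses epoch-second timestamps and synthetic data.

```python
# Sketch of rolling up raw points into 5-minute averages.
# Timestamps are epoch seconds; the data is synthetic.
from collections import defaultdict
from statistics import mean

WINDOW = 300  # 5 minutes, in seconds

def rollup(points):
    """points: iterable of (timestamp, value). Returns {window_start: average}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % WINDOW].append(value)
    return {start: mean(vals) for start, vals in buckets.items()}

raw = [(0, 10.0), (60, 20.0), (301, 40.0)]
print(rollup(raw))  # {0: 15.0, 300: 40.0}
```

A real pipeline would typically keep several aggregates per window (min, max, count, sum) so percentile-style questions remain answerable after the raw points are gone.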
4. Dual-write strategies for migration
Organizations considering a move to a different observability architecture often use a dual-write strategy during the transition. This involves sending telemetry data to both the existing platform and the new one simultaneously, allowing teams to validate the new system before decommissioning the old one. While dual-writing temporarily increases total cost, it reduces risk. The key is to set a clear timeline so the dual-write period does not become permanent.
What role does architecture play in observability cost management?
The optimization strategies described above -- auditing, cardinality management, and tiering -- are essential practices, but they operate within the constraints of the existing architecture. For some organizations, the most impactful change is rethinking the architecture itself.
The traditional model: pre-aggregated vendor databases
Most commercial observability platforms store data in proprietary, pre-aggregated time-series databases optimized for sub-second dashboard queries. This architecture is excellent for the human experience of debugging during an incident, but it comes with a cost structure that is inherently expensive at scale. The database must index every time series for fast retrieval, which means every unique combination of metric name and tag values consumes storage and compute resources. The vendor passes this cost through as a per-metric-point or per-time-series charge.
This is why cardinality is so expensive in traditional platforms: each new tag combination creates a new entry in a high-performance index designed for millisecond query latency.
The data lake alternative: object storage and open formats
A fundamentally different approach is emerging, built on the same data lake architecture that transformed analytics and data engineering. Instead of pre-aggregating data into a proprietary database, this approach stores raw telemetry in open formats like Apache Parquet on object storage (such as Amazon S3), organized using table formats like Apache Iceberg.
This architecture changes the cost model in several important ways:
- Storage is dramatically cheaper. Object storage costs roughly $0.023 per GB per month, orders of magnitude less than the effective per-GB cost of proprietary observability databases.
- Cardinality becomes a storage problem, not an indexing problem. High-cardinality data can be retained without the cost penalty imposed by traditional time-series indexes. Columnar formats like Parquet allow queries to read only the columns they need, making selective access to wide datasets practical.
- Compute scales independently of storage. Query engines can be provisioned on demand (e.g., on serverless compute) and scaled horizontally. You pay for compute only when queries run, not for maintaining a permanently provisioned index.
- Data is stored in open formats. There is no vendor lock-in at the storage layer. The same data can be queried by multiple tools, and switching engines does not require re-ingesting data.
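The storage-cost gap can be sanity-checked with simple arithmetic. The S3 rate is the one cited above; the proprietary-platform effective rate is an illustrative assumption, since vendors price per host, per metric, or per GB ingested rather than per GB stored.

```python
# Back-of-envelope monthly storage cost comparison.
# PROPRIETARY_PER_GB is a hypothetical effective rate, not a quoted price.
S3_PER_GB = 0.023        # S3 standard storage, USD per GB per month
PROPRIETARY_PER_GB = 2.50  # assumed effective rate for a hosted platform

gb_stored = 10_000  # 10 TB of telemetry

s3_cost = gb_stored * S3_PER_GB
proprietary_cost = gb_stored * PROPRIETARY_PER_GB
print(round(s3_cost, 2), round(proprietary_cost, 2))  # roughly 230 vs 25000
```

Even if the assumed proprietary rate is off by an order of magnitude, the comparison illustrates why "store everything, query on demand" becomes economically viable on object storage.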
The tradeoff is query latency. A query against Parquet files on S3 will not return in milliseconds the way a pre-aggregated time-series database will. For a human impatiently refreshing a dashboard during an incident, this matters. But for automated systems, scheduled analyses, or AI agents that can run many queries in parallel, the latency tradeoff is often acceptable.
For example, Firetiger built its observability datastore on Apache Iceberg because AI agents favor broad exploration of rich datasets over sub-second dashboard queries. Agents are cardinality-hungry -- they want the high-cardinality data that traditional systems force you to drop -- and they compensate for higher per-query latency by running many investigations in parallel.
Making the shift practical
Moving from a traditional observability platform to a data lake architecture is not an overnight change. Practical steps include:
- Start with the highest-cost, lowest-value data. Metrics that are expensive due to cardinality but rarely queried by humans are ideal candidates for migration to cheaper storage.
- Preserve high-value real-time paths. Critical alerts and on-call dashboards may still need low-latency access. A hybrid approach -- real-time for the most critical signals, data lake for everything else -- is a common pattern.
- Invest in data freshness. Streaming writes to Iceberg tables, combined with continuous compaction to manage the "small files problem," are necessary to avoid trading cost savings for stale data.
The broader shift
The architectural question is ultimately about separating the cost of storing data from the cost of querying it. Traditional observability platforms bundle these together, which means you pay the query-optimized price for every data point, even if it is never queried. Data lake architectures decouple them, allowing organizations to store everything cheaply and pay for compute only when they actually need answers.
This mirrors the transformation that happened in data analytics a decade ago, when data warehouses gave way to data lakes and lakehouse architectures. Observability is following the same trajectory, driven by the same economic forces: data volumes are growing faster than budgets, and the old cost models are breaking.
For engineering teams facing an observability cost crisis, the path forward involves both tactical optimization -- auditing, cardinality management, and tiering -- and strategic architectural choices about how telemetry data is stored and queried. The organizations that address both levels will be best positioned to maintain the visibility they need without the costs they cannot sustain.
Where to start
- Audit unused metrics: Query your observability platform to find metrics, dashboards, and monitors that haven't been viewed in 90+ days. Delete or archive them.
- Check your cardinality: Identify high-cardinality tags (like pod_name or container_id) that inflate costs without being queried. Remove or aggregate them.
- Set up dual-write for evaluation: Use an OpenTelemetry collector to send telemetry to both your current platform and a candidate replacement simultaneously.
- Evaluate data lake architecture: Assess whether moving historical data to object storage (S3 + Apache Iceberg) could reduce your long-term storage costs.
- Implement outcome-oriented monitoring: Deploy a system like Firetiger that focuses agents on what matters to your business rather than ingesting and indexing everything.