What is high-cardinality data in observability?
In data systems, cardinality refers to the number of unique values a particular dimension or field can take. In observability -- the practice of monitoring and understanding software system behavior through metrics, logs, and traces -- cardinality determines how much data your monitoring system actually needs to track and how much it costs to do so.
Low-cardinality dimensions have a small, bounded set of values. HTTP status codes are a classic example: a few dozen are defined, and in practice a handful (200, 301, 404, 500) account for most traffic. HTTP methods have even fewer (GET, POST, PUT, DELETE, PATCH). The environment dimension typically has two or three values (production, staging, development). These dimensions are easy for any monitoring system to handle because the total number of unique combinations stays manageable.
High-cardinality dimensions are different. Customer IDs might have thousands of unique values. User IDs could have millions. Trace IDs and request IDs are unique per event, potentially reaching billions. Container IDs, pod names, and other Kubernetes orchestration identifiers change constantly as infrastructure scales up and down. These are precisely the dimensions that engineers need when debugging real production issues -- you need to filter by a specific customer, trace a specific request, or isolate behavior to a specific container. But they are also the dimensions that cause traditional observability tools to break down, slow down, or become prohibitively expensive.
Understanding why high cardinality matters, why traditional tools struggle with it, and how modern architectures are solving the problem is essential for any engineering team that needs to monitor complex systems at scale.
Why do traditional observability tools struggle with high cardinality?
The core issue is architectural: most traditional observability backends were designed around pre-aggregated time series, and high-cardinality dimensions cause the number of time series to explode combinatorially.
A time series is a sequence of data points indexed by time, identified by a unique combination of a metric name and a set of key-value tag pairs. For example, http_request_duration{service="api", endpoint="/users", status="200"} is one time series. Change any tag value and you get a different series. The total number of time series a system tracks is the product of the unique values across all tag dimensions.
Consider a simple metric with four tags:
- service: 10 unique values
- endpoint: 50 unique values
- status_code: 5 unique values
- region: 3 unique values
That produces up to 10 x 50 x 5 x 3 = 7,500 time series. Manageable. Now add customer_id with 1,000 unique values. The total jumps to 7.5 million series. Add Kubernetes orchestration tags -- pod (hundreds of values), node (dozens), container (a few per pod) -- and you can easily reach tens of millions or hundreds of millions of unique series. Each series must be indexed, stored, and made queryable.
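The arithmetic is easy to check. A minimal sketch using the example numbers above (the product is a worst case; real deployments usually see fewer series because not every tag combination actually occurs):

```python
from math import prod

# Unique values per tag dimension, from the example above
dimensions = {"service": 10, "endpoint": 50, "status_code": 5, "region": 3}
baseline = prod(dimensions.values())
print(f"{baseline:,} series")  # 7,500 series: manageable

# A single high-cardinality tag multiplies the total across all others
dimensions["customer_id"] = 1_000
exploded = prod(dimensions.values())
print(f"{exploded:,} series")  # 7,500,000 series
```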
One company discovered this the hard way when Kubernetes orchestration tags were silently generating thousands of time series that no one had anticipated. Each pod/node/container combination was a new series, and because pods are ephemeral -- created and destroyed as the system scales -- the total number of series generated over time was far higher than the number active at any given moment. The infrastructure team was paying for millions of series that existed for minutes before being replaced by new ones.
This combinatorial explosion creates three interrelated problems.
Storage and indexing costs grow linearly with cardinality. Traditional time-series databases maintain inverted indexes and in-memory data structures for every active series. More series means more memory, more disk, and more CPU for compaction and garbage collection. Commercial observability platforms pass these costs to users, typically charging per custom metric or per unique time series. Adding a high-cardinality dimension like customer ID does not just increase costs proportionally -- it multiplies them across every other dimension the metric already has.
Pre-aggregation destroys the granularity you need. To manage cardinality, traditional systems encourage or require pre-aggregation: computing sums, averages, or percentiles at write time and discarding the raw data points. This reduces the number of stored series but permanently destroys information. If you pre-aggregate your latency metric across all customers, you cannot later ask "what was the P99 latency for customer X last Tuesday?" -- that data no longer exists. Pre-aggregation forces you to decide at instrumentation time which questions you will want to ask later. In practice, you will always get this wrong for the question that matters most during an incident.
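The information loss is easy to demonstrate with hypothetical data: once only a global average survives write-time aggregation, the per-customer tail question has no answer, because it requires raw samples that were discarded.

```python
import random
import statistics

random.seed(7)

# Hypothetical raw latency samples (ms): many fast requests from one
# customer, a few slow ones from another
raw = [("cust_a", random.lognormvariate(3, 0.5)) for _ in range(1000)]
raw += [("cust_x", random.lognormvariate(5, 0.5)) for _ in range(50)]

# Write-time aggregation keeps only a single global average...
global_avg = statistics.fmean(latency for _, latency in raw)

# ...which cannot answer "what was the P99 for customer X?". Answering
# that requires the raw per-customer samples that pre-aggregation drops.
cust_x_latencies = [lat for cust, lat in raw if cust == "cust_x"]
p99_x = statistics.quantiles(cust_x_latencies, n=100)[98]
```

With this data the per-customer P99 is an order of magnitude above the global average, exactly the kind of signal a pre-aggregated metric hides.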
Vendor-imposed limits create hard ceilings. One company hit their metrics vendor's 35-attribute limit on a single metric, which directly prevented them from adding customer-specific dimensions to their most critical measurements. The workaround was to create separate metrics for different customer segments, fragmenting their monitoring setup and making cross-customer analysis impossible from a single query. Another platform engineering team was advised by their vendor to avoid high-cardinality tagging entirely. The vendor's guidance was to pre-aggregate at the application layer -- meaning application developers had to understand metric types (counters vs. gauges vs. histograms), aggregation semantics (sum vs. average vs. quantile), and cardinality implications before emitting any telemetry. This transforms instrumentation from a simple "record what happened" operation into a complex engineering decision with cost consequences.
Commercial platforms have introduced features to mitigate these issues. Datadog's Metrics Without Limits, for example, allows you to ingest high-cardinality data but configure which tag combinations are queryable after the fact. In theory, you send everything and choose what to pay for at query time. In practice, these are workarounds rather than solutions. You still pay for ingesting the full volume of high-cardinality data. The queryable aggregations are computed post-hoc from the ingested data, but you lose the ability to freely explore dimensions you did not pre-select. And the pricing model still penalizes cardinality -- you are just given more granular knobs to control which cardinality you pay for.
The net result is a chilling effect on instrumentation. Teams become afraid to add tags because of cost. One engineering team described the progression: in the early days of adopting their observability platform, every new metric and tag brought genuine visibility and felt like a superpower. Over time, as costs grew, the same action that once brought clarity brought dread. Engineers hesitated to add a tag because the last time someone did, it caused a cardinality explosion that blew up the monthly bill. The platform that was supposed to provide visibility was actively discouraging the team from seeking it.
This is the fundamental tension: high-cardinality data is exactly what you need to debug real production issues -- specific customers, specific requests, specific infrastructure components -- and it is exactly what traditional observability tools penalize you for collecting.
How can modern data architectures handle high-cardinality observability data?
The solution to the high-cardinality problem requires rethinking the storage and query model rather than adding workarounds to an architecture that was not designed for it. Modern approaches draw on techniques from the data engineering world -- columnar storage, open table formats, and query-time computation -- to decouple cost from cardinality.
Columnar storage on object storage eliminates the per-series cost model. Instead of storing data as pre-aggregated time series in a specialized database, modern systems store raw telemetry data points in columnar file formats like Apache Parquet on commodity object storage like Amazon S3. Parquet organizes data by column rather than by row, which means reading a single dimension (like customer_id) does not require scanning every other field in the dataset. Object storage costs roughly $0.023 per gigabyte per month, and that price is the same whether your data contains 5 unique tag values or 5 million. There is no per-series charge, no cardinality penalty, and no combinatorial explosion in the cost model.
This is a fundamental economic shift. In the traditional model, cost scales with metrics x cardinality x retention. In the columnar-on-object-storage model, cost scales with data volume x retention. Adding a new dimension to your telemetry increases data volume marginally (one more column) but does not trigger the multiplicative explosion that the per-series model creates.
Open table formats provide structure and performance. Raw files on object storage would be impractical to query without organization. Table formats like Apache Iceberg add metadata layers on top of Parquet files: partition management, schema evolution, snapshot isolation, and statistical metadata about file contents. This metadata allows query engines to perform partition pruning and file skipping -- determining which files could possibly contain relevant data before reading any actual content. A query filtering for a specific customer ID can skip the vast majority of data files based on metadata alone, reading only the subset that might contain matching rows.
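A simplified sketch of the file-skipping idea, using hypothetical min/max column statistics of the kind Iceberg and Parquet record per data file (real pruning also uses partition values, null counts, and row-group statistics):

```python
# Hypothetical per-file metadata: min/max values for the customer_id
# column, recorded when each data file was written
files = [
    {"path": "data/f1.parquet", "cust_min": "cust_000", "cust_max": "cust_199"},
    {"path": "data/f2.parquet", "cust_min": "cust_200", "cust_max": "cust_399"},
    {"path": "data/f3.parquet", "cust_min": "cust_400", "cust_max": "cust_599"},
]

def candidate_files(target: str) -> list[str]:
    """Keep only files whose min/max range could contain the target value."""
    return [f["path"] for f in files if f["cust_min"] <= target <= f["cust_max"]]

# A query for one customer reads one file; the rest are skipped on
# metadata alone, without fetching any data from object storage
matches = candidate_files("cust_250")
print(matches)  # ['data/f2.parquet']
```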
Query-time computation replaces write-time aggregation. Instead of pre-computing aggregations when data is written and discarding the raw values, modern systems store the raw data and compute aggregations when queries are executed. Analytical query engines optimized for columnar data -- DuckDB, Trino, and purpose-built systems in this lineage -- can scan billions of rows efficiently using vectorized execution, predicate pushdown, and columnar compression. A query like "compute P99 latency by customer for the last hour" might scan tens of gigabytes of Parquet data, but it does so by reading only the relevant columns (timestamp, customer_id, latency) and skipping files outside the time range.
This approach changes the tradeoff from "fast queries, expensive data" to "slightly slower queries, dramatically cheaper data." A query that might return in 200 milliseconds against a pre-aggregated time-series database might take 2-5 seconds against a columnar data lake. For a human staring at a dashboard, that difference matters. For an automated system or AI agent running hundreds of queries in parallel, it is negligible. And the freedom to ask any question of the data -- not just the questions you pre-configured at instrumentation time -- is worth far more than the latency difference.
The practical impact is measurable. Consider a B2B SaaS company with 1,000 customers, 50 services, 200 endpoints, and 5 status code categories. In a traditional time-series system, adding a customer_id tag produces up to 50 million unique series for a single metric. At typical commercial pricing of a few cents per custom metric per month, a handful of high-cardinality metrics can easily cost tens of thousands of dollars monthly. The same data stored as raw Parquet on S3 might occupy a few hundred gigabytes per month, costing single-digit dollars in storage. The compute cost for queries is additional but bounded by the actual query volume, not by the cardinality of the data at rest.
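The same comparison as back-of-envelope arithmetic, with the data volume (an assumed 300 GB/month of compressed telemetry) as an illustrative figure:

```python
# Worst-case series count for the example metric: product of tag cardinalities
series = 1_000 * 50 * 200 * 5  # customers x services x endpoints x statuses
print(f"{series:,} unique series")  # 50,000,000

# Same telemetry as raw Parquet on S3, at the ~$0.023/GB-month rate cited
# above; assumed volume of 300 GB/month is hypothetical
storage_cost = 300 * 0.023
print(f"${storage_cost:.2f}/month")  # $6.90/month, independent of cardinality
```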
One AI inference platform (Firetiger is one example of a company building on this architecture) demonstrated the approach in practice: storing all telemetry in Apache Iceberg on S3, with a query engine built on DuckDB that pushes filters down to the Parquet level. The system handles per-customer metrics across hundreds of organizations without cardinality-based cost scaling, enabling features like per-customer SLO monitoring that would be economically impractical in a traditional metrics backend.
The shift from pre-aggregated time series to columnar storage with query-time computation is not just a cost optimization. It is an architectural change that removes the artificial ceiling on what questions engineers can ask of their data. When adding a dimension costs nothing extra, teams instrument aggressively. When queries can explore any combination of dimensions, investigations go deeper. When retention is cheap, historical analysis becomes practical. The high-cardinality data that traditional tools treat as a problem becomes, in a modern architecture, the raw material for understanding complex systems at the level of detail that real debugging requires.
Where to start
- Inventory your current tag cardinality: Query your observability platform to find which tags/labels have the most unique values and are driving the most time series.
- Identify tags that are never queried: Check which high-cardinality tags are ingested but never appear in dashboard filters or alert conditions. Remove them.
- Evaluate your cost per unique time series: Understand how your vendor charges for cardinality -- per series, per metric point, or per GB of storage.
- Consider a columnar storage approach: Evaluate platforms built on Apache Iceberg or ClickHouse (like Firetiger) where high cardinality is a storage problem, not an indexing cost problem.
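As a starting point for the inventory step, a stdlib sketch that ranks labels by unique-value count, assuming you can export the label sets of active series from your backend's API (the label names and values here are hypothetical):

```python
from collections import defaultdict

# Hypothetical export of active series' label sets, one dict per series
series_labels = [
    {"service": "api", "pod": "api-7f9c-x2x1", "status": "200"},
    {"service": "api", "pod": "api-7f9c-k8q2", "status": "200"},
    {"service": "billing", "pod": "billing-5d2a-m1n3", "status": "500"},
]

# Collect the distinct values seen for each label across all series
unique_values: dict[str, set[str]] = defaultdict(set)
for labels in series_labels:
    for key, value in labels.items():
        unique_values[key].add(value)

# Rank labels by cardinality: here 'pod' drives the most series
for key, values in sorted(unique_values.items(), key=lambda kv: -len(kv[1])):
    print(key, len(values))
```

Run against a real export, this surfaces the orchestration-style labels (pods, containers, request IDs) that multiply series counts without anyone having chosen them deliberately.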