Learning Center/Observability Architecture

What is a data lake for observability?

Observability has traditionally meant sending your logs, metrics, and traces to a specialized SaaS platform that indexes everything in real time for sub-second dashboard queries. This model works, but it comes with a cost structure that scales poorly: you pay per gigabyte ingested, per metric cardinality, per data point retained. As data volumes grow, the bill grows faster.

A data lake for observability takes a fundamentally different approach. Instead of shipping telemetry into a proprietary, pre-indexed database, you store it on low-cost object storage (such as Amazon S3 or Google Cloud Storage) in open file formats like Apache Parquet. The data is not pre-aggregated or pre-indexed. Instead, it sits in its raw, full-fidelity form and is queried on demand by compute engines -- warehouse-scale systems such as ClickHouse, Snowflake, or Databricks, or lightweight embedded engines like DuckDB -- that spin up only when needed.

This architectural pattern has been common in data engineering for years, powering analytics at companies that process petabytes of event data. What is new is its application to operational telemetry, driven by the realization that the consumers of observability data are changing. AI agents, automated runbooks, and programmatic analysis tools do not need sub-second dashboard refreshes. They need access to all the data, the ability to ask arbitrary questions, and they are patient enough to wait a few seconds for answers.

Why are data lakes becoming important for observability?

The economics of traditional observability have reached a breaking point for many organizations. When a monitoring vendor charges per ingested gigabyte or per unique metric time series, teams are forced into difficult tradeoffs. They drop high-cardinality fields, sample traces aggressively, reduce log retention, or simply avoid instrumenting parts of their systems that generate too much data. The result is that the observability platform has gaps precisely where the most interesting debugging information lives.

Data lakes invert this cost equation. Object storage costs roughly one-tenth of what traditional database storage costs per gigabyte, and the price continues to drop. When storage is cheap, you stop making tradeoffs about what data to keep. You can retain full request traces with all their attributes, keep every log line at full fidelity, and store high-cardinality metrics that would be prohibitively expensive in a traditional platform. Firetiger, for example, built its observability data lake on Apache Iceberg with Parquet files on S3, queried by AI agents using an embedded DuckDB engine -- and because the underlying storage cost was no longer the constraint, it could offer customers near-unlimited data ingestion at production scale.
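The storage-price gap can be sketched with back-of-envelope arithmetic. The per-gigabyte prices below are illustrative assumptions, not quotes from any vendor.

```python
# Back-of-envelope monthly retention cost comparison. Both per-GB
# prices are illustrative assumptions, not real vendor pricing.
S3_PER_GB_MONTH = 0.023       # assumed object-storage price
INDEXED_PER_GB_MONTH = 0.25   # assumed indexed-platform retention price

gb_retained = 50 * 1000       # 50 TB of telemetry kept for a month

lake_cost = gb_retained * S3_PER_GB_MONTH
indexed_cost = gb_retained * INDEXED_PER_GB_MONTH
print(f"data lake: ${lake_cost:,.0f}/mo vs indexed: ${indexed_cost:,.0f}/mo")
```

At roughly a tenth of the price per gigabyte, the decision about what to drop largely disappears.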

The second driver is analytical flexibility. Traditional observability platforms require you to decide upfront which questions you want to ask. You create dashboards, define alerts, and choose which fields to index. This works for known failure modes, but production incidents rarely follow the script. When something unexpected breaks, you need to ask questions you did not anticipate, slicing data across dimensions that were never indexed. With a data lake, the raw data is always available for arbitrary queries. You do not lose resolution to pre-aggregation, and you are not limited to pre-built views.

The third driver is the rise of AI agents in operations. Agents approach observability data differently than humans do. A human running an investigation might execute a handful of queries, scanning dashboards for anomalies. An agent might execute hundreds of queries in parallel, testing hypotheses across time windows, comparing dimensions, and ruling out false correlations. Agents are cardinality-hungry: they want the per-request attributes and full tag sets that traditional platforms would drop to control costs. They are also patient: an agent running ten parallel investigations does not care if each query takes three seconds instead of 300 milliseconds. This access pattern -- high query volume over rich data with relaxed latency requirements -- is exactly what data lake architectures excel at.
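A rough sketch of this fan-out access pattern, using the standard library's thread pool: the hypothesis names and SQL are invented for illustration, and `run_query` is a placeholder for a real call to the lake's query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of hypotheses an agent might test in parallel during
# an incident. The names and SQL are illustrative, not a real API.
HYPOTHESES = {
    "error spike by region": "SELECT region, count(*) FROM logs WHERE level = 'error' GROUP BY region",
    "latency by deploy version": "SELECT version, avg(duration_ms) FROM spans GROUP BY version",
    "retry storms by upstream": "SELECT upstream, sum(retries) FROM spans GROUP BY upstream",
}

def run_query(sql: str) -> str:
    # Placeholder: a real implementation would submit the SQL to the
    # lake's query engine; seconds-scale latency is acceptable here.
    return f"rows for: {sql[:24]}..."

# Fan out every hypothesis at once. Because compute is provisioned per
# query, the burst does not degrade anyone else's workload.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(HYPOTHESES, pool.map(run_query, HYPOTHESES.values())))
```

Each result either confirms or eliminates a hypothesis, and the agent iterates from there.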

The separation of storage and compute is what makes this practical. In a data lake architecture, storage is durable and shared on object storage, while compute is provisioned independently on demand. Query engines can scale horizontally to handle bursts of concurrent queries, and each query gets isolated resources so a runaway analysis does not starve other work. This means you can support both a human doing quick lookups and an agent running deep investigations against the same underlying data, without one degrading the other.

What are the trade-offs of data lake observability vs. traditional SaaS?

The most immediate trade-off is query latency. Traditional observability platforms are optimized for the dashboard experience: type a query, see results in under a second. Data lake architectures typically deliver results in seconds to minutes for complex queries, depending on the volume of data scanned and the compute resources allocated. For a human staring at a pager alert at 3 AM, that difference matters. For an agent methodically working through a diagnostic playbook, it does not.

This latency gap is narrowing. Modern query engines with predicate pushdown, columnar storage optimizations, and intelligent metadata layers can eliminate large amounts of data without reading it. For example, a well-partitioned Iceberg table can use file-level statistics to skip entire partitions, meaning a query that logically spans terabytes might physically read only megabytes. But the gap has not closed entirely, and for use cases that require true real-time dashboards with instant drill-down, traditional indexed databases still have an edge.
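The file-skipping mechanism described above can be sketched in a few lines. This is a simplified model of what table formats like Iceberg do: each data file carries per-column min/max statistics in metadata, and the planner reads only files whose value range can overlap the query predicate. The file names and timestamp ranges here are invented.

```python
# Simplified model of metadata-based file skipping: each data file
# records min/max stats for a timestamp column, and planning keeps only
# files whose range can overlap the query's time window.
files = [
    {"path": "p1.parquet", "ts_min": 100, "ts_max": 199},
    {"path": "p2.parquet", "ts_min": 200, "ts_max": 299},
    {"path": "p3.parquet", "ts_min": 300, "ts_max": 399},
]

def files_to_scan(files, ts_lo, ts_hi):
    """Keep only files whose [ts_min, ts_max] overlaps [ts_lo, ts_hi]."""
    return [f["path"] for f in files
            if f["ts_max"] >= ts_lo and f["ts_min"] <= ts_hi]

print(files_to_scan(files, 250, 260))  # only p2.parquet needs to be read
```

Scale the same idea to millions of files and you get queries that logically span terabytes while physically reading megabytes.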

Cost structure is another area where the trade-off is more nuanced than it first appears. Data lake storage is dramatically cheaper per gigabyte, but you still pay for compute at query time. If you run a small number of queries over large datasets, the economics are very favorable. If you run an enormous number of queries continuously, the compute costs can add up. The key difference is that storage costs are predictable (you know how much data you have), while query costs depend on usage patterns. This contrasts with traditional platforms where you pay per data point ingested regardless of whether anyone ever queries it.

Control and data ownership represent a significant advantage for the data lake approach. Your telemetry lives in your own infrastructure, in open formats that any compatible tool can read. You are not locked into a vendor's proprietary query language or data format. If you want to switch query engines, run custom analysis with a data science tool, or feed the data into a machine learning pipeline, the data is already there in a standard format. With traditional SaaS observability, your data lives in the vendor's infrastructure, accessible only through their APIs and query interfaces.

Operational complexity is the primary disadvantage. Running a data lake for observability is not as simple as pointing a collector at an S3 bucket. You need an ingestion pipeline that can handle streaming data at scale, a table format that supports concurrent writes and reads, continuous compaction to prevent the small files problem from degrading query performance, metadata management, data retention policies, and a query engine that understands how to efficiently scan columnar data. Traditional SaaS platforms handle all of this for you. With a data lake, someone has to build and maintain these systems, whether that is your team, an open-source community, or a vendor that provides a managed data lake experience.

The operational burden is real and should not be underestimated. Teams that have built real-time observability on data lakes report that table maintenance alone -- compaction, snapshot expiration, orphan file cleanup, and retention enforcement -- requires continuous engineering attention. One team found that off-the-shelf compaction solutions could not keep up with their streaming write volume and had to build a custom, event-driven compaction system. This is the kind of hidden complexity that makes data lake observability powerful but demanding.
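To make the compaction problem concrete, here is a toy sketch of the planning step, assuming a simple size-based strategy: greedily group small files into rewrite batches near a target file size. Real maintenance tooling (for example Iceberg's rewrite-data-files procedures) is far more involved, handling concurrent writers, deletes, and partition boundaries.

```python
# Toy compaction planner: greedily bin-pack small files into rewrite
# groups near a target size. A simplified illustration only -- real
# systems must also handle concurrency, deletes, and partitioning.
TARGET_BYTES = 128 * 1024 * 1024  # a common Parquet target file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group (name, size) pairs into compaction batches near the target size."""
    groups, current, size = [], [], 0
    for name, nbytes in file_sizes:
        current.append(name)
        size += nbytes
        if size >= target:
            groups.append(current)
            current, size = [], 0
    if len(current) > 1:  # a lone leftover file needs no rewrite
        groups.append(current)
    return groups
```

Even this toy version hints at the hard part: with streaming ingestion, new small files arrive faster than batch jobs can rewrite them, which is why some teams end up building event-driven compaction.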

For many organizations, the right answer is not choosing one approach over the other, but understanding where each excels. Traditional indexed platforms remain strong for real-time alerting and interactive debugging. Data lakes excel at deep analysis, long-term retention, agent-driven investigation, and any use case where you need to ask questions you did not anticipate. The architecture of modern observability is increasingly a combination of both, with OpenTelemetry providing a common instrumentation layer that can send data to multiple destinations simultaneously.

Where to start

  • Assess your current data retention costs: Calculate what you're paying per GB/month for long-term telemetry storage in your current platform.
  • Evaluate Apache Iceberg: Set up a proof of concept with Parquet files on S3 and an Iceberg catalog to understand the query experience.
  • Set up an OTel Collector for dual-write: Send telemetry to both your current platform and a data lake simultaneously during evaluation.
  • Consider agent-friendly architectures: Use platforms like Firetiger that are built on data lakes from the ground up, designed for AI agent query patterns rather than human dashboard patterns.
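The dual-write step can be sketched as an OpenTelemetry Collector configuration. This assumes the opentelemetry-collector-contrib distribution, which includes an `awss3` exporter for writing telemetry to a bucket; the endpoint, region, and bucket name below are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/vendor:            # existing SaaS platform (placeholder endpoint)
    endpoint: vendor.example.com:4317
  awss3:                  # contrib exporter writing to the lake's landing bucket
    s3uploader:
      region: us-east-1
      s3_bucket: my-telemetry-lake

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/vendor, awss3]  # same data to both destinations
```

Running both destinations side by side lets you compare query experience and cost on identical data before committing to either.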

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.