Learning Center/Observability Architecture

What is a data lake for observability?

Observability has traditionally meant sending your logs, metrics, and traces to a specialized SaaS platform that indexes everything in real time for sub-second dashboard queries. This model works, but it comes with a cost structure that scales poorly: you pay per gigabyte ingested, per metric cardinality, per data point retained. As data volumes grow, the bill grows faster.

A data lake for observability takes a fundamentally different approach. Instead of shipping telemetry into a proprietary, pre-indexed database, you store it on low-cost object storage (such as Amazon S3 or Google Cloud Storage) in open file formats like Apache Parquet. The data is not pre-aggregated or pre-indexed. Instead, it sits in its raw, full-fidelity form and is queried on demand by compute engines -- warehouse-scale systems such as ClickHouse, Snowflake, or Databricks, or lightweight embedded engines like DuckDB -- that spin up only when needed.

This architectural pattern has been common in data engineering for years, powering analytics at companies that process petabytes of event data. What is new is its application to operational telemetry, driven by the realization that the consumers of observability data are changing. AI agents, automated runbooks, and programmatic analysis tools do not need sub-second dashboard refreshes. They need access to all the data, the ability to ask arbitrary questions, and they are patient enough to wait a few seconds for answers.

Why are data lakes becoming important for observability?

The economics of traditional observability have reached a breaking point for many organizations. When a monitoring vendor charges per ingested gigabyte or per unique metric time series, teams are forced into difficult tradeoffs. They drop high-cardinality fields, sample traces aggressively, reduce log retention, or simply avoid instrumenting parts of their systems that generate too much data. The result is that the observability platform has gaps precisely where the most interesting debugging information lives.

Data lakes invert this cost equation. Object storage costs roughly one-tenth of what traditional database storage costs per gigabyte, and the price continues to drop. When storage is cheap, you stop making tradeoffs about what data to keep. You can retain full request traces with all their attributes, keep every log line at full fidelity, and store high-cardinality metrics that would be prohibitively expensive in a traditional platform. Firetiger, for example, built its observability data lake on Apache Iceberg with Parquet files on S3, queried by AI agents using an embedded DuckDB engine -- and because the underlying storage cost was no longer the constraint, it could offer customers near-unlimited data ingestion at production scale.
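The storage-price gap can be sketched with back-of-envelope arithmetic. The per-gigabyte prices below are illustrative assumptions, not quotes from any vendor.

```python
# Back-of-envelope monthly retention cost comparison. Both per-GB
# prices are illustrative assumptions, not real vendor pricing.
S3_PER_GB_MONTH = 0.023       # assumed object-storage price
INDEXED_PER_GB_MONTH = 0.25   # assumed indexed-platform retention price

gb_retained = 50 * 1000       # 50 TB of telemetry kept for a month

lake_cost = gb_retained * S3_PER_GB_MONTH
indexed_cost = gb_retained * INDEXED_PER_GB_MONTH
print(f"data lake: ${lake_cost:,.0f}/mo vs indexed: ${indexed_cost:,.0f}/mo")
```

At roughly a tenth of the price per gigabyte, the decision about what to drop largely disappears.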

The second driver is analytical flexibility. Traditional observability platforms require you to decide upfront which questions you want to ask. You create dashboards, define alerts, and choose which fields to index. This works for known failure modes, but production incidents rarely follow the script. When something unexpected breaks, you need to ask questions you did not anticipate, slicing data across dimensions that were never indexed. With a data lake, the raw data is always available for arbitrary queries. You do not lose resolution to pre-aggregation, and you are not limited to pre-built views.

The third driver is the rise of AI agents in operations. Agents approach observability data differently than humans do. A human running an investigation might execute a handful of queries, scanning dashboards for anomalies. An agent might execute hundreds of queries in parallel, testing hypotheses across time windows, comparing dimensions, and ruling out false correlations. Agents are cardinality-hungry: they want the per-request attributes and full tag sets that traditional platforms would drop to control costs. They are also patient: an agent running ten parallel investigations does not care if each query takes three seconds instead of 300 milliseconds. This access pattern -- high query volume over rich data with relaxed latency requirements -- is exactly what data lake architectures excel at.
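A rough sketch of this fan-out access pattern, using the standard library's thread pool: the hypothesis names and SQL are invented for illustration, and `run_query` is a placeholder for a real call to the lake's query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of hypotheses an agent might test in parallel during
# an incident. The names and SQL are illustrative, not a real API.
HYPOTHESES = {
    "error spike by region": "SELECT region, count(*) FROM logs WHERE level = 'error' GROUP BY region",
    "latency by deploy version": "SELECT version, avg(duration_ms) FROM spans GROUP BY version",
    "retry storms by upstream": "SELECT upstream, sum(retries) FROM spans GROUP BY upstream",
}

def run_query(sql: str) -> str:
    # Placeholder: a real implementation would submit the SQL to the
    # lake's query engine; seconds-scale latency is acceptable here.
    return f"rows for: {sql[:24]}..."

# Fan out every hypothesis at once. Because compute is provisioned per
# query, the burst does not degrade anyone else's workload.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(HYPOTHESES, pool.map(run_query, HYPOTHESES.values())))
```

Each result either confirms or eliminates a hypothesis, and the agent iterates from there.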

The separation of storage and compute is what makes this practical. In a data lake architecture, storage is durable and shared on object storage, while compute is provisioned independently on demand. Query engines can scale horizontally to handle bursts of concurrent queries, and each query gets isolated resources so a runaway analysis does not starve other work. This means you can support both a human doing quick lookups and an agent running deep investigations against the same underlying data, without one degrading the other.

What are the trade-offs of data lake observability vs. traditional SaaS?

The most immediate trade-off is query latency. Traditional observability platforms are optimized for the dashboard experience: type a query, see results in under a second. Data lake architectures typically deliver results in seconds to minutes for complex queries, depending on the volume of data scanned and the compute resources allocated. For a human staring at a pager alert at 3 AM, that difference matters. For an agent methodically working through a diagnostic playbook, it does not.

This latency gap is narrowing. Modern query engines with predicate pushdown, columnar storage optimizations, and intelligent metadata layers can eliminate large amounts of data without reading it. For example, a well-partitioned Iceberg table can use file-level statistics to skip entire partitions, meaning a query that logically spans terabytes might physically read only megabytes. But the gap has not closed entirely, and for use cases that require true real-time dashboards with instant drill-down, traditional indexed databases still have an edge.
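The file-skipping mechanism described above can be sketched in a few lines. This is a simplified model of what table formats like Iceberg do: each data file carries per-column min/max statistics in metadata, and the planner reads only files whose value range can overlap the query predicate. The file names and timestamp ranges here are invented.

```python
# Simplified model of metadata-based file skipping: each data file
# records min/max stats for a timestamp column, and planning keeps only
# files whose range can overlap the query's time window.
files = [
    {"path": "p1.parquet", "ts_min": 100, "ts_max": 199},
    {"path": "p2.parquet", "ts_min": 200, "ts_max": 299},
    {"path": "p3.parquet", "ts_min": 300, "ts_max": 399},
]

def files_to_scan(files, ts_lo, ts_hi):
    """Keep only files whose [ts_min, ts_max] overlaps [ts_lo, ts_hi]."""
    return [f["path"] for f in files
            if f["ts_max"] >= ts_lo and f["ts_min"] <= ts_hi]

print(files_to_scan(files, 250, 260))  # only p2.parquet needs to be read
```

Scale the same idea to millions of files and you get queries that logically span terabytes while physically reading megabytes.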

Cost structure is another area where the trade-off is more nuanced than it first appears. Data lake storage is dramatically cheaper per gigabyte, but you still pay for compute at query time. If you run a small number of queries over large datasets, the economics are very favorable. If you run an enormous number of queries continuously, the compute costs can add up. The key difference is that storage costs are predictable (you know how much data you have), while query costs depend on usage patterns. This contrasts with traditional platforms where you pay per data point ingested regardless of whether anyone ever queries it.

Control and data ownership represent a significant advantage for the data lake approach. Your telemetry lives in your own infrastructure, in open formats that any compatible tool can read. You are not locked into a vendor's proprietary query language or data format. If you want to switch query engines, run custom analysis with a data science tool, or feed the data into a machine learning pipeline, the data is already there in a standard format. With traditional SaaS observability, your data lives in the vendor's infrastructure, accessible only through their APIs and query interfaces.

Operational complexity is the primary disadvantage. Running a data lake for observability is not as simple as pointing a collector at an S3 bucket. You need an ingestion pipeline that can handle streaming data at scale, a table format that supports concurrent writes and reads, continuous compaction to prevent the small files problem from degrading query performance, metadata management, data retention policies, and a query engine that understands how to efficiently scan columnar data. Traditional SaaS platforms handle all of this for you. With a data lake, someone has to build and maintain these systems, whether that is your team, an open-source community, or a vendor that provides a managed data lake experience.

The operational burden is real and should not be underestimated. Teams that have built real-time observability on data lakes report that table maintenance alone -- compaction, snapshot expiration, orphan file cleanup, and retention enforcement -- requires continuous engineering attention. One team found that off-the-shelf compaction solutions could not keep up with their streaming write volume and had to build a custom, event-driven compaction system. This is the kind of hidden complexity that makes data lake observability powerful but demanding.
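To make the compaction problem concrete, here is a toy sketch of the planning step, assuming a simple size-based strategy: greedily group small files into rewrite batches near a target file size. Real maintenance tooling (for example Iceberg's rewrite-data-files procedures) is far more involved, handling concurrent writers, deletes, and partition boundaries.

```python
# Toy compaction planner: greedily bin-pack small files into rewrite
# groups near a target size. A simplified illustration only -- real
# systems must also handle concurrency, deletes, and partitioning.
TARGET_BYTES = 128 * 1024 * 1024  # a common Parquet target file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group (name, size) pairs into compaction batches near the target size."""
    groups, current, size = [], [], 0
    for name, nbytes in file_sizes:
        current.append(name)
        size += nbytes
        if size >= target:
            groups.append(current)
            current, size = [], 0
    if len(current) > 1:  # a lone leftover file needs no rewrite
        groups.append(current)
    return groups
```

Even this toy version hints at the hard part: with streaming ingestion, new small files arrive faster than batch jobs can rewrite them, which is why some teams end up building event-driven compaction.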

For many organizations, the right answer is not choosing one approach over the other, but understanding where each excels. Traditional indexed platforms remain strong for real-time alerting and interactive debugging. Data lakes excel at deep analysis, long-term retention, agent-driven investigation, and any use case where you need to ask questions you did not anticipate. The architecture of modern observability is increasingly a combination of both, with OpenTelemetry providing a common instrumentation layer that can send data to multiple destinations simultaneously.

Where to start

  • Assess your current data retention costs: Calculate what you're paying per GB/month for long-term telemetry storage in your current platform.
  • Evaluate Apache Iceberg: Set up a proof of concept with Parquet files on S3 and an Iceberg catalog to understand the query experience.
  • Set up an OTel Collector for dual-write: Send telemetry to both your current platform and a data lake simultaneously during evaluation.
  • Consider agent-friendly architectures: Use platforms like Firetiger that are built on data lakes from the ground up, designed for AI agent query patterns rather than human dashboard patterns.
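The dual-write step can be sketched as an OpenTelemetry Collector configuration. This assumes the opentelemetry-collector-contrib distribution, which includes an `awss3` exporter for writing telemetry to a bucket; the endpoint, region, and bucket name below are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/vendor:            # existing SaaS platform (placeholder endpoint)
    endpoint: vendor.example.com:4317
  awss3:                  # contrib exporter writing to the lake's landing bucket
    s3uploader:
      region: us-east-1
      s3_bucket: my-telemetry-lake

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/vendor, awss3]  # same data to both destinations
```

Running both destinations side by side lets you compare query experience and cost on identical data before committing to either.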

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.