
What is Apache Iceberg?

Apache Iceberg is an open table format designed for large analytic datasets. It sits between your query engine and your object storage (such as Amazon S3), providing a structured metadata layer that makes it possible to run SQL queries against files stored on cheap, durable storage. Think of it as the organizational system that turns a pile of data files into something that behaves like a proper database table, with schemas, partitions, and transactional guarantees.

Before Iceberg, querying data on object storage meant dealing with fragile conventions: directory structures that encoded partitions, schema changes that broke readers, and no real way to handle concurrent writes safely. Iceberg replaces these conventions with a formal specification. Every table has a metadata tree that tracks which data files belong to which snapshot, what the schema looks like, and how the data is partitioned. This metadata is itself stored as files on object storage, meaning the entire table, data and metadata alike, lives in a format that any compatible engine can read.
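The metadata tree described above can be sketched in miniature. This is a deliberately simplified model, not Iceberg's actual on-disk format (real metadata lives in JSON and Avro files on object storage, and the class and field names here are illustrative), but it shows the shape of the hierarchy: table metadata points at snapshots, each snapshot points at manifests, and manifests list the data files.

```python
from dataclasses import dataclass

# Hypothetical, simplified model of Iceberg's metadata tree.
# Real metadata files are JSON/Avro on object storage.

@dataclass
class DataFile:
    path: str
    record_count: int

@dataclass
class Manifest:              # a manifest lists a group of data files
    data_files: list

@dataclass
class Snapshot:              # one committed version of the table
    snapshot_id: int
    manifests: list          # stands in for the "manifest list" level

@dataclass
class TableMetadata:
    schema: dict
    snapshots: list
    current_snapshot_id: int

    def current_files(self):
        """Walk the tree: metadata -> snapshot -> manifests -> data files."""
        snap = next(s for s in self.snapshots
                    if s.snapshot_id == self.current_snapshot_id)
        return [f.path for m in snap.manifests for f in m.data_files]

table = TableMetadata(
    schema={"service": "string", "ts": "timestamp"},
    snapshots=[
        Snapshot(1, [Manifest([DataFile("s3://bucket/a.parquet", 100)])]),
        Snapshot(2, [Manifest([DataFile("s3://bucket/a.parquet", 100),
                               DataFile("s3://bucket/b.parquet", 50)])]),
    ],
    current_snapshot_id=2,
)
print(table.current_files())  # both files are visible in snapshot 2
```

Because every snapshot is preserved in the tree until it is expired, pointing `current_snapshot_id` at an older snapshot yields the table as it existed then, which is the basis for Iceberg's time-travel reads.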

Iceberg has become one of the most widely adopted open table formats, alongside alternatives like Databricks Delta Lake and Apache Hudi, because it solves a practical problem: how do you get the analytical power of a data warehouse without paying data warehouse prices or accepting vendor lock-in? By storing data as Parquet files on object storage and layering Iceberg's metadata on top, organizations can use any compatible query engine (Spark, Trino, DuckDB, Snowflake, and many others) to access the same underlying data.

Why is Apache Iceberg used for observability data?

Observability generates enormous volumes of data. A moderately sized production environment can produce gigabytes of logs, metrics, and traces per hour. Storing this data in a traditional indexed observability database becomes expensive quickly, especially when you want to retain weeks or months of history for root cause analysis and trend detection.

Iceberg's columnar storage model, built on Parquet files, is particularly well-suited to observability queries. Telemetry data tends to be wide (many fields per record) but queries typically touch only a few columns at a time. A columnar layout means the query engine reads only the columns it needs, skipping everything else. When you combine this with Iceberg's file-level statistics and predicate pushdown, the query engine can determine which data files might contain relevant rows without reading them. A query for errors in a specific service over the last hour might skip 99% of the stored data entirely, reading only the few files that could possibly match.
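File skipping via statistics can be illustrated with a small simulation. Iceberg manifests record per-file column statistics (such as min/max value bounds); a planner compares those bounds against the query predicate and discards files that cannot possibly match. The file names, timestamps, and the `svc` value set below are invented for the example.

```python
# Simulated per-file statistics, as Iceberg manifests record them.
# A file is skipped when its stats prove it cannot match the predicate.
files = [
    {"path": "f1.parquet", "ts_min": 100, "ts_max": 199, "svc": {"api", "web"}},
    {"path": "f2.parquet", "ts_min": 200, "ts_max": 299, "svc": {"worker"}},
    {"path": "f3.parquet", "ts_min": 300, "ts_max": 399, "svc": {"api"}},
]

def prune(files, ts_lo, ts_hi, service):
    """Keep only files that might contain rows matching the predicate."""
    kept = []
    for f in files:
        # The file's [min, max] range must overlap the queried time window,
        # and the queried service must appear in the file's value set.
        overlaps = f["ts_min"] <= ts_hi and f["ts_max"] >= ts_lo
        if overlaps and service in f["svc"]:
            kept.append(f["path"])
    return kept

# "service 'api' between ts 300 and 350": only f3 can match.
print(prune(files, 300, 350, "api"))  # ['f3.parquet']
```

Note that pruning is conservative: a kept file *might* contain matching rows, so the engine still applies the predicate row-by-row within it. The win is that the skipped files are never fetched from storage at all.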

Partition evolution is another feature that makes Iceberg attractive for observability workloads. In traditional partitioned storage, changing your partition scheme (say, from daily to hourly partitions as your data volume grows) requires rewriting all existing data. Iceberg handles partition evolution transparently: you change the partition spec going forward, and old data remains organized under the old scheme while new data follows the new one. The query engine understands both and plans accordingly. For observability data, where access patterns evolve as systems grow and teams learn what questions matter, this flexibility is valuable. Firetiger chose Apache Iceberg for its observability data lake because of its open format, partition evolution support, and predicate pushdown capabilities, enabling AI agents to query billions of telemetry records efficiently.
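The "engine understands both schemes" behavior can be sketched as a planner that handles files written under either partition spec. In real Iceberg each manifest records the ID of the partition spec its files were written under; the spec names, paths, and partition values below are hypothetical.

```python
# Files written under two partition specs after partition evolution:
# old files partitioned by day, newer files by day and hour.
files = [
    {"path": "old1.parquet", "spec": "day",  "partition": {"day": "2024-01-01"}},
    {"path": "old2.parquet", "spec": "day",  "partition": {"day": "2024-01-02"}},
    {"path": "new1.parquet", "spec": "hour", "partition": {"day": "2024-01-03", "hour": 14}},
    {"path": "new2.parquet", "spec": "hour", "partition": {"day": "2024-01-03", "hour": 15}},
]

def plan(files, day, hour=None):
    """Select files, under either spec, that could hold rows for (day, hour)."""
    out = []
    for f in files:
        if f["partition"]["day"] != day:
            continue
        # Old daily files carry no hour value, so an hour predicate cannot
        # prune them: they must be scanned in full if their day matches.
        if f["spec"] == "day" or hour is None or f["partition"]["hour"] == hour:
            out.append(f["path"])
    return out

print(plan(files, "2024-01-03", hour=14))  # ['new1.parquet']
print(plan(files, "2024-01-01", hour=14))  # ['old1.parquet'] (daily granularity)
```

The second query shows the practical tradeoff: evolution avoids rewriting old data, but old files only prune at their original granularity until (or unless) they are rewritten.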

The open format is equally important. Observability data locked in a proprietary system can only be queried through that system's interfaces. With Iceberg, the same telemetry data can be queried by an operational agent investigating an incident, analyzed by a data scientist looking for long-term reliability trends, or fed into a machine learning pipeline for anomaly detection. There is no need to export, transform, or duplicate the data. Multiple tools read the same files, each bringing their own strengths.

Schema evolution rounds out the picture. Observability data schemas change constantly. New services add new log fields, tracing libraries introduce new span attributes, and infrastructure upgrades produce new metric dimensions. Iceberg supports adding, dropping, renaming, and reordering columns without rewriting existing data. Old data simply has null values for new columns, and queries handle this transparently. In a domain where schema rigidity would mean either losing data or constantly running migrations, this is a significant practical advantage.
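The "old data simply has null values for new columns" behavior amounts to projecting stored rows onto the current schema. A minimal sketch, with invented field names (`trace_id` standing in for a span attribute added after some data was already written):

```python
# Two versions of a table schema: a column was added after v1 data existed.
schema_v1 = ["ts", "service", "message"]
schema_v2 = ["ts", "service", "message", "trace_id"]  # added later

old_row = {"ts": 1, "service": "api", "message": "boom"}  # written under v1
new_row = {"ts": 2, "service": "api", "message": "ok", "trace_id": "abc"}

def project(row, schema):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {col: row.get(col) for col in schema}

print(project(old_row, schema_v2))
# {'ts': 1, 'service': 'api', 'message': 'boom', 'trace_id': None}
```

Iceberg makes this safe by tracking columns with stable field IDs rather than by name or position, which is why renames and reorders also work without touching existing files; this sketch only models the simplest case of an added column.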

What are the challenges of using Apache Iceberg for real-time data?

The fundamental tension in using Iceberg for real-time observability is the small files problem. Iceberg tables are collections of data files on object storage. When you write data frequently in small batches, as a streaming observability pipeline must, you create many small files. Each file carries fixed overhead: the query engine must read its metadata, open a connection to storage, and process its contents. A table with thousands of tiny files becomes dramatically slower to query than one with the same data compacted into a few large files.

This is not a theoretical concern. Teams building real-time observability on Iceberg consistently report that the small files problem is the gating factor on adoption. The faster you write data to keep it fresh, the more fragmented the table becomes. One engineering team described it as an inescapable tradeoff: great write performance demands many small files, while great read performance demands fewer large files. You cannot optimize for both at write time; you must clean up afterward.
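A back-of-the-envelope cost model makes the tradeoff concrete. The constants below (per-file fixed cost, scan throughput) are assumptions chosen for illustration, not measurements, but the structure holds regardless of the exact numbers: scan time is fixed by data volume, so file count dominates once files get small.

```python
# Toy query-cost model: fixed per-file overhead (open, read metadata)
# plus scan time proportional to data volume. Constants are assumed.
def query_seconds(num_files, total_gb, per_file_overhead_s=0.02, gb_per_s=1.0):
    return num_files * per_file_overhead_s + total_gb / gb_per_s

same_data = 100  # 100 GB of telemetry, stored two ways
fragmented = query_seconds(num_files=100_000, total_gb=same_data)  # ~1 MB files
compacted  = query_seconds(num_files=200,     total_gb=same_data)  # ~512 MB files

print(f"fragmented: {fragmented:.0f}s, compacted: {compacted:.0f}s")
# fragmented: 2100s, compacted: 104s
```

Under these assumed constants, the same 100 GB is roughly 20x slower to query when split across 100,000 files, and nearly all of that time is per-file overhead rather than actual data scanning.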

The cleanup process is called compaction: reading small files and rewriting their contents into larger, better-organized files. Compaction is not optional for streaming Iceberg workloads. It is a continuous background maintenance operation that must keep pace with the ingest rate. If compaction falls behind, query performance degrades. If it runs too aggressively, it competes with ingest and query workloads for resources.
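At its core, a compaction planner is a bin-packing problem: group small files so that each group can be rewritten as one file near a target size. The sketch below shows a simple greedy version; the 512 MB target and the file-size mix are hypothetical, and production compactors also weigh factors this ignores (partition boundaries, sort order, delete files).

```python
TARGET_MB = 512  # assumed target output file size

def plan_compaction(file_sizes_mb, target=TARGET_MB):
    """Greedily group files so each group totals close to the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target:
            groups.append(current)          # flush the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [8] * 100 + [64] * 4  # 100 tiny files plus a few medium ones
groups = plan_compaction(small_files)
print(len(small_files), "files ->", len(groups), "rewritten files")
# 104 files -> 3 rewritten files
```

Each group then becomes one rewrite task: read the small files, write one large file, and commit a snapshot that swaps the new file in for the old ones, all without blocking concurrent reads.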

Writing to Iceberg has historically been complex in its own right. For much of Iceberg's life, producing data to an Iceberg table required Apache Spark or another heavyweight distributed processing framework. One team described it as "basically impossible" to write to Iceberg without Spark in the early days. The ecosystem has since expanded, with lighter-weight writers emerging in Go, Rust, Python, and other languages, but the tooling gap meant that early adopters of Iceberg for real-time use cases had to build significant custom infrastructure.

Beyond compaction, streaming Iceberg tables require several other maintenance operations. Snapshot expiration prevents unbounded growth of table metadata: every write creates a new snapshot, and old snapshots must be periodically cleaned up so that the metadata layer itself does not become a bottleneck. Orphan file cleanup removes data files that are no longer referenced by any snapshot but still occupy storage. Data retention enforcement deletes data that has aged past its useful life. Each of these operations must run reliably and continuously. Neglecting any one of them leads to gradual degradation that can be difficult to diagnose.
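Snapshot expiration and orphan detection interact in a subtle way: a data file may only be deleted when *no* retained snapshot references it, even if the snapshots that introduced it have expired. A minimal sketch of that reference-counting logic, with invented snapshot IDs, timestamps, and file names:

```python
# Hypothetical snapshot history: each snapshot references a set of data files.
snapshots = [
    {"id": 1, "ts": 100, "files": {"a.parquet"}},
    {"id": 2, "ts": 200, "files": {"a.parquet", "b.parquet"}},
    {"id": 3, "ts": 300, "files": {"b.parquet", "c.parquet"}},  # current
]

def expire(snapshots, keep_after_ts):
    """Expire snapshots older than the retention cutoff; a file is deletable
    only if no surviving snapshot still references it."""
    live = [s for s in snapshots if s["ts"] >= keep_after_ts]
    expired = [s for s in snapshots if s["ts"] < keep_after_ts]
    still_referenced = set().union(*(s["files"] for s in live))
    expired_files = set().union(*(s["files"] for s in expired)) if expired else set()
    return live, expired_files - still_referenced

live, deletable = expire(snapshots, keep_after_ts=250)
print([s["id"] for s in live], deletable)  # [3] {'a.parquet'}
```

With a cutoff of 250, snapshots 1 and 2 expire, but only `a.parquet` becomes deletable: `b.parquet` is still referenced by the current snapshot. Getting this wrong in either direction means corrupted time-travel reads or silent storage leakage, which is why these maintenance jobs need the same operational rigor as the ingest path.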

Some teams have found that off-the-shelf solutions for these maintenance tasks cannot keep up with observability-scale streaming workloads. For example, one organization tried using a cloud provider's built-in compaction service but found it could not keep up with the volume and shape of their writes. They ended up building a custom, event-driven compaction system that consumed object storage notifications for new writes and planned compaction jobs in parallel on serverless compute.
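The event-driven pattern described above can be sketched as a trigger that accumulates object-storage "new file" notifications per partition and emits a compaction job once a partition crosses a threshold. The class, threshold, and partition/path names are invented for illustration; a real system would also trigger on elapsed time, not just file count.

```python
from collections import defaultdict

class CompactionTrigger:
    """Accumulates new-file notifications and emits compaction jobs."""
    def __init__(self, threshold=50):   # assumed files-per-job threshold
        self.threshold = threshold
        self.pending = defaultdict(list)  # partition -> new file paths

    def on_notification(self, partition, path):
        """Handle one object-storage create event; return a job when ready."""
        self.pending[partition].append(path)
        if len(self.pending[partition]) >= self.threshold:
            return {"partition": partition,
                    "files": self.pending.pop(partition)}
        return None

trigger = CompactionTrigger(threshold=3)
jobs = [j for i in range(7)
        if (j := trigger.on_notification("hour=14", f"f{i}.parquet"))]
print(len(jobs), "jobs planned")  # 2 jobs of 3 files; 1 file still pending
```

Because each emitted job covers a disjoint set of files within one partition, many such jobs can run in parallel on serverless compute without conflicting commits.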

The landscape is improving. Specialized streaming ingest pipelines now handle the complexity of writing to Iceberg efficiently, merging concurrent writes into fewer commits to reduce file proliferation at the source. Managed Iceberg services from cloud providers and data platform vendors are beginning to handle compaction and maintenance automatically. But for now, teams adopting Iceberg for real-time observability should expect to invest meaningful engineering effort in the write path and maintenance layer, or choose a vendor that has already solved these problems.

Despite these challenges, the trajectory is clear. The cost and flexibility advantages of Iceberg-based observability are compelling enough that teams are willing to invest in solving the real-time challenges. The small files problem, compaction, and write-path complexity are engineering problems with known solutions, not fundamental limitations. As the ecosystem matures, the barrier to entry will continue to drop, making Iceberg-based observability accessible to a broader range of organizations.

Where to start

  • Understand your query patterns: Determine whether your observability workload is primarily dashboard-driven (real-time) or investigation-driven (ad-hoc analytical queries).
  • Start with historical data: Move older telemetry data to an Iceberg-based store first, keeping recent data in your existing platform for real-time dashboards.
  • Plan for compaction: If writing data at high frequency, ensure you have a compaction strategy to prevent the small files problem from degrading query performance.
  • Evaluate managed Iceberg services: Look at platforms that handle the write complexity and compaction for you rather than building a custom Spark pipeline.

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.