How do you choose an observability platform?
Choosing an observability platform is one of the highest-stakes infrastructure decisions an engineering organization makes. The platform you choose determines how you detect incidents, how you investigate production issues, how much you pay for visibility into your own systems, and how difficult it will be to change course if the choice turns out to be wrong. Unlike most infrastructure decisions, observability platform choices tend to be sticky: the longer you use a platform, the more dashboards, alerts, runbooks, and institutional knowledge accumulate around it, making migration increasingly expensive.
The market has also changed substantially in the last few years. What was once a straightforward choice between a few SaaS vendors and self-hosted open source has expanded into a landscape with multiple distinct categories, each with different tradeoffs around cost, control, operational burden, and capabilities. Understanding these categories is the first step toward making an informed decision.
Most organizations approach this decision either when they are starting from scratch (a new company or a greenfield platform team) or when their current platform's costs have become unsustainable. The second scenario is far more common. A team that chose Datadog or New Relic three years ago often finds that their bill has grown 3-5x as their infrastructure scaled, and they are now spending more on observability than on the infrastructure being observed. The evaluation process for these two starting points is different: greenfield teams can optimize for the right architecture from the start, while migration teams must also account for the cost and risk of transitioning.
What are the main categories of observability platforms?
The observability market has stratified into five distinct categories, each representing a different set of tradeoffs. Understanding where each platform sits helps narrow the field before getting into feature-by-feature comparisons.
SaaS platforms are the most established category. Datadog, New Relic, and Splunk (now part of Cisco) are the dominant players. These platforms handle everything: you install an agent or SDK, telemetry flows to the vendor's infrastructure, and you get dashboards, alerting, APM, log management, and increasingly AI-powered features through a browser interface. The advantage is zero operational burden for the observability infrastructure itself. The disadvantage is cost at scale and data portability: your telemetry lives in the vendor's proprietary systems, and extracting it to move elsewhere ranges from painful to impossible.
Datadog has become the default choice for many teams because of its breadth. It covers infrastructure monitoring, APM, log management, real user monitoring, synthetic monitoring, security monitoring, and more, all in a single platform. This breadth is genuinely valuable because it means a single pane of glass across signal types. But it comes at a price, and that price increases with every new feature you enable and every new host or container you add.
New Relic took a different approach to market positioning by offering a generous free tier and per-GB pricing that initially appears cheaper than Datadog's per-host model. For smaller teams or those with lower data volumes, New Relic can be significantly less expensive. At scale, however, the per-user fees and per-GB costs add up, and organizations with high data volumes often end up in a cost position similar to Datadog's.
Splunk historically dominated the log management space and has expanded into broader observability. Its strength is in log analytics and search, particularly for organizations with compliance or security use cases that require long-term log retention. Its pricing, traditionally based on daily ingestion volume, has been a persistent source of frustration for customers, though the acquisition by Cisco may shift this.
Open-source stacks represent the opposite end of the spectrum. The most common combination is Grafana for visualization, Prometheus for metrics, and Loki for logs, often supplemented with Tempo for distributed tracing and Mimir for long-term metrics storage. The software is free. The operational cost is not: running a production-grade Prometheus cluster with appropriate retention, high availability, and query performance requires significant platform engineering investment. Organizations that choose this path typically have a dedicated platform or infrastructure team with deep expertise in these tools.
The open-source path offers maximum control and zero vendor lock-in at the software layer. You own your data, you control your storage, and you can customize every component. The tradeoff is that you are responsible for everything that goes wrong: upgrades, scaling, data retention, query performance, and the inevitable 3 AM page when the monitoring system itself goes down. For teams with the expertise and headcount, this is a reasonable choice. For teams without a dedicated platform engineering function, it is a significant operational commitment.
Managed open-source attempts to capture the benefits of open source (familiar tooling, data portability, flexible architecture) while offloading operational burden to a vendor. Grafana Cloud is the leading example: it provides hosted versions of Grafana, Prometheus (via Mimir), Loki, and Tempo, with the vendor handling scaling, availability, and upgrades. You interact with the same open-source tools you already know, but you do not run the infrastructure.
The cost model for managed open-source sits between self-hosted (free software, expensive ops) and pure SaaS (expensive software, no ops). Grafana Cloud charges based on usage (metrics series, log volume, trace spans) but typically at lower per-unit rates than Datadog or New Relic because the underlying architecture uses cheaper storage and compute. The tradeoff is that the feature surface may be thinner than the all-in-one SaaS platforms, particularly around APM, real user monitoring, and out-of-the-box integrations.
BYOC (Bring Your Own Cloud) platforms keep your telemetry data in your own cloud account while the vendor provides the software and management layer. Chronosphere and Observe are examples of vendors in this space. The data lives in your S3 buckets or your cloud storage, and the vendor runs the query engine and management plane that operates on that data.
BYOC addresses two concerns that drive organizations away from SaaS: data sovereignty and cost transparency. When your data stays in your cloud account, you know exactly what storage costs, you control retention policies directly, and you eliminate concerns about sensitive data leaving your environment. The operational model is a middle ground: less work than self-hosted open source, but more involvement than pure SaaS because you are managing the storage layer.
Agent-driven platforms represent a newer category that takes a fundamentally different approach to what an observability platform does. Rather than focusing primarily on data ingestion, storage, and human-driven querying, agent-driven platforms use AI agents as the primary consumers of observability data. The agents observe production systems, detect anomalies, investigate root causes, and surface findings to humans, rather than requiring humans to query dashboards and write alert rules.
Firetiger operates in this category. Its agents read code changes, generate monitoring plans, query telemetry data, and produce investigation reports. The platform charges based on reliability outcomes, not data volume, which aligns the vendor's incentives with the customer's goal: better production reliability rather than more data ingestion.
How do pricing models differ across observability vendors?
Pricing is often the primary driver of platform evaluations, and the differences between vendor pricing models are substantial enough to create order-of-magnitude cost differences at scale.
Datadog: per-host plus per-feature. Datadog's base infrastructure monitoring starts at $15/host/month on an annual commitment, with on-demand and enterprise tiers running higher (roughly $18-23/host/month), but this covers only basic infrastructure metrics. APM adds $31/host/month. Log management charges per ingested GB ($0.10/GB for ingestion) plus retention costs. Custom metrics are charged per metric ($0.05/custom metric/month at higher volumes). Real user monitoring, synthetic monitoring, security monitoring, and other products each add their own per-unit charges. The result is that the total cost per host can be several multiples of the base price once all the features a team actually needs are enabled. A common surprise for Datadog customers is the custom metrics bill: a team that instruments liberally (as good engineering practice suggests) can generate thousands of custom metrics that quietly inflate costs.
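To make the per-host-plus-per-feature model concrete, here is an illustrative monthly cost calculator using the list prices quoted above. The fleet size, custom metric count, and log volume in the usage example are hypothetical, and real bills include retention and other line items this sketch omits.

```python
def per_host_plus_feature_monthly_cost(
    hosts: int,
    custom_metrics: int,
    log_gb_ingested: float,
    infra_per_host: float = 15.0,    # base infrastructure monitoring
    apm_per_host: float = 31.0,      # APM add-on
    per_custom_metric: float = 0.05, # per custom metric per month
    log_ingest_per_gb: float = 0.10, # log ingestion only, excludes retention
) -> dict:
    """Break a hypothetical monthly bill into line items."""
    items = {
        "infrastructure": hosts * infra_per_host,
        "apm": hosts * apm_per_host,
        "custom_metrics": custom_metrics * per_custom_metric,
        "log_ingestion": log_gb_ingested * log_ingest_per_gb,
    }
    items["total"] = sum(items.values())
    return items

# A hypothetical 200-host fleet with liberal instrumentation:
bill = per_host_plus_feature_monthly_cost(
    hosts=200, custom_metrics=50_000, log_gb_ingested=10_000
)
for line, cost in bill.items():
    print(f"{line:>15}: ${cost:,.2f}")
```

In this hypothetical fleet, the custom metrics line ($2,500) is larger than the entire base infrastructure bill ($3,000 would require 167 more hosts to match), which is exactly the "quiet inflation" described above.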
New Relic: per-GB plus per-user. New Relic charges $0.30/GB for data ingestion beyond a 100GB/month free tier, plus $549/user/month for full platform users (with a cheaper $99/user/month tier for limited-access users). This model is attractive for small teams with moderate data volumes but becomes expensive in two scenarios: organizations with high data volumes (common in microservice architectures) and organizations with many engineers who need full platform access. The per-user fee is particularly contentious because it creates a perverse incentive to restrict access to observability data, which is the opposite of what good incident response culture requires.
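The per-GB-plus-per-user model inverts which factor dominates as you scale. A sketch using the list prices quoted above (team sizes and data volumes are hypothetical):

```python
def per_gb_plus_user_monthly_cost(
    gb_ingested: float,
    full_users: int,
    free_gb: float = 100.0,       # monthly free ingestion tier
    per_gb: float = 0.30,         # ingestion beyond the free tier
    per_full_user: float = 549.0, # full platform user seat
) -> float:
    """Hypothetical monthly bill under a per-GB plus per-seat model."""
    billable_gb = max(0.0, gb_ingested - free_gb)
    return billable_gb * per_gb + full_users * per_full_user

# Seats dominate for a small team; ingestion dominates at high volume.
small_team = per_gb_plus_user_monthly_cost(gb_ingested=500, full_users=5)
large_org = per_gb_plus_user_monthly_cost(gb_ingested=100_000, full_users=40)
print(f"small team: ${small_team:,.2f}")
print(f"large org:  ${large_org:,.2f}")
```

For the small team, seats account for roughly 96% of the bill, which is why the per-user fee, not ingestion, drives the incentive to restrict platform access.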
Open-source: free software, expensive operations. The direct cost of Prometheus, Grafana, and Loki is zero. The real cost is the engineering time required to run them. A production Prometheus deployment needs capacity planning, disk management, federation or remote write for long-term storage, high-availability configuration, and ongoing maintenance. Estimates vary, but organizations typically find that the equivalent of 1-3 full-time engineers' time is consumed by operating their open-source observability stack. At an average fully-loaded engineering cost of $200-300K/year, this means the "free" stack costs $200K-900K/year in engineering time, before cloud infrastructure costs for the storage and compute it runs on.
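The back-of-envelope arithmetic behind that range, using the FTE and fully-loaded cost figures above (cloud infrastructure costs excluded, as in the text):

```python
def self_hosted_ops_cost_range(
    fte_low: int = 1, fte_high: int = 3,          # engineers running the stack
    loaded_low: int = 200_000, loaded_high: int = 300_000,  # $/engineer/year
) -> tuple:
    """Annual engineering cost of operating a 'free' observability stack."""
    return fte_low * loaded_low, fte_high * loaded_high

low, high = self_hosted_ops_cost_range()
print(f"engineering time: ${low:,} - ${high:,} per year")
```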
Grafana Cloud: usage-based with lower unit costs. Grafana Cloud charges per active metrics series, per GB of log data, and per trace span. The per-unit costs are generally lower than Datadog or New Relic, partly because the underlying architecture (based on open-source components optimized for cloud storage) has lower infrastructure costs. For many organizations, Grafana Cloud represents a 40-60% cost reduction compared to Datadog at equivalent data volumes.
Firetiger: outcome-based, not volume-based. Firetiger's pricing is tied to the reliability improvements its agents deliver rather than the volume of telemetry data ingested. This eliminates the perverse incentive that volume-based pricing creates, where using the platform more (adding more metrics, more tags, more traces) directly increases cost. Outcome-based pricing aligns what you pay with what you get: better production reliability. It also removes the cardinality anxiety that plagues teams on per-metric platforms, because there is no cost penalty for high-cardinality data that agents need for thorough investigation.
The hidden cost: cardinality. Across all volume-based pricing models, cardinality is the single largest driver of unexpected cost growth. Each unique combination of metric name and tag values creates a new time series, and each time series costs money. In Kubernetes environments, tags like pod_name and container_id generate thousands of ephemeral time series that inflate bills without providing proportional value. When comparing platforms, teams should model their cardinality profile against each vendor's pricing to get an accurate cost estimate, not just multiply hosts by the per-host rate.
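A sketch of why cardinality, rather than host count, drives volume-based bills. In the worst case, each unique combination of tag values becomes a separate billable time series, so the count multiplies across tags; the tag cardinalities below are hypothetical but typical of a Kubernetes cluster.

```python
from math import prod

def worst_case_series_count(tag_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of tag cardinalities."""
    return prod(tag_cardinalities.values())

stable_tags = {"service": 30, "env": 3, "region": 4}
ephemeral_tags = {**stable_tags, "pod_name": 500}  # ephemeral Kubernetes tag

print(worst_case_series_count(stable_tags))     # 360
print(worst_case_series_count(ephemeral_tags))  # 180000
```

Adding a single 500-value ephemeral tag multiplies the worst-case series count 500x. This is why modeling your cardinality profile against each vendor's per-series pricing matters more than multiplying hosts by the per-host rate.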
What should you evaluate when switching observability platforms?
Whether you are selecting your first platform or migrating from an existing one, several evaluation criteria distinguish a decision you will live with comfortably from one you will regret.
Data portability and OpenTelemetry support. The most important architectural question is whether the platform supports OpenTelemetry (OTel) natively. OTel has become the industry standard for telemetry instrumentation, providing vendor-neutral SDKs and a collector that can send data to any compatible backend. A platform that requires a proprietary agent or SDK creates lock-in at the instrumentation layer, which is the most expensive layer to change. If your application code is instrumented with OTel, you can switch backends without re-instrumenting, which dramatically reduces migration cost and risk. Every major platform now supports OTel to varying degrees, but the depth of support varies. Check whether the platform supports OTel-native ingestion (not just a compatibility shim), whether it preserves the full richness of OTel data (including resource attributes and semantic conventions), and whether its query capabilities work well with OTel data structures.
Total cost at your scale. Vendor pricing pages are designed to look attractive. Reality often differs. The only reliable way to estimate cost is to model your actual telemetry profile (number of hosts, custom metrics count, cardinality, log volume, trace volume, number of users) against each vendor's pricing model. Ask vendors for detailed quotes based on your actual numbers, not estimates based on their pricing tiers. Pay particular attention to the cost of the features you actually need: a platform that is cheap for metrics but expensive for logs and traces is not actually cheap if you need all three.
Cardinality handling. How does the platform handle high-cardinality data? Some platforms charge per time series, making cardinality directly expensive. Others aggregate or sample high-cardinality data, which saves money but loses detail. Others store data in columnar formats that handle cardinality efficiently. Your choice here depends on your debugging needs: if your team regularly needs to filter by high-cardinality dimensions (specific customer IDs, request IDs, or container names) during incident investigation, a platform that penalizes cardinality will either cost more or force you to drop the data you need most.
Migration support and dual-write capability. If you are switching platforms, the transition period is the riskiest phase. The industry-standard approach is dual-writing: using an OpenTelemetry Collector to simultaneously send telemetry to both your current platform and the candidate replacement. This allows you to validate that the new platform receives data correctly, that queries return expected results, and that dashboards and alerts can be recreated, all without losing visibility in your existing system. Evaluate whether the candidate platform supports seamless dual-write ingestion and whether the vendor provides migration tooling or support. A typical dual-write evaluation period is 2-4 weeks, long enough to encounter representative traffic patterns and at least one on-call rotation.
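A dual-write setup can be expressed as a minimal OpenTelemetry Collector configuration with two OTLP exporters, one per backend, attached to the same pipelines. This is a sketch; the endpoint hostnames are placeholders for your current vendor and the candidate, and each vendor documents its own required endpoint and authentication headers.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/current:
    endpoint: https://otlp.current-vendor.example.com  # placeholder endpoint
    headers:
      api-key: ${env:CURRENT_VENDOR_KEY}
  otlphttp/candidate:
    endpoint: https://otlp.candidate-vendor.example.com  # placeholder endpoint
    headers:
      api-key: ${env:CANDIDATE_VENDOR_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/current, otlphttp/candidate]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/current, otlphttp/candidate]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/current, otlphttp/candidate]
```

Because both exporters receive identical data, you can compare dashboards and alert behavior side by side during the 2-4 week evaluation window, then delete one exporter when you commit to a platform.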
AI and agent capabilities. The observability market is in the middle of a significant shift toward AI-driven features. Every major vendor has shipped or announced AI capabilities, but the depth and utility vary enormously. Some offer AI-powered log summarization or anomaly detection. Others provide agent-based investigation that can autonomously query data, form hypotheses, and produce root cause analyses. Evaluate whether the platform's AI features are genuinely useful for your workflows or whether they are marketing checkboxes. Ask specific questions: can the AI agent investigate an incident end-to-end without human prompting? Can it read a code change and generate targeted monitoring? Does it reduce on-call burden measurably?
BYOC option. If data sovereignty, compliance, or cost transparency are important to your organization, evaluate whether the platform offers a BYOC deployment model. Having your telemetry data stay in your own cloud account simplifies compliance, gives you direct control over storage costs, and ensures that you can access your data even if you later switch vendors. Not every organization needs BYOC, but for those that do, it is a hard requirement that eliminates many options.
Ecosystem and integration breadth. A platform that works beautifully in isolation but does not integrate with your CI/CD pipeline, incident management tools, communication platforms, or cloud provider is a platform that will live in a silo. Evaluate the breadth and depth of integrations, particularly with the tools your team uses daily. Check whether integrations are native or require third-party connectors, whether they are actively maintained, and whether they support bidirectional data flow (not just ingestion but also actions like creating incidents or triggering rollbacks).
Query experience for humans and agents. How well does the platform support ad-hoc investigation? During an incident, engineers need to slice and dice data quickly, pivot between metrics and logs and traces, and test hypotheses in real time. The query experience, including the query language, its expressiveness, the speed of results, and the ability to jump from a metric anomaly to related logs to a distributed trace, is a major determinant of mean time to resolution. Increasingly, this query experience also needs to work well for AI agents, which means supporting programmatic access, broad data exploration, and high-cardinality filtering.
The decision ultimately comes down to a small number of factors that matter most for your specific situation: cost at your current and projected scale, the operational burden you are willing to accept, the degree of vendor lock-in you can tolerate, and whether the platform's capabilities (especially around AI and automation) align with how your team actually works. No platform is best for everyone. The right choice is the one that fits your constraints, and that you can migrate away from if those constraints change.
Where to start
- Inventory your current telemetry profile: Measure your daily data volume (GB of logs, number of metrics, trace spans) and unique time series count.
- Model cost across 3 vendors: Get pricing estimates from your current vendor, one SaaS alternative, and one open-source or BYOC option at your actual data volumes.
- Set up an OTel Collector for dual-write: Send telemetry to both your current platform and a candidate simultaneously for 2-4 weeks before making a decision.
- Evaluate agent and AI capabilities: If autonomous monitoring matters to your team, test platforms like Firetiger that use AI agents for investigation, triage, and remediation.