What are SLOs, SLIs, and SLAs?
SLOs, SLIs, and SLAs are three related concepts that form a hierarchy for defining, measuring, and committing to software reliability. A Service Level Indicator (SLI) is a quantitative measurement of some aspect of system behavior -- for example, the percentage of HTTP requests that return a successful response within 200 milliseconds. A Service Level Objective (SLO) is an internal target set against that indicator -- "99.5% of requests should succeed within 200ms over a rolling 30-day window." A Service Level Agreement (SLA) is a formal contractual commitment to a customer, typically with financial consequences (such as service credits) if the target is missed.
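The relationship between an SLI and an SLO can be made concrete with a few lines of code. This is a minimal sketch using made-up request data; the threshold and target values mirror the example above:

```python
# Hypothetical request records: (status_code, latency_ms)
requests = [(200, 120), (200, 180), (500, 90), (200, 250), (200, 150)]

# SLI: the fraction of requests that succeeded within 200 ms
good = sum(1 for status, latency in requests
           if status < 400 and latency <= 200)
sli = good / len(requests)

# SLO: an internal target set against that indicator
SLO_TARGET = 0.995  # 99.5% over a rolling 30-day window

print(f"SLI = {sli:.0%}, meets SLO: {sli >= SLO_TARGET}")
```

With this toy data, three of five requests are "good" (one failed outright, one was too slow), so the SLI is 60% and the SLO is missed. An SLA would wrap a contractual penalty around the same measurement.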
The distinction matters because the three operate at different levels of an organization. SLIs are technical measurements owned by engineering. SLOs are internal goals that align engineering effort with business priorities. SLAs are external promises that create legal and financial accountability. A well-functioning reliability program defines SLIs first, sets SLOs above the SLA threshold to provide a safety margin, and only commits to SLAs that the team is confident it can meet.
The concept was popularized by Google's Site Reliability Engineering (SRE) book, which framed SLOs as the foundation of a principled approach to reliability. Rather than pursuing "100% uptime" -- which is both impossible and economically irrational -- teams set explicit targets that balance reliability against the speed of feature development. If you are well within your error budget, you can ship faster and take more risks. If you are burning through your error budget, you slow down and invest in stability. This framework has been widely adopted across the industry, though as we will see, the implementation often falls short of the theory.
Why do traditional SLO implementations stall?
Despite broad agreement that SLOs are valuable, most organizations struggle to implement them effectively. Industry surveys consistently show that while the majority of engineering teams have heard of SLOs and believe they should adopt them, only a fraction have functioning SLO programs that meaningfully influence engineering decisions. The reasons fall into three categories.
Instrumentation complexity creates a high barrier to entry. Defining a good SLI requires deep knowledge of the system being measured. Which metric accurately reflects user experience? Where is it captured? Is the data reliable? For a seemingly simple SLI like "authentication success rate," an engineer must determine which telemetry source captures authentication events, how to distinguish legitimate failures from expected rejections (like incorrect passwords), how to handle bot traffic versus human traffic, and how to account for retries. One authentication platform found that seemingly abnormal traffic patterns were actually caused by open browser tabs continuously refreshing sessions, generating tens of thousands of requests per day from a single user. Defining the SLI correctly required understanding this behavior -- something that took weeks of investigation to uncover.
Multiply this complexity across every service and every customer-facing workflow, and the instrumentation burden becomes daunting. Teams often start with enthusiasm, define a few SLOs for their most critical services, and then stall because extending coverage to the rest of the system requires more engineering effort than they can spare.
Organizational alignment is harder than the technical work. SLOs are only useful if they influence decisions. That requires agreement across engineering, product, and leadership on what to measure, what the targets should be, and what happens when targets are missed. In practice, different stakeholders have different priorities. One large enterprise found that its observability initiative had multiple competing champions: one focused on reducing vendor costs, another on ensuring complete log coverage, and a third on improving developer self-service. Without alignment, the SLO program became a political football rather than a reliability tool.
Even when there is initial agreement, maintaining SLOs requires ongoing attention. Targets that made sense six months ago may be too lenient or too strict as the system evolves. New features introduce new failure modes that existing SLIs do not capture. The maintenance burden compounds over time, and SLO programs that are not actively tended become stale and ignored.
Global metrics create a false sense of security. Perhaps the most insidious failure mode is an SLO program that appears healthy while real problems go undetected. This happens when SLOs are defined at the aggregate level -- measuring system-wide success rates rather than per-customer or per-workflow outcomes. The math of aggregation makes this dangerous: if 99% of your traffic comes from healthy customers and 1% comes from a customer in crisis, your global SLO will look fine even as that customer churns. The aggregate hides the individual.
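The arithmetic of this masking effect is easy to demonstrate. The sketch below uses invented traffic numbers: one high-volume healthy customer and one small customer failing 40% of its requests:

```python
# Hypothetical per-customer traffic: (total_requests, failed_requests)
traffic = {
    "acme":    (99_000, 50),   # healthy, high-volume customer
    "initech": (1_000, 400),   # small customer in crisis: 40% failure rate
}

total_requests = sum(r for r, _ in traffic.values())
total_failures = sum(f for _, f in traffic.values())

# Global SLI: 450 failures out of 100,000 requests -> 99.55%,
# which comfortably passes a 99.5% SLO
global_sli = 1 - total_failures / total_requests

# Per-customer SLIs expose the problem the aggregate hides
per_customer = {name: 1 - f / r for name, (r, f) in traffic.items()}

print(f"global: {global_sli:.2%}")
for name, sli in per_customer.items():
    print(f"{name}: {sli:.2%}")
```

The global SLI reads 99.55% while initech sits at 60% -- every dashboard green, one customer churning.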
What are per-customer SLOs and why do they matter?
Per-customer SLOs address the aggregation problem by scoping reliability targets to individual customers rather than the system as a whole. Instead of measuring "99.9% of all API requests succeed," a per-customer SLO measures "99.9% of API requests from Customer X succeed" -- independently for every customer.
This distinction is critical for B2B software companies where revenue is concentrated among a relatively small number of accounts. One AI inference platform discovered that enterprise customers experiencing severe issues were invisible in global metrics because their request volume was a tiny fraction of total traffic. The failing requests were hiding in the aggregate 0.01% of errors -- statistically insignificant in the global view, but devastating for the individual customer experiencing them. The customer was having a terrible experience while every dashboard showed green.
The pattern repeats across B2B companies. An infrastructure company found that partner integration errors with specific third-party services affected only a subset of customers but created cascading failures in those customers' workflows. A developer tools company discovered that certain customers were being targeted by credential stuffing attacks that were invisible in aggregate authentication metrics but caused hundreds of compromised accounts for the affected organizations. In each case, per-customer SLOs would have surfaced the problem immediately, while global SLOs remained healthy.
Per-customer SLOs also change the economics of support and customer success. Without per-customer visibility, support teams rely on customers reporting problems -- by which point the customer is already frustrated and the damage is done. Multiple B2B companies describe the same reactive workflow: a customer files a ticket, a support engineer escalates to a product engineer, the product engineer spends hours investigating across multiple tools, and eventually the root cause is identified. This process can take days. With per-customer SLOs, the engineering team knows about the problem before the customer does, and can often resolve it before it is ever reported.
The challenge with per-customer SLOs is that they multiply complexity. If you have 500 customers and 10 SLOs per customer, you now have 5,000 targets to track. Manually configuring and maintaining this is not feasible. This is one of the areas where traditional SLO approaches break down most clearly, and where automation becomes essential rather than optional.
One developer tools platform found success by combining per-customer monitoring with automated investigation. When a per-customer SLO was violated, the system automatically identified whether the problem was systemic (affecting many customers) or isolated (affecting one). For isolated issues, it could often determine the root cause -- a misconfigured OAuth integration, a rate limiting violation, a credential stuffing attack -- without human intervention. This transformed the support model from reactive ticket triage to proactive issue resolution.
How can AI agents automate SLO management?
AI agents address the three problems that cause traditional SLO implementations to stall: instrumentation complexity, organizational alignment, and maintenance burden. They do this by automating the labor-intensive work that previously required specialized human expertise at every step. Tools like Datadog SLOs, Nobl9, and similar platforms have made it easier to define and track SLOs, but the underlying labor of selecting indicators, setting targets, and maintaining them over time remains largely manual.
Translating business language into measurable indicators. The first bottleneck in any SLO program is defining what to measure. This requires someone who understands both the business intent ("customers should be able to authenticate reliably") and the technical implementation (which logs capture authentication events, how success is defined, what edge cases exist). AI agents can bridge this gap by accepting natural language descriptions of desired outcomes and automatically selecting appropriate telemetry signals, writing the queries to compute the indicator, and establishing baseline targets from observed data.
Firetiger's agents, for example, translate natural-language outcome descriptions into concrete SLIs and continuously evaluate them. Given the instruction "monitor for errors affecting users across deployments," an agent can analyze available log and metric sources, identify authentication endpoints, compute current success rates, and propose an SLO with a target informed by historical behavior. The agent selects reliable indicators from available data, establishes targets based on observed baselines, and adjusts its approach as the system evolves. This removes the instrumentation barrier that stalls most manual SLO initiatives.
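One piece of this pipeline -- proposing a target from observed baselines -- can be sketched simply. This is an illustrative heuristic, not Firetiger's actual algorithm; the daily success rates are invented:

```python
# Hypothetical daily success rates observed over 30 days
daily_success = [0.9991, 0.9987, 0.9993, 0.9978, 0.9990, 0.9989] * 5

# One simple heuristic: set the target just below the worst observed day,
# so the proposed SLO reflects achievable behavior rather than an
# aspirational number nobody has ever hit
baseline = min(daily_success)
proposed_target = round(baseline - 0.0005, 4)

print(f"worst observed day: {baseline:.4%}, proposed SLO target: {proposed_target:.2%}")
```

A real agent would layer more judgment on top -- excluding known incidents, weighting recent data, and distinguishing systemic dips from one-off anomalies -- but the principle is the same: targets come from observed data, not guesswork.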
Continuously evaluating and adapting. Traditional SLOs are set and then slowly decay as systems change. New services are deployed, traffic patterns shift, customer usage evolves, and the original SLI definitions become less relevant. AI agents can continuously re-evaluate whether their measurements are still meaningful. If a new deployment changes the log format or adds a new authentication path, the agent can detect the change and update its queries accordingly. This turns SLO maintenance from a periodic human burden into a continuous automated process.
This adaptive capability is particularly important for fast-moving teams that ship frequently. One code generation platform needed to track build health across multiple programming languages, with new languages and configurations added regularly. Static SLO definitions would have required manual updates for each change. An automated system could detect schema changes and adapt its monitoring strategy while continuing to track persistent issues across those changes.
Ranking issues by business impact. Once you have per-customer SLOs evaluated automatically, the next challenge is prioritization. A system monitoring hundreds of customers will inevitably detect multiple issues simultaneously. Without automated ranking, this creates the same alert fatigue problem that plagues traditional monitoring. AI agents solve this by attaching business context to every detected violation.
Consider a concrete example: an agent monitoring authentication health detects a JWT secret loading failure. Rather than simply generating an alert, the agent quantifies the impact -- fixing the issue would eliminate approximately 310 authentication failures over a seven-day window and reduce the number of affected customer organizations from six to zero. This gives the engineering team an immediate understanding of both the severity and the scope, enabling them to make informed prioritization decisions without manual investigation.
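The impact computation itself amounts to attributing failures to a root cause and counting the blast radius. A minimal sketch with invented event data (the org names and cause labels are hypothetical):

```python
# Hypothetical auth-failure events over a 7-day window: (customer_org, root_cause)
events = ([("org-a", "jwt_secret_load"), ("org-b", "jwt_secret_load"),
           ("org-a", "bad_password"),    ("org-c", "jwt_secret_load")]
          + [("org-d", "jwt_secret_load")] * 3)

# Quantify the impact of fixing one root cause
cause = "jwt_secret_load"
attributable = [org for org, c in events if c == cause]
failures_eliminated = len(attributable)
orgs_affected = len(set(attributable))

print(f"Fixing {cause}: eliminates {failures_eliminated} failures "
      f"across {orgs_affected} customer orgs")
```

Stated this way -- N failures across M customer organizations -- a violation carries its own prioritization argument, which is what makes the cross-functional conversation tractable.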
This impact quantification also bridges the organizational alignment gap. When an SLO violation comes with a clear statement of business impact -- "six customer organizations are experiencing authentication failures, resulting in approximately 310 blocked requests over the past week" -- it is much easier to get cross-functional agreement on priority than when the alert simply says "auth error rate elevated."
Enabling proactive customer communication. Perhaps the most transformative capability of automated SLO management is shifting from reactive support to proactive notification. Multiple B2B companies report that their customers often discover problems before internal engineering teams do. This erodes trust and creates a dynamic where the customer success team is constantly firefighting.
With AI agents continuously evaluating per-customer SLOs, the organization can detect issues before customers notice them and communicate proactively. One approach draws a parallel to how utility companies notify customers about service disruptions: the customer receives a message explaining the issue, the expected resolution time, and any workaround -- before they experience the problem themselves. This is only possible when the monitoring system understands outcomes at the individual customer level, which is the foundation that per-customer SLOs provide.
Automated SLO management does not replace human judgment. Engineers still define what outcomes matter, set policies for how violations should be handled, and make architectural decisions that shape system reliability. What changes is the operational burden: the repetitive work of selecting indicators, writing queries, establishing baselines, monitoring continuously, investigating violations, and quantifying impact shifts from human engineers to automated agents. The engineers focus on the system design and policy decisions that only humans can make, while the agents handle the continuous operational work that scales beyond what any human team can maintain.
Where to start
- Pick one critical user journey: Choose the most important user-facing flow (e.g., login, checkout, API call) and define an SLI for it.
- Set an SLO with an error budget: Commit to a target (e.g., 99.9% over 30 days) and calculate how much failure that budget allows per month.
- Measure per-customer, not just globally: If you're B2B, break your SLI down by customer to see if aggregate numbers are masking individual customer pain.
- Automate SLO tracking: Deploy agent-driven SLO management (e.g., Firetiger) that translates business-language outcomes into measurable indicators and continuously evaluates them.
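The error budget calculation in the second step is straightforward. A worked example for the 99.9%-over-30-days target mentioned above, with an assumed traffic volume:

```python
# Error budget implied by a 99.9% SLO over a 30-day window
slo_target = 0.999
window_minutes = 30 * 24 * 60                         # 43,200 minutes
budget_minutes = window_minutes * (1 - slo_target)    # ~43.2 minutes of full downtime

# The same budget expressed as allowed failed requests,
# for an assumed volume of 10M requests/month
monthly_requests = 10_000_000
budget_requests = monthly_requests * (1 - slo_target)  # ~10,000 failed requests

print(f"budget: {budget_minutes:.1f} minutes of downtime "
      f"or {budget_requests:,.0f} failed requests per month")
```

Spending less than the budget means there is room to ship faster; burning through it is the signal to slow down and invest in stability.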