Learning Center
A free educational resource for engineering leaders, SREs, and platform engineers who want to deepen their understanding of modern software reliability and operations.
Outcome Engineering
Defining, measuring, and achieving software reliability outcomes in terms that matter to users and the business.
- What are SLOs, SLIs, and SLAs? SLOs, SLIs, and SLAs form a hierarchy for measuring and committing to software reliability. Learn how they work, why traditional implementations stall, and how per-customer SLOs and AI agents change the equation.
- What are agent-driven operations? Agent-driven operations use autonomous AI agents to observe, investigate, and triage production issues without constant human direction. Learn how the shift from reactive alerting to proactive agent monitoring works.
- What is high-cardinality data in observability? High-cardinality data contains dimensions with many unique values, like customer IDs or trace IDs. Traditional observability tools struggle with it, but modern columnar architectures handle it efficiently.
- What is observability and how is it different from monitoring? Observability lets you ask arbitrary questions about your systems using telemetry data. Learn the difference between observability and monitoring, the three pillars of telemetry, and why dashboards alone are not enough.
- What is outcome engineering? Outcome engineering is the practice of defining desired software outcomes and using automated systems to continuously achieve them, moving beyond passive observability toward active reliability.
- What is per-customer observability? Per-customer observability monitors system behavior at the individual customer level rather than in aggregate, helping B2B SaaS companies detect issues that global metrics miss.
Change Management
Shipping code safely and quickly as AI-assisted development accelerates the pace of change.
- How does AI-assisted development change deployment risk? AI coding agents like Claude Code, Cursor, and OpenAI Codex increase PR volume 3-10x, but the code they produce may lack deep human review. Learn how deployment risk changes when AI writes the code and what safeguards keep teams shipping safely.
- What is automated rollback? Automated rollback reverts a deployment when monitoring detects it is causing harm, without requiring human intervention. Learn when to use it, when to avoid it, and the prerequisites for doing it safely.
- What is change failure rate? Change failure rate measures the percentage of deployments that cause production failures. Learn how to measure it accurately, why subtle failures are easy to miss, and how to reduce it without slowing down.
- What is change management in software engineering? Change management controls how code changes move from development to production, balancing deployment velocity with incident risk. Learn why it is getting harder and what mature processes look like.
- What is deployment monitoring? Deployment monitoring is automated, context-aware observability that activates when new code reaches production. Learn how it differs from traditional APM and why it helps teams ship faster.
- What is a progressive rollout? A progressive rollout deploys changes to progressively larger segments of users, starting with the lowest-risk group. Learn common strategies, why most teams fail, and how AI agents can close the gap.
- What is release verification? Release verification confirms that a deployed change is functioning correctly and not causing regressions. Learn why manual verification is unsustainable at scale and what automated verification should check.
Incident Response
Systematically detecting, investigating, and resolving production incidents.
- What is a postmortem? A postmortem is a structured review conducted after an incident is resolved, focused on understanding what happened, why, and how to prevent recurrence. Learn what makes postmortems effective and how to build a learning culture around them.
- What is alert fatigue? Alert fatigue is the desensitization that occurs when teams receive too many alerts, many of which are low-priority or false positives. Learn what causes it, its consequences, and practical strategies to reduce it without missing real issues.
- What is incident response? Incident response is the structured process of detecting, triaging, investigating, and remediating production issues. Learn how modern teams handle incidents and how AI agents are transforming the practice.
- What is mean time to recovery (MTTR)? Mean time to recovery (MTTR) measures how quickly an organization restores service after an incident. Learn what drives slow recovery, the related metrics MTTD and MTTI, and systematic approaches to reducing MTTR.
- What is root cause analysis? Root cause analysis (RCA) is the systematic process of identifying the fundamental cause of an incident. Learn why it is time-consuming, how agents can automate it, and what makes a good RCA.
AI Agents for Operations
Autonomous AI agents that observe, reason, and act on production environments.
- What are Agent SLOs? Agent SLOs are service level objectives that AI agents define, evaluate, and act upon autonomously, translating business-language intent into measurable metrics. Learn how they work and how they differ from traditional SLO tools.
- What are AI agents for software operations? AI agents for operations are autonomous software systems that use large language models and tooling to observe, analyze, and act on production systems continuously. Learn how they work, what they can do today, and how they differ from chatbots and copilots.
- What are domain-specific languages (DSLs) for AI agents? Domain-specific languages (DSLs) are purpose-built programming languages that give AI agents constrained, expressive interfaces for specific problem domains. Learn why DSLs outperform raw SQL or Python for agent tasks, and how to build and test them.
- What is agent engineering? Agent engineering is the discipline of designing, building, and operating AI agents that interact with software systems. Learn about tool design, domain-specific languages, and context management for long-running agents.
- What is autonomous remediation? Autonomous remediation is the practice of using automated systems to detect, diagnose, and fix production issues without human intervention. Learn about the trust spectrum, prerequisites for safety, and what kinds of issues can be autonomously remediated today.
Observability Architecture
Infrastructure patterns, data formats, and design tradeoffs behind modern observability.
- How do you choose an observability platform? A practical guide to evaluating observability platforms across SaaS, open-source, managed, BYOC, and agent-driven categories. Compare pricing models, migration strategies, and key evaluation criteria.
- What is a data lake for observability? Learn how data lakes store telemetry on low-cost object storage in open formats, offering flexible and affordable alternatives to traditional observability databases.
- What is Apache Iceberg? Understand Apache Iceberg, the open table format that enables SQL access to data on object storage, and why it matters for observability and real-time analytics.
- What is BYOC (Bring Your Own Cloud) observability? Understand the BYOC deployment model for observability, where your telemetry data stays in your own cloud account while a vendor manages the platform.
- What is observability cost optimization? Learn why observability costs spiral out of control and how engineering teams can reduce spend by 60% or more through metric auditing, cardinality management, data tiering, and architectural shifts.
- What is OpenTelemetry? Learn how OpenTelemetry provides a vendor-neutral standard for instrumenting, collecting, and exporting traces, metrics, and logs across your entire stack.
Database Operations
Monitoring database performance, diagnosing bottlenecks, and maintaining operational health.
- What is autonomous database management? Learn how AI agents continuously monitor, diagnose, and optimize database health using the observe-triage-act methodology, reducing the need for dedicated DBAs.
- What is database performance monitoring? Understand which database metrics to track and how proactive monitoring prevents production incidents caused by query regressions, bloat, and capacity issues.