AI Agents for Operations

What is agent engineering?

Agent engineering is the discipline of designing, building, and operating AI agents that interact with software systems. It encompasses a specific set of problems that emerge when you move beyond a single LLM call and into systems where models use tools in loops, maintain state across long-running sessions, and operate with varying degrees of autonomy.

The field is distinct from prompt engineering (crafting effective inputs to models) and from traditional software engineering (building deterministic systems). Agent engineering sits at the intersection: you are building software systems where a core component -- the LLM -- is non-deterministic, and where the system's behavior emerges from the interaction between model reasoning, tool design, and environmental feedback. The engineering challenge is not making the model smarter. It is designing everything around the model so that it can be effective, safe, and reliable.

Three areas form the foundation of agent engineering: how you design the tools agents use, the languages you give them to express their intent, and how you manage context and state for agents that run for minutes, hours, or indefinitely. Each of these areas has its own set of hard-won patterns and trade-offs.

How should you design tools for AI agents?

Tool design is the single most impactful lever in agent engineering. Agents are, at their core, models with tools in a loop. The quality of those tools determines the ceiling of what the agent can accomplish.

The first principle of tool design is token efficiency. LLMs have finite context windows, and every token consumed by tool output is a token unavailable for reasoning. Poorly designed tools that blast 40,000 tokens of raw log output into the context window are not just wasteful -- they are actively harmful. The model loses track of its objective, fixates on irrelevant details in the massive output, and produces worse results than if it had received less information.

There are two primary approaches to managing large tool results, and the choice between them has significant performance implications.

Approach 1: Truncation. Cap tool results at a fixed token budget -- around 5,000 tokens is a practical ceiling. When results exceed this limit, truncate them but always inform the agent that truncation occurred. This notification is critical: it tells the agent that the data is incomplete and prompts it to reformulate its query with filters, aggregations, or narrower time ranges. Placing the truncation notice before the truncated results (rather than after) improves model performance, because the model encounters the caveat before processing the partial data.
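As a concrete sketch, the truncate-and-notify pattern might look like the following Python helper. The 4-chars-per-token estimate and the notice wording are illustrative assumptions; a production system would count tokens with the model's actual tokenizer.

```python
# Hypothetical truncation helper; the chars-per-token figure is a rough
# heuristic -- real code would use the model's tokenizer to count tokens.
MAX_TOOL_RESULT_TOKENS = 5_000
CHARS_PER_TOKEN = 4

def truncate_tool_result(text: str, max_tokens: int = MAX_TOOL_RESULT_TOKENS) -> str:
    """Cap a tool result at a token budget, placing the truncation notice
    BEFORE the partial data so the model reads the caveat first."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(text) <= budget:
        return text
    notice = (
        f"[NOTICE: result truncated to roughly {max_tokens} tokens. "
        "The data below is incomplete -- re-query with filters, "
        "aggregations, or a narrower time range.]\n"
    )
    return notice + text[:budget]
```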

Truncation works well enough for many cases, but it has a fundamental limitation: the agent is guessing blindly about what was cut. It knows the data was reduced but has no way to efficiently explore the full result set.

Approach 2: Stored artifacts with query tools. Instead of truncating results, store the full output as an artifact in durable storage and return a reference ID to the agent. Then provide the agent with query tools -- such as jq filters for JSON data -- that let it navigate, filter, and aggregate the full result set without loading it all into context.

For example, an agent investigating error patterns might first retrieve a large set of log entries that gets stored as an artifact. The agent can then run queries against that artifact: .[] | select(.error != null) to filter for errors, group_by(.service) | map({service: .[0].service, count: length}) to aggregate by service, or map(.attributes.service_name) | unique to extract distinct values. Each query returns only the specific slice of data the agent needs.
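A minimal sketch of the artifact pattern, with plain Python functions standing in for jq so the example stays dependency-free. The ArtifactStore interface is hypothetical; a real system would persist to durable storage and evaluate jq expressions against it.

```python
import uuid
from collections import Counter

class ArtifactStore:
    """In-memory stand-in for durable artifact storage (hypothetical API)."""
    def __init__(self):
        self._store = {}

    def put(self, records: list) -> str:
        """Store the full tool output; the agent sees only this reference ID."""
        artifact_id = str(uuid.uuid4())
        self._store[artifact_id] = records
        return artifact_id

    def query(self, artifact_id: str, fn) -> object:
        """Run a query against the stored artifact, returning only the
        slice the agent asked for -- never the whole result set."""
        return fn(self._store[artifact_id])

store = ArtifactStore()
logs = [
    {"service": "api", "error": "timeout"},
    {"service": "api", "error": None},
    {"service": "worker", "error": "oom"},
]
ref = store.put(logs)

# Equivalent of `.[] | select(.error != null)`:
errors = store.query(ref, lambda rs: [r for r in rs if r["error"] is not None])

# Equivalent of `group_by(.service) | map({service: .[0].service, count: length})`:
by_service = store.query(ref, lambda rs: dict(Counter(r["service"] for r in rs)))
```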

The performance difference is dramatic. One engineering team found that switching from truncation to the artifact-with-query approach produced a 5x improvement in median response time, dropping from 31 seconds to 6 seconds. The approach achieved 94% one-shot success rates -- meaning the agent got to a correct, useful answer on its first attempt -- with zero parsing errors across thousands of query invocations. The improvement comes from two sources: the agent spends far fewer tokens on raw data, and it spends far fewer reasoning steps reformulating queries after encountering truncated results.

A second principle of tool design is favoring fewer, more general tools over many specialized ones. A small set of generic tools that the agent deeply understands will consistently outperform a long list of specialized tools that the agent must choose between. Standard CRUD operations (get, list, create, update, delete) applied uniformly across all resource types, following conventions like Google's API Improvement Proposals (AIP), give the agent a consistent mental model. When every resource works the same way, the agent does not waste reasoning cycles figuring out which tool to use or how a particular tool's interface differs from another.
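One way to picture a uniform CRUD surface: a single dispatcher applied identically across every resource type, so the agent learns one interface rather than fifteen. The resource names and registry below are illustrative, not any particular platform's API.

```python
# Hypothetical resource registry: every resource type gets the same five
# verbs, so the agent's mental model transfers across all of them.
RESOURCES = {"dashboards": {}, "alerts": {}}

def handle_tool_call(verb, resource_type, resource_id=None, body=None):
    """Generic CRUD handler in the spirit of Google's AIP conventions."""
    collection = RESOURCES[resource_type]
    if verb == "list":
        return list(collection.values())
    if verb == "get":
        return collection[resource_id]
    if verb == "create":
        collection[body["id"]] = body
        return body
    if verb == "update":
        collection[resource_id].update(body)
        return collection[resource_id]
    if verb == "delete":
        return collection.pop(resource_id)
    raise ValueError(f"unknown verb: {verb}")
```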

Why do domain-specific languages make agents more effective?

Domain-specific languages (DSLs) -- small, purpose-built languages designed for a particular problem domain -- are one of the most underutilized tools in agent engineering. They provide a way to give agents expressive power within carefully defined boundaries.

The insight is that LLMs learn specialized syntax quickly. You do not need to train a model on a custom language; you provide documentation and examples in the system prompt, and the model begins writing valid expressions in that language with high reliability. This makes DSLs practical in a way they were not before: the cost of creating a new language used to include teaching it to every developer who would use it, but now the primary consumer is an LLM that adapts in-context.

Why not just give agents SQL? SQL is powerful, but it is also dangerous. An agent with raw SQL access can construct queries that scan terabytes of data, drop tables, or exfiltrate sensitive information. The traditional mitigation -- wrapping SQL in an ORM or sanitizing inputs -- is fragile when the input comes from an LLM that is creative by design.

A DSL solves this by transpilation with safety guarantees. You design a constrained query language that expresses the operations your agent legitimately needs, build a parser that enforces structural rules, and transpile valid expressions into the underlying query language (such as SQL). The parser acts as a safety boundary: if the agent generates an expression that does not conform to the grammar, it is rejected before execution, and the error message helps the agent correct its syntax.
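To make the transpilation idea concrete, here is a toy constrained filter language in Python -- far smaller than a real DSL, with a hypothetical field allowlist -- whose parser rejects anything outside the grammar before SQL is ever emitted.

```python
import re

# Assumed schema allowlist: only these fields may appear in a filter.
ALLOWED_FIELDS = {"service", "status_code", "duration_ms"}
_CLAUSE = re.compile(r'\s*(\w+)\s*(=|!=|>|<)\s*("(?:[^"]*)"|\d+)\s*')

def transpile_filter(expr: str) -> str:
    """Parse a constrained filter expression and emit a SQL WHERE clause.

    Anything that does not match the grammar raises ValueError before it
    can reach the database -- the parser is the safety boundary."""
    sql_parts = []
    rest = expr
    while True:
        m = _CLAUSE.match(rest)
        if not m:
            raise ValueError(f"syntax error near: {rest!r}")
        field, op, value = m.groups()
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field}")
        if value.startswith('"'):
            # Escape embedded quotes so the emitted SQL is always safe.
            value = "'" + value[1:-1].replace("'", "''") + "'"
        sql_parts.append(f"{field} {op} {value}")
        rest = rest[m.end():]
        if not rest:
            break
        if not rest.startswith("AND "):
            raise ValueError(f"expected AND near: {rest!r}")
        rest = rest[4:]
    return "WHERE " + " AND ".join(sql_parts)
```

An injection attempt like `service = "api"; DROP TABLE users` fails the grammar check and never produces SQL at all.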

For example, Firetiger built two DSLs for its agents: Confit SQL, a constrained query language that transpiles to DuckDB SQL with mandatory tenant isolation and time-range filtering; and a filtering language based on Google's AIP-160 standard for querying platform resources. Confit SQL forbids dangerous SQL functions using an allowlist approach, dynamically rewrites FROM clauses to enforce data access controls, and performs predicate pushdown optimization for time-partitioned data. The implementation consists of a lexer (roughly 300 lines of code), a parser (roughly 5,000 lines), and an evaluator that transpiles valid queries into DuckDB SQL. The agent writes queries in the constrained language, and the system guarantees that the resulting SQL is safe and efficient.

The AIP-160-based filtering language converts simple filter expressions into escaped SQL WHERE clauses, supports wildcards and array-valued fields, and evolved over time based on what agents naturally attempted to do. When an agent tried to express a filter that the language did not support, the failed attempt signaled a genuine need, and the language was extended.

LLMs are also remarkably effective at writing the parsers themselves. Recursive descent parsers -- the most common architecture for DSLs -- are tedious to write by hand but highly structured and well-suited for AI generation. Test-driven development fits naturally: you define the expected parse tree for a set of input expressions, and the model generates parser code that satisfies the tests. Fuzzing can then exercise the parser with random inputs to establish trust in the safety boundary.
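A fuzzing harness for such a parser can be very small. The sketch below uses a trivial stand-in parser; the property being exercised is the one that matters for the safety boundary: every input either parses or raises a controlled error, never anything else.

```python
import random
import string

def parse(expr: str):
    """Trivial stand-in for a real DSL parser: accepts `<field> = <value>`."""
    parts = expr.split()
    if len(parts) == 3 and parts[1] == "=" and parts[0].isidentifier():
        return ("eq", parts[0], parts[2])
    raise ValueError("syntax error")

def fuzz_parser(rounds: int = 10_000) -> None:
    """Hammer the parser with random inputs: each one must either parse or
    raise ValueError -- any other exception breaks the safety boundary."""
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(rounds):
        s = "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 40)))
        try:
            parse(s)
        except ValueError:
            pass  # rejection is the safe, expected outcome

fuzz_parser()
```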

The broader principle is that DSLs replace sprawling tool parameter lists with structured, parseable intent. Instead of a tool with fifteen optional parameters for filtering, sorting, and aggregating data, you give the agent a small language that expresses the same operations more naturally and safely.

How do you manage context and state for long-running agents?

Short-lived agents -- those that handle a single question-and-answer exchange -- can hold their entire state in the LLM's context window. But agents that run investigations lasting minutes or hours, or agents that continuously monitor production systems, need a different approach to state management.

The fundamental challenge is that an LLM invocation is stateless. The model does not remember previous calls. Any continuity across turns must be explicitly managed by the system surrounding the model. For a long-running agent, this means designing a session engine that maintains state across an arbitrary number of steps.

One effective architecture treats agents as ephemeral compute transforming immutable state snapshots, inspired by how Git manages repository state. In this model, each conversation unit -- a user message, an assistant response, a tool call, a tool result -- is a discrete object stored in content-addressable storage. A session's state at any point is represented by an ordered array of descriptors referencing these objects. Replaying the objects in order recreates the full session state.

The execution cycle works as follows. A trigger creates a new session with a system prompt, configured tool connections, and an initial message -- this is snapshot 1. An event notification invokes a compute function, which loads the snapshot, replays the state, executes one work cycle (one or more LLM calls with tool use), and writes a new snapshot. The cycle repeats until the agent reaches a terminal state.
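The snapshot cycle can be sketched in a few lines of Python. The in-memory ObjectStore below stands in for real content-addressable storage, and the message shapes are illustrative.

```python
import hashlib
import json

class ObjectStore:
    """Content-addressable storage: objects keyed by the hash of their bytes."""
    def __init__(self):
        self._objects = {}

    def put(self, obj: dict) -> str:
        data = json.dumps(obj, sort_keys=True).encode()
        digest = hashlib.sha256(data).hexdigest()
        self._objects[digest] = data
        return digest  # the descriptor: a lightweight reference, not content

    def get(self, digest: str) -> dict:
        return json.loads(self._objects[digest])

def replay(store: ObjectStore, snapshot: list) -> list:
    """A snapshot is an ordered list of descriptors; replaying them in
    order reconstructs the full session state."""
    return [store.get(d) for d in snapshot]

store = ObjectStore()
# Snapshot 1: system prompt plus the initial message.
snapshot_v1 = [
    store.put({"role": "system", "content": "You are an SRE agent."}),
    store.put({"role": "user", "content": "Investigate the error spike."}),
]
# One work cycle: load, replay, do work, write snapshot 2.
state = replay(store, snapshot_v1)
snapshot_v2 = snapshot_v1 + [
    store.put({"role": "assistant", "content": "Querying logs..."}),
]
```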

The key property of this design is that every invocation is a pure function: snapshot in, snapshot out. Agent instances are ephemeral and can execute anywhere. This yields several important benefits.

Crash recovery. Because state persists independently of the compute process, a crashed invocation loses no work. The next invocation picks up from the last committed snapshot. There is no in-memory state to reconstruct or lose.

Concurrent resolution. When multiple invocations compete -- perhaps triggered by overlapping events -- the system uses atomic conditional writes at the storage layer. Only one producer can write snapshot version N+1; losing invocations detect the conflict and retry against the updated state. The object store becomes the coordination primitive, eliminating the need for distributed locks or consensus protocols.

Safe retry. Because each invocation is a pure state transition, retries are safe by construction. A failed invocation can be retried without risk of duplicating side effects, because the snapshot it would have produced either was written (success) or was not (safe to retry).
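The conditional-write pattern behind concurrent resolution and safe retry can be sketched as a compare-and-swap loop. Here a threading lock stands in for the object store's atomic conditional write (such as an If-Match or if-generation-match precondition); the class and method names are illustrative.

```python
import threading

class SnapshotLog:
    """Coordinates competing invocations via compare-and-swap on the
    snapshot version, mimicking an object store's conditional write."""
    def __init__(self):
        self._lock = threading.Lock()  # stands in for the store's atomicity
        self._versions = [[]]          # version 0: the empty snapshot

    def head(self):
        return len(self._versions) - 1, self._versions[-1]

    def try_commit(self, expected_version: int, new_snapshot: list) -> bool:
        with self._lock:
            if len(self._versions) - 1 != expected_version:
                return False  # another producer wrote N+1 first
            self._versions.append(new_snapshot)
            return True

def run_invocation(log: SnapshotLog, new_objects: list) -> None:
    """Pure state transition: load head, do work, conditionally commit.
    Losing a race is harmless -- just reload the updated state and retry."""
    while True:
        version, snapshot = log.head()
        if log.try_commit(version, snapshot + new_objects):
            return  # we produced snapshot version+1
```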

This architecture keeps sessions small even over long investigations. A session might grow to a few hundred objects, but the snapshot itself contains only descriptors -- lightweight references rather than the full content of every message and tool result.

Sandboxed execution is the other critical component of long-running agent infrastructure. When agents can execute arbitrary commands -- running scripts, querying databases, calling APIs -- the execution environment must be genuinely isolated. This means more than Docker containers. Effective sandboxing uses Linux kernel namespaces for network and filesystem isolation, routes all outbound traffic through a local proxy that enforces domain-level and operation-level access controls, and uses per-session TLS certificates for full visibility into encrypted traffic. The agent gets a real bash environment where it can write and execute code, but every action passes through security boundaries that prevent lateral movement, data exfiltration, or resource abuse.
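The proxy's policy check at the heart of that design can be sketched simply. The domains and rules below are illustrative assumptions, not a real policy; a production proxy would also handle TLS interception and per-session certificates.

```python
# Hypothetical egress policy: each allowed domain maps to the set of
# HTTP methods the sandboxed agent may use against it.
EGRESS_POLICY = {
    "api.github.com": {"GET"},
    "logs.internal.example": {"GET", "POST"},
}

def allow_request(host: str, method: str) -> bool:
    """Domain-level and operation-level access control: the proxy consults
    this before forwarding any outbound request from the sandbox."""
    return method.upper() in EGRESS_POLICY.get(host, set())
```

Denied requests never leave the sandbox, which is what blocks exfiltration to arbitrary hosts even when the agent runs attacker-influenced code.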

The combination of immutable state management and sandboxed execution creates a foundation where agents can run for extended periods -- investigating complex incidents, monitoring system behavior across deployment cycles, or performing multi-step remediation tasks -- without the brittleness that typically plagues long-running software systems.

Where to start

  • Audit your current tool interfaces: If you're building agents, review whether your tools return manageable amounts of data or blast thousands of tokens per call.
  • Implement artifact-based tool results: Instead of returning large datasets directly, store results and give agents query tools (like jq filters) to explore them.
  • Consider a DSL for your highest-risk operations: If agents interact with databases or infrastructure, build a constrained query language rather than giving raw SQL access.
  • Study existing implementations: Look at how platforms like Firetiger designed their agent tooling (Confit SQL, AIP-160 filtering) for practical examples of production DSLs.

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.