What are domain-specific languages (DSLs) for AI agents?
A domain-specific language (DSL) is a programming language designed to solve problems within a particular domain, as opposed to a general-purpose language like Python or JavaScript, which is designed to handle problems in any domain. DSLs are everywhere in software: SQL is a DSL for querying relational data, regular expressions are a DSL for pattern matching, and CSS is a DSL for styling web pages. What these languages share is a deliberate trade-off: they give up generality in exchange for expressiveness and safety within their specific domain.
In the context of AI agents, DSLs serve a specific and increasingly important role. When an agent needs to interact with a system -- querying a database, filtering log data, configuring infrastructure -- it needs some language to express its intent. The obvious choice might seem to be an existing general-purpose language. But general-purpose languages come with general-purpose risks: an agent writing arbitrary Python can do almost anything, including things that are dangerous, expensive, or nonsensical. DSLs provide a middle path: they give agents enough expressiveness to accomplish their tasks while enforcing hard constraints that prevent harmful operations.
The insight driving adoption of DSLs for agents is that the interface between an agent and the systems it operates on is a trust boundary, and trust boundaries deserve purpose-built tools. Just as you would not give a database administrator the root password to every server in your infrastructure, you should not give an AI agent an unconstrained programming language when a constrained one would suffice. DSLs are how you scope agent capabilities to exactly what is needed and nothing more.
Why not just let agents write SQL or Python directly?
The most immediate reason is security. General-purpose languages contain capabilities that are dangerous in the hands of an unsupervised automated system. SQL, for example, includes DROP TABLE, DELETE without WHERE clauses, and database-specific functions that can access the filesystem or execute system commands. Python can open network connections, read and write arbitrary files, and import any library installed on the system. Even if an agent is well-intentioned (or, more precisely, well-prompted), the attack surface of a general-purpose language is enormous. A prompt injection, an unexpected edge case, or a misinterpretation of context could lead to an agent executing a destructive operation.
DSLs enforce hard constraints at the language level. A query language designed for agents can simply omit destructive operations from its grammar. If the language does not have a DROP TABLE statement, the agent cannot drop a table, regardless of what its prompt says or what input it receives. This is a fundamentally stronger guarantee than prompt engineering or output filtering, because it operates at the parser level rather than the inference level. The agent physically cannot express a dangerous operation because the language does not contain the vocabulary for it.
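To make this concrete, here is a minimal sketch in Python of a grammar-level restriction. The toy language and its single statement form are hypothetical: because the grammar's only production begins with SELECT, a destructive statement is not blocked by a rule -- it simply has no way to parse.

```python
# Toy grammar (hypothetical): the only statement form is a SELECT query.
# DROP, DELETE, and friends are not forbidden keywords -- they are simply
# absent from the grammar, so they can never be expressed.
def parse_statement(text: str) -> list[str]:
    tokens = text.split()
    if not tokens or tokens[0].upper() != "SELECT":
        found = tokens[0] if tokens else "<empty>"
        raise SyntaxError(
            f"unknown statement {found!r}: this language only contains SELECT")
    return tokens
```

Here `parse_statement("SELECT name")` succeeds, while `parse_statement("DROP TABLE users")` raises before any backend is ever reached -- the rejection happens at the parser level, independent of prompts or inputs.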
Token efficiency is the second compelling reason. Large language models process text as tokens, and every token consumed by tool descriptions, API schemas, or language syntax is a token not available for reasoning. General-purpose approaches to agent capabilities often involve defining dozens of individual tools with extensive parameter lists: list_issues, list_issues_for_user, list_issues_for_organization, each with its own documentation. This approach wastes tokens on permutations that could be expressed more concisely. A DSL compresses intent into fewer tokens: instead of choosing among twenty specialized tools, the agent writes a single expression in a purpose-built language. The result is more room for reasoning in the model's context window and faster, cheaper inference.
Reliability is the third advantage. When an agent writes code in a general-purpose language and makes a syntax error, the error message from the Python interpreter or SQL engine may be cryptic, verbose, or misleading. DSL parsers can be designed to produce clear, actionable error messages that help the agent self-correct. Because the language is small and purpose-built, the parser has enough context to say "you used FILTER BY but the correct keyword is WHERE" rather than a generic syntax error. This self-correction loop is critical for agent reliability: agents that can recover from errors gracefully are dramatically more useful than agents that fail silently or require human intervention to get back on track.
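Because the keyword set is small, this kind of diagnostic is cheap to implement. The sketch below (keyword set hypothetical) uses Python's difflib to suggest the nearest valid keyword when the agent writes a near-miss; a production parser might additionally map common synonyms (such as FILTER to WHERE) explicitly.

```python
import difflib

# Hypothetical keyword vocabulary for a small query DSL.
KEYWORDS = {"select", "where", "order", "limit"}

def check_keyword(word: str) -> str:
    """Validate a keyword, suggesting the nearest match on failure."""
    w = word.lower()
    if w in KEYWORDS:
        return w
    # A small vocabulary means fuzzy matching is fast and usually right.
    suggestion = difflib.get_close_matches(w, KEYWORDS, n=1, cutoff=0.5)
    hint = f"; did you mean {suggestion[0].upper()!r}?" if suggestion else ""
    raise SyntaxError(f"unknown keyword {word!r}{hint}")
```

An agent that writes `WHRE` gets back "unknown keyword 'WHRE'; did you mean 'WHERE'?" -- an error it can act on directly in its next attempt.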
There is also a more subtle advantage: DSLs create a well-defined surface for testing. Because the language is constrained, you can enumerate its capabilities and write comprehensive tests. You can fuzz the parser with random inputs to verify that it handles malformed code gracefully. You can prove properties about what the language can and cannot express. None of this is feasible with a general-purpose language, where the space of possible programs is effectively infinite.
How do you build and test a DSL for agents?
Building a DSL starts with the domain, not the syntax. The first question is: what operations does the agent need to perform? If the agent needs to query a multi-tenant data lake, the operations might be filtering, aggregation, time-range selection, and field projection. If the agent needs to configure notification routing, the operations might be condition matching, destination selection, and priority assignment. The domain determines the vocabulary; the vocabulary determines the grammar.
The standard architecture for a DSL has three layers: a lexer, a parser, and an evaluator. The lexer breaks raw text into tokens (keywords, operators, identifiers, literals). The parser arranges tokens into a structured representation, typically an abstract syntax tree (AST). The evaluator traverses the AST and does the actual work -- executing a query, applying a configuration change, or generating output.
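As a deliberately tiny illustration, here is a sketch of all three layers in Python for a hypothetical filter language supporting expressions like `status = "open" AND priority = "high"`. The token set, AST shapes, and semantics are assumptions made for the example, not a production design.

```python
import re
from dataclasses import dataclass

# --- Lexer: raw text -> token stream -------------------------------------
TOKEN_SPEC = [
    ("AND",    r'\bAND\b'),
    ("EQ",     r'='),
    ("STRING", r'"[^"]*"'),
    ("IDENT",  r'[A-Za-z_]\w*'),
    ("SKIP",   r'\s+'),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text: str) -> list[tuple[str, str]]:
    tokens, pos = [], 0
    for m in TOKEN_RE.finditer(text):
        if m.start() != pos:  # a character no rule matched
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    if pos != len(text):
        raise SyntaxError(f"unexpected character at position {pos}")
    return tokens

# --- Parser: token stream -> AST -----------------------------------------
@dataclass
class Cond:
    field: str
    value: str

@dataclass
class And:
    left: object
    right: object

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def expect(self, kind: str) -> str:
        if self.i >= len(self.tokens) or self.tokens[self.i][0] != kind:
            found = self.tokens[self.i][1] if self.i < len(self.tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found!r}")
        tok = self.tokens[self.i]
        self.i += 1
        return tok[1]

    def parse(self):
        node = self.cond()
        while self.i < len(self.tokens) and self.tokens[self.i][0] == "AND":
            self.i += 1
            node = And(node, self.cond())
        if self.i != len(self.tokens):
            raise SyntaxError(f"trailing input: {self.tokens[self.i][1]!r}")
        return node

    def cond(self):
        field = self.expect("IDENT")
        self.expect("EQ")
        value = self.expect("STRING")
        return Cond(field, value.strip('"'))

# --- Evaluator: AST -> result --------------------------------------------
def evaluate(node, record: dict) -> bool:
    if isinstance(node, Cond):
        return record.get(node.field) == node.value
    return evaluate(node.left, record) and evaluate(node.right, record)
```

With these pieces, `evaluate(Parser(lex('status = "open"')).parse(), {"status": "open"})` returns True, and any malformed input is stopped at the lexer or parser with a specific error before the evaluator ever runs.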
This architecture might sound heavyweight, but in practice, each layer can be quite compact. One production implementation of a constrained query language used approximately 300 lines for the lexer (a hand-written, single-pass, byte-by-byte tokenizer) and 5,000 lines for the parser (a recursive descent parser that builds an AST). The evaluator handled predicate pushdown, query rewriting, permissions enforcement, and optimization before transpiling to an existing query engine. The total investment was substantial but not extraordinary, and it produced a language that was meaningfully safer and more efficient than the alternatives.
A key design decision is whether to build a query engine from scratch or transpile to an existing one. In almost all cases, transpiling is the right choice. The world does not need another SQL execution engine. What it needs is a constrained front-end that validates, restricts, and optimizes queries before passing them to a mature execution backend. For example, a custom query language might parse agent-written queries, enforce that every query includes a time-range filter (preventing expensive full-table scans), rewrite dynamic function calls into static literals for better optimization, and then emit standard SQL for execution by an established engine like DuckDB or Postgres. This approach leverages decades of optimization work in the underlying engine while adding the safety and expressiveness layer that agents need.
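The shape of that front-end can be sketched as follows. Everything here is illustrative -- the query representation, table, and column names are assumptions -- but it shows the pattern: validate, enforce invariants, then emit plain SQL for a mature backend. (The sketch interpolates strings for brevity; a real transpiler should escape or parameterize values.)

```python
def transpile(query: dict) -> str:
    """Emit backend SQL from a validated query description (hypothetical)."""
    # Enforced invariant: no query without a time range, preventing
    # expensive full-table scans regardless of what the agent writes.
    if "start" not in query or "end" not in query:
        raise ValueError("every query must include a time range "
                         "('start' and 'end') to avoid full-table scans")
    conditions = [f"timestamp >= '{query['start']}'",
                  f"timestamp < '{query['end']}'"]
    for field, value in query.get("filters", {}).items():
        conditions.append(f"{field} = '{value}'")
    cols = ", ".join(query.get("select", ["*"]))
    return (f"SELECT {cols} FROM {query['table']} "
            f"WHERE {' AND '.join(conditions)}")
```

The emitted string is ordinary SQL that an engine like DuckDB or Postgres can execute; the safety properties live entirely in the front-end.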
Testing is where DSLs truly shine compared to general-purpose approaches. Parsers are ideal candidates for test-driven development: the inputs are strings, the outputs are structured data, and the boundary between valid and invalid inputs is precisely defined. You can write hundreds of test cases covering normal usage, edge cases, and explicitly invalid inputs with minimal effort. Fuzz testing is particularly valuable because it exercises the parser with random, potentially adversarial inputs -- exactly the kind of input an agent might produce when confused or when processing unexpected data. If the parser handles fuzzed input gracefully (returning clear errors rather than crashing or producing undefined behavior), you can have high confidence in its robustness.
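A minimal fuzz harness for such a parser can be this short. The `parse()` function below is a hypothetical stand-in for a real DSL parser; the property being checked is that every input either parses or raises a clean SyntaxError -- never any other exception.

```python
import random
import string

def parse(text: str) -> list[str]:
    """Stand-in for a real DSL parser: accepts 'SELECT <ident>' only."""
    tokens = text.split()
    if len(tokens) != 2 or tokens[0] != "SELECT" or not tokens[1].isidentifier():
        raise SyntaxError(f"invalid query: {text!r}")
    return tokens

def fuzz(iterations: int = 10_000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(iterations):
        text = "".join(rng.choice(string.printable)
                       for _ in range(rng.randint(0, 40)))
        try:
            parse(text)          # valid input is fine
        except SyntaxError:
            pass                 # clean rejection is the expected outcome
        # Any other exception propagates and fails the fuzz run.

fuzz()
```

A fixed seed keeps failures reproducible; a real harness would also save each failing input as a permanent regression test.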
The testing strategy extends beyond the parser to the trust boundary itself. Every DSL defines what an agent can and cannot do. The test suite should verify both sides: that the agent can express all operations it legitimately needs, and that it cannot express any operation that would be dangerous. This dual verification -- testing for both capability and safety -- is the foundation of trust in autonomous agent systems.
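In executable form, that dual verification can look like the sketch below, where the stand-in parser and both query lists are hypothetical: one list of operations that must parse, one list that must never parse.

```python
MUST_PARSE = ["SELECT level", "SELECT message"]
MUST_REJECT = ["DROP TABLE logs", "DELETE FROM logs",
               "SELECT level; DROP TABLE logs"]

def parse(text: str) -> list[str]:
    """Stand-in parser: the only statement form is 'SELECT <ident>'."""
    tokens = text.split()
    if len(tokens) != 2 or tokens[0] != "SELECT" or not tokens[1].isidentifier():
        raise SyntaxError(f"invalid query: {text!r}")
    return tokens

def test_trust_boundary():
    # Capability: everything the agent legitimately needs must parse.
    for query in MUST_PARSE:
        assert parse(query)
    # Safety: nothing dangerous may be expressible.
    for query in MUST_REJECT:
        try:
            parse(query)
            raise AssertionError(f"dangerous query accepted: {query!r}")
        except SyntaxError:
            pass

test_trust_boundary()
```

As the language grows, both lists grow with it, so the trust boundary stays pinned down in executable form rather than in documentation alone.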
LLMs are, perhaps ironically, excellent tools for building DSLs. Writing a recursive descent parser is tedious, repetitive work that follows well-established patterns. It is exactly the kind of code that LLMs generate reliably, because massive code corpora contain thousands of parser implementations. The human designer defines the grammar and the constraints; the LLM generates the implementation. This partnership -- human judgment on what the language should do, machine productivity on implementing the mechanics -- is remarkably effective.
An interesting pattern that emerges in practice is that DSLs evolve based on agent behavior. When agents attempt operations that the DSL does not support, those attempts provide signal about capabilities the DSL might need to add. Firetiger uses two production DSLs for its agents: Confit SQL, a restricted query language that compiles to DuckDB SQL with enforced multi-tenant isolation, and a resource filtering language based on Google's AIP-160 standard. Both were built with LLM-generated recursive descent parsers and tested with fuzzing. In developing these languages, Firetiger found that agents naturally attempted to use wildcard matching and array-valued field filtering before those features existed. Rather than dismissing these as errors, the team treated them as feature requests and expanded the language accordingly. This feedback loop -- observing what agents try to do and refining the DSL to support legitimate use cases -- produces languages that are well-matched to actual agent needs.
The broader trend is clear: as organizations deploy more AI agents with more operational authority, the interfaces between those agents and the systems they manage will increasingly be purpose-built DSLs rather than general-purpose languages. DSLs provide the right balance of expressiveness and constraint, they are testable in ways that general-purpose code is not, and they create the kind of well-defined trust boundaries that make autonomous operation safe. For teams building agent-powered systems, investing in DSL design is investing in the foundation that makes everything else possible.
Where to start
- Identify your riskiest agent-accessible operations: Determine which systems agents interact with where a mistake could cause damage (databases, infrastructure, APIs).
- Define the operations agents actually need: List the specific queries and actions agents perform -- this becomes the scope of your DSL.
- Build a parser, not just validation: Use a proper lexer/parser/evaluator architecture rather than regex-based validation, so agents get clear error messages for self-correction.
- Test with fuzzing: Generate thousands of random inputs to verify your DSL parser correctly rejects dangerous operations and handles edge cases.