Your Evals are your Agent
Over the last 7 to 9 months, the most interesting software engineering work I've done has been building an AI agent eval system, and it's also been some of the most rewarding work of my career. I want to share a few ideas from that experience: nothing proprietary, just the parts that I think actually matter.
Definitions
These terms are still a bit fluid, so this is just how I'm using them here.
An AI Agent is an LLM-based system that can take a task in some domain and run with it on its own. In practice these tasks are short, ranging from seconds to minutes and sometimes up to an hour. That will probably change, but that's where things are today.
An Agent Eval is a way to score how well the agent actually did the thing you care about. This is different from model evals, which are about general capability. Agent evals are about whether this system actually worked in the real use case.
The Core Idea
Most teams think they are building AI agents, but in reality they are constantly tweaking prompts, model choices, how tools are wired together, and the overall flow. All of that changes. The only stable piece is what "good" means, and that lives in the eval. The eval defines success, decides what ships, and shapes every change you make. Your eval is your agent.
What We Were Actually Iterating On
Our system changed constantly: prompts changed, flows changed, and even the structure of the agent changed. But that's not really what we were iterating on; we were iterating toward domain outcomes. That's the stable part. Prompts, models, and architectures all change, but outcomes don't. Once you have a clear definition of a good outcome, everything else becomes trial and error in service of that.
The Common Mistake
Most teams start by messing with prompts because it feels like progress, but it usually isn't. Without an eval you don't actually know if things are getting better; you just know the output is different.
Even with a PM who really understands the domain, jumping straight into prompts is a mistake. That intuition is far more useful when it's applied to defining success, defining failure, and deciding what tradeoffs matter. In other words, building the eval.
A Concrete Example (Thought Exercise)
Imagine a simple research agent: you give it a question and it uses the internet to return an answer. Something like "Find 3 companies in X space, compare pricing, and recommend one." Not a crazy task, just something a human could do in about 30 minutes. The agent searches, reads a few pages, pulls out key info, and returns a structured answer.
At first glance this kind of system looks pretty good. It writes clean summaries, sounds confident, and even cites sources, so the natural move is to tweak prompts. Tell it to be more structured, ask for clearer reasoning, add rules about citing sources. The outputs improve, at least visually, becoming more organized, more detailed, and more professional. But you still don't actually know if it's better.
So you build an eval. Keep it simple: a small set of questions with expected outcomes. Not exact answers, but a handful of checks like:
- Did it find reasonable candidates?
- Did it correctly compare the key attributes?
- Did it miss any obvious options?
- Were the sources actually relevant?
For each run you assign a simple score, mostly binary checks with a bit of weighting:
- Correct candidates found: 0 or 1
- Key comparison included: 0 or 1
- Irrelevant sources: small penalty
Then you average across a dataset. Nothing fancy, just enough to turn "this looks good" into a number you can track. The hard part isn't the scoring setup, it's defining the metrics that actually align with your intended outcome.
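To make that concrete, here's a rough sketch of what a scorer like this could look like in Python. Everything in it is illustrative: the `EvalCase` fields, the check names, and the 0.1 penalty per irrelevant source are placeholders standing in for whatever checks actually matter in your domain.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One question plus the checks we care about, not an exact expected answer."""
    question: str
    expected_candidates: set[str]   # companies a good answer should surface
    key_attributes: set[str]        # e.g. {"pricing", "free tier"}

def score_run(case: EvalCase, answer: dict) -> float:
    """Score one agent run: mostly binary checks plus a small penalty."""
    found = set(answer.get("candidates", []))
    compared = set(answer.get("attributes_compared", []))
    sources = answer.get("sources", [])

    correct_candidates = 1.0 if case.expected_candidates <= found else 0.0
    key_comparison = 1.0 if case.key_attributes <= compared else 0.0

    # Small penalty per source flagged as irrelevant (by a reviewer or an LLM judge).
    irrelevant = sum(1 for s in sources if not s.get("relevant", True))
    return max(0.0, correct_candidates + key_comparison - 0.1 * irrelevant)

def eval_agent(dataset: list[EvalCase], run_agent) -> float:
    """Average score across the dataset: one number to track over time."""
    scores = [score_run(case, run_agent(case.question)) for case in dataset]
    return sum(scores) / len(scores)
```

The code itself isn't the point. Each check encodes a piece of domain judgment, and the average gives you a single number you can compare across prompt, model, and flow changes.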
This is where it gets unintuitive. Some prompt changes that made the output look much better actually hurt performance: the agent would over-index on formatting, sound more confident than it should, and pad the structure with weak or irrelevant sources, and the eval score went down. Other changes that made the output look worse at a glance actually improved the score: less polished but more direct answers, better choices, better comparisons, and fewer mistakes.
That's when it clicks. You stop asking "does this look good?" and start asking "did this actually work?" At that point you're not really iterating on the agent, you're iterating on the eval.
The TDD Analogy
The closest analogy I've found is TDD: the eval is the tests, and the agent is the implementation. You're not really building an agent, you're building something that can pass the eval.
I'm not sure you can or should fully define the eval before building anything, and pure TDD for agents might be too rigid. But moving in that direction, thinking eval-first, has been one of the highest leverage shifts for us.
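As a sketch of what "eval as tests" might mean in practice, building on the scorer above: the eval suite stays fixed while agent variants (new prompts, models, flows) compete to pass it. The baseline number and the `should_ship` helper here are made up for illustration.

```python
BASELINE_SCORE = 0.72  # illustrative: score of the currently shipped agent

def should_ship(candidate_agent, dataset) -> bool:
    """Ship a candidate only if it beats the current baseline on the same eval."""
    return eval_agent(dataset, candidate_agent) > BASELINE_SCORE
```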
What Actually Compounds
Models will get better, prompts will change, and architectures will get rewritten. What actually compounds is your understanding of what good looks like, and that gets captured in the eval. Once you have that, you can swap models, change flows, and move faster without guessing. The eval is the only part of the system that persists.
Closing
The evaluation framework is the valuable part. It's the product, and everything else is interchangeable. Your eval is your agent.