Agent Evaluation

Domain-specific test suites and eval-driven development practices that validate AI agent output before it reaches production.

Last updated: 2026-04-23

Overview

Generic NLP metrics (BLEU, ROUGE) don’t work for domain-specific agents. A response can be semantically similar to the correct answer but contain completely wrong numbers. In high-stakes domains like financial services, a single wrong figure destroys user trust permanently.

The Fintool approach: maintain ~2,000 domain-specific test cases across known failure categories. Every model, prompt, or skill change gets evaluated against this set. PRs are blocked if eval score drops more than 5%.
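The "block PRs on a >5% eval regression" rule can be sketched as a simple CI gate. This is a hypothetical illustration, not Fintool's actual tooling; the function name and the relative-drop definition are assumptions.

```python
def eval_gate(baseline_score: float, new_score: float, max_drop: float = 0.05) -> bool:
    """Return True if the change may merge; False if the eval regression blocks it.

    max_drop is the maximum tolerated *relative* drop (5% by default).
    """
    if baseline_score <= 0:
        return True  # no meaningful baseline to regress against
    drop = (baseline_score - new_score) / baseline_score
    return drop <= max_drop

# 0.92 -> 0.89 is a ~3.3% relative drop: within tolerance, PR merges
assert eval_gate(0.92, 0.89)
# 0.92 -> 0.86 is a ~6.5% relative drop: PR blocked
assert not eval_gate(0.92, 0.86)
```

In CI this would run after the full eval suite, comparing the candidate branch's aggregate score against the stored main-branch baseline.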

Key Points

  • Eval-driven development: every skill has a companion eval suite. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity, and SBC add-backs (models forget this constantly)
  • Domain-specific failure categories at Fintool:
    • Ticker disambiguation: “Apple” → AAPL not APLE; “Delta” → airline or options term; handles ticker renames (FB → META after Meta’s Oct 2021 rebrand)
    • Fiscal period resolution: 200+ test cases for period extraction; “last quarter” on Jan 15 means different things for Apple, Microsoft, and calendar-year companies
    • Numeric precision: $4.2B vs $4,200M vs 4.2 (missing units = fail); unit inference, magnitude normalization, currency handling
    • Adversarial grounding: inject fake numbers into context (fake analyst report: “Apple revenue 94B”). 50 hallucination-resistance test cases
  • Braintrust used for experiment tracking and eval result storage
  • Building eval datasets is harder than building the agent — this is the actual moat, not the model calls
  • Eval progression pattern (observed across multiple AI product teams by Teresa Torres): teams universally start simple and graduate to sophistication — Google Sheets with thumbs up/down → manual comparison in Docs → code assertions → LLM-as-judge. Starting with spreadsheets is fine; the important thing is starting at all. See teresa-torres
  • LLM positivity bias: models stretch insufficient data into confident answers rather than saying “I don’t know.” Combat via architectural uncertainty management — have the model critique its own understanding, restrict it to provided context only, and explicitly test refusal on out-of-scope queries
  • Auto-filed GitHub issues: production errors automatically create issues with conversation ID, traceback, and links to Braintrust traces and Temporal workflows

Model Jaggedness: The Verifiable Domain Problem

Karpathy’s framing of why models behave inconsistently: they improve only where RL training has verifiable ground truth. Outside that boundary, they’re stuck — optimized on whatever data happened to be in training, never updated.

Canonical example: “Why don’t scientists trust atoms? Because they make everything up.” Every state-of-the-art model gives this same joke, unchanged from 5 years ago, despite massive capability improvements in code generation. The joke is outside the RL loop, so it doesn’t improve.

The vivid description: “I feel like I’m talking to an extremely brilliant PhD student who’s been a systems programmer their whole life — and a 10-year-old. Simultaneously.” Humans have some jaggedness too, but far less than models. In humans, high capability in one domain usually correlates with general capability. In current models, you can be on rails (verifiable domain → near-superhuman) or off rails (unverifiable → meanders) with a sharp discontinuity.

Practical implication for agent evaluation: if you can’t define a metric, you can’t improve it. This is also why auto-research works for hyperparameter tuning (clear loss metric) but fails for open-ended product decisions (no ground truth). Build evals before building skills.

The Verifier’s Rule Applied to Agent Loops

Jason Wei’s verifier’s rule (coined for model training) also governs agent runtime behavior: if a task is solvable and easy to verify, you can run an agent in a loop with feedback and it will eventually get there. Verifiability isn’t just a training signal — it determines how reliably an agent can self-correct in production.
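The runtime side of the verifier's rule is just an attempt/verify loop with failure feedback. A minimal sketch, where `attempt_fn` and `verify_fn` are placeholders for the agent call and the mechanical check (all names here are illustrative):

```python
def run_with_verifier(task, attempt_fn, verify_fn, max_iters=5):
    """Loop an agent step until the verifier accepts, feeding each failure
    back into the next attempt as context.

    verify_fn returns (accepted, feedback). If the loop never converges,
    return None so the caller can escalate to human review.
    """
    feedback = None
    for _ in range(max_iters):
        output = attempt_fn(task, feedback)
        ok, feedback = verify_fn(output)
        if ok:
            return output
    return None
```

The loop only helps when `verify_fn` is cheap and trustworthy, which is exactly the rule's claim: verifiability determines whether self-correction converges or just burns iterations.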

Tasks within the same domain sit at different verifiability positions. Legal example (Lauritzen, Legora):

  • Checking definitions in a contract → mechanically verifiable → agents excel
  • Writing a contract → hard to verify (only courts determine if the language holds) → agents produce drafts, need human review
  • Litigation strategy → unverifiable (5 lawyers, 5 answers) → near-human judgment required

Three moves to handle low-verifiability tasks:

  1. Proxy verification: when direct ground truth isn’t available, find a proxy. Compare new contracts against known-good golden contracts as a stand-in for correctness.
  2. Decompose: break the hard-to-verify task into several verifiable subtasks (formatting, definition linting, clause structure) that can be checked mechanically.
  3. Guardrails: constrain what the agent can do, so the verifiable subset is what runs.
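Move 2 (decompose) can be made concrete: even though "is this contract correct?" is unverifiable, each subtask gets its own mechanical check and the composite report tells a human reviewer exactly what passed. The specific checks below are hypothetical stand-ins for formatting, definition linting, and clause structure:

```python
import re

def check_formatting(doc: str) -> bool:
    # no stray whitespace or tab characters
    return doc == doc.strip() and "\t" not in doc

def check_definitions(doc: str) -> bool:
    # crude lint: if defined terms are referenced, a Definitions section must exist
    return "Definitions" in doc or "defined term" not in doc.lower()

def check_clause_numbering(doc: str) -> bool:
    # top-level clauses must be numbered 1., 2., 3., ... with no gaps
    clauses = re.findall(r"^(\d+)\.", doc, re.MULTILINE)
    return clauses == [str(i) for i in range(1, len(clauses) + 1)]

def verify_draft(doc: str) -> dict[str, bool]:
    """Run every mechanically verifiable subtask; the unverifiable whole
    (does the language hold up in court?) stays with the human reviewer."""
    return {
        "formatting": check_formatting(doc),
        "definitions": check_definitions(doc),
        "clause_numbering": check_clause_numbering(doc),
    }
```

Move 3 (guardrails) then falls out naturally: only let the agent auto-apply changes in subtasks where the corresponding check passes.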

See agent-human-collaboration for the full trust/control framework that builds on this.

Connections

  • context-normalization — bad normalization (wrong fiscal period, garbled table) is what evals catch
  • claude-code-skills — each skill gets its own companion eval suite
  • agent-sandbox — agents run inside sandboxes during eval too
  • auto-research — auto-research is only viable where evals are objective; same principle
  • agent-human-collaboration — verifier’s rule applied to agent loops; proxy verification and decomposition strategies
  • andrej-karpathy — model jaggedness framing; verifiable vs. unverifiable domain insight
  • jacob-lauritzen — verifier’s rule applied to legal domain; per-task verifiability spectrum

Sources