⚡ Promptolis Original · AI Agents & Automation

🧪 Agent Eval Harness Builder

Designs an evaluation harness for your agent — with golden tests, regression alarms, and trajectory-level metrics that catch silent quality regressions before customers do.

⏱️ 6 min to set up 🤖 ~140 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most teams ship agents with no evals beyond 'we tried it and it worked.' Then the model upgrades, a tool description changes, or a prompt edit ships — and quality silently drops 15%. The first signal is a customer complaint.

This Original designs the eval harness — golden test cases, success rubrics, trajectory-level metrics, regression alarms. Not generic 'add more tests' advice; an actual harness designed for YOUR agent.

Calibrated to 2026 agent-eval reality: trajectory evals (did the agent take a reasonable path?) matter more than pure output evals. Built for Claude Code, MCP agents, and custom orchestrations.

The prompt

Promptolis Original · Copy-ready
<role>
You are an LLM evaluation engineer with 4+ years building eval harnesses for production agents on Claude Code, custom orchestrators, MCP-based systems, and OpenAI Apps. You have shipped 30+ eval suites that caught regressions before customers did. You think in trajectories, not just outputs. You are direct. You will tell a builder their evals are too synthetic, too output-focused, or too small to detect 15% regressions. You refuse to recommend 'add more tests' as generic advice — you will design specific tests for specific failure modes.
</role>

<principles>
1. Trajectory evals beat output evals. Score the path the agent took, not just the destination.
2. Golden cases must come from real production failures, not synthetic edge cases.
3. Three eval layers: unit (tool selection), integration (multi-step), end-to-end (output quality + cost). Each catches different bugs.
4. Eval judges must be a different model than production. Same-model judging produces blind spots.
5. Cost-per-trajectory is part of the metric. A correct-but-expensive agent is a regression.
6. Baseline before changing anything. The first eval run defines the reference numbers.
7. Run evals on every change. Prompt, tool description, model — all of it triggers re-eval.
</principles>

<input>
<agent-purpose>{what the agent does end-to-end}</agent-purpose>
<agent-architecture>{single agent / multi-agent / subagents — describe briefly}</agent-architecture>
<known-failure-modes>{what has gone wrong, even rarely — be specific}</known-failure-modes>
<production-data-availability>{can you log + replay production trajectories? do you have user feedback?}</production-data-availability>
<scale>{trajectories/day, latency budget, cost budget per trajectory}</scale>
<launch-stage>{pre-launch / shadow / production with N customers / production at scale}</launch-stage>
<success-criteria>{what does 'this agent works' mean to your team? be specific}</success-criteria>
<existing-evals>{any tests you have today — describe even if minimal}</existing-evals>
</input>

<output-format>
# Eval Harness Design: [Agent name]

## Eval Strategy Overview
What layers (unit/integration/e2e), what scale, what cadence. Why this design for your agent.

## Golden Test Set
10-30 specific test cases. For each: input, expected trajectory pattern, expected output, scoring rubric. Mark which came from real failures vs synthetic.

## Trajectory Evaluator
How to score the path the agent took, not just the output. Specific signals to extract from tool-call logs.

## Output Evaluator
The rubric for final-output quality. LLM-as-judge prompt or programmatic check, depending on output type.

## Cost & Latency Tracking
What to track per trajectory. Thresholds for regression alarms.

## Regression Alarm Setup
What threshold drops trigger an alert. How to distinguish noise from real regression.

## Eval Cadence
When to run: every prompt edit, every model change, weekly hygiene, etc.

## Implementation Skeleton
File structure, key code components, how to wire into CI/CD.

## What This Eval Will and Won't Catch
Honest tradeoffs. The known blind spots.

## Migration Path
If existing evals exist: how to transition without breaking current pipelines.

## Key Takeaways
4-6 bullets — for the team's eval playbook.
</output-format>

<auto-intake>
If input incomplete: ask for agent purpose, architecture, known failures, production data availability, scale, launch stage, success criteria, existing evals.
</auto-intake>

Now, design the eval harness:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<agent-purpose>Customer-support triage agent for a B2B SaaS (project management tool). Reads incoming support tickets, classifies into 6 categories (bug, feature-request, account, billing, integration, other), drafts a first-response, and routes to the right human team.</agent-purpose>
<agent-architecture>Single Claude Sonnet agent with 4 tools: read_kb_article, search_past_tickets, lookup_customer_account, send_draft_to_queue.</agent-architecture>
<known-failure-modes>1. Mis-routing 'bug' as 'feature-request' when customer phrases it constructively. 2. Hallucinating customer account info when lookup_customer_account fails silently. 3. Drafts that include incorrect KB article links. 4. Occasional empty draft when ticket is too short.</known-failure-modes>
<production-data-availability>Yes, we log all trajectories + we have human-team feedback ('this routing was wrong').</production-data-availability>
<scale>~600 tickets/day. Latency budget: 2 minutes from ticket arrival to draft + route. Cost budget: $0.05/ticket.</scale>
<launch-stage>In production at scale (60% of all tickets routed by agent, rest go to humans first). Want better evals before increasing to 90%.</launch-stage>
<success-criteria>Routing accuracy >92%. Draft quality (human-reviewer accept rate without major edits) >80%. Hallucination rate (incorrect facts in draft) <2%.</success-criteria>
<existing-evals>Internal QA team manually reviews 20 trajectories/week. We have a Sheet of 'bad cases' the team flagged but no automated eval pipeline.</existing-evals>
🤖 Output

Eval Harness Design: Customer-Support Triage Agent

Eval Strategy Overview

Three-layer harness with shadow-mode regression detection:

  • Unit layer: Tool-selection correctness on 50 cases (does the agent pick lookup_customer_account when account info is needed, etc.)
  • Integration layer: Multi-step trajectory scoring on 80 cases (does the agent search KB before drafting, does it verify customer state before personalizing, etc.)
  • End-to-end layer: Output quality on 100 golden cases (correct routing + draft accept rate + hallucination check)
  • Shadow eval: the candidate agent runs in parallel with the live system; the eval pipeline diffs its trajectories against current production for any prompt/model change.

Why this design: your known failure modes (mis-routing, hallucinated account info, incorrect KB links, empty drafts) each surface at a different layer, so any single-layer eval would miss most of them. With shadow eval added, you catch model-version regressions without waiting for the manual QA cycle.

Golden Test Set

Source distribution: 60% from your existing 'bad cases' Sheet (real failures), 40% from sampled production tickets (success cases for baseline). Total: 230 cases (50 unit + 80 integration + 100 e2e).

E2E sample (10 of 100, illustrative):

Each case specifies the input ticket gist, source, expected category, expected trajectory, expected output, and scoring rubric.

1. Input: "Hey, the export-to-CSV button gives me a 500 error every time. Tried in Chrome and Firefox." Source: bad case from Sheet. Expected category: bug. Expected trajectory: search_past_tickets → read_kb_article('export-csv') → draft → route. Expected output: routes to bug team; draft references known issue if one exists. Rubric: hallucination check on KB link, routing-correct binary, draft-quality 1-5.

2. Input: "It would be great if the timeline view showed dependencies as arrows. We'd use this every day." Source: bad case. Expected category: feature-request. Expected trajectory: search_past_tickets (check if duplicate FR) → draft → route. Expected output: routes to product team; draft acknowledges timeline. Rubric: routing-correct, duplicate-detection signal, draft-tone-appropriate.

3. Input: "Can you delete my account please." Source: production sample. Expected category: account. Expected trajectory: lookup_customer_account → draft → route. Expected output: routes to account team; draft confirms identity-verification step. Rubric: routing-correct, draft includes verification handoff (no hallucinated account info).

4. Input: "Where's the option to integrate with Linear? I see Jira but not Linear." Source: production sample. Expected category: integration. Expected trajectory: search_past_tickets → read_kb_article('integrations') → draft → route. Expected output: routes to integrations team; draft includes Linear roadmap status if in KB. Rubric: hallucination check on integration-availability claim, routing-correct.

5. Input: "Invoice for March was $1,200, expected $800. Help." Source: production sample. Expected category: billing. Expected trajectory: lookup_customer_account → draft → route. Expected output: routes to billing team; draft acknowledges discrepancy without quoting hallucinated numbers. Rubric: hallucination check (no invented invoice details), routing-correct, draft-empathy 1-5.

6. Input: "This product is terrible! Cancel everything!" Source: bad case. Expected category: account. Expected trajectory: lookup_customer_account → search_past_tickets → draft → route. Expected output: routes to account team with high-priority flag; draft is empathetic + de-escalating. Rubric: empathy 1-5, routing-correct, priority-flag-set binary.

7. Input: "Hi" (one word). Source: bad case (empty-draft case). Expected category: other. Expected trajectory: clarification draft, no routing. Expected output: draft asks a clarifying question; ticket held in 'awaiting-clarification'. Rubric: non-empty-draft binary, clarification quality 1-5.

8. Input: "The Slack integration randomly stops sending notifications about once a week." Source: production. Expected category: bug or integration. Expected trajectory: search_past_tickets → read_kb_article → draft → route. Expected output: routes to integrations team (judgment call: integration-specific bug); draft references reset steps if in KB. Rubric: routing-correct (either bug or integration accepted), KB-link validity.

9. Input: "Can you send me an invoice from Q4 2024?" Source: production. Expected category: billing. Expected trajectory: lookup_customer_account → draft → route. Expected output: routes to billing; draft promises a follow-up timeline. Rubric: routing-correct, no hallucinated invoice number, latency tracking.

10. Input: "Just wanted to say I love the new dashboard! 🎉" Source: production. Expected category: other. Expected trajectory: direct draft (no tools needed). Expected output: routes to 'other'/feedback queue; draft thanks customer. Rubric: routing-correct, latency-low (should be <30s), tool-use-minimal binary.

Full set of 100 e2e cases mirrors your real category distribution: 30 bug, 25 feature-request, 15 account, 10 billing, 15 integration, 5 other.

Unit layer cases (50):

  • Each case is a 1-2 turn excerpt where the agent should pick a specific tool. Score: did it pick the correct tool first?
  • Examples: 'Customer asks about plan tier' → first tool should be lookup_customer_account, not search_past_tickets.
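To make the unit layer concrete, here is a minimal sketch of what run_unit_evals.py could look like. The case schema and the agent wrapper are assumptions: it expects each unit_cases.jsonl line to carry the ticket text plus the expected first tool, and it takes any callable you supply that runs the agent on a ticket and returns its ordered tool-call names.

```python
# Minimal unit-layer runner sketch. Assumed case schema (not a fixed format):
#   {"id": "unit-001", "ticket": "...", "expected_first_tool": "lookup_customer_account"}
import json
from typing import Callable

def run_unit_evals(path: str, agent_fn: Callable[[str], list[str]]) -> float:
    """agent_fn is your own wrapper: ticket text in, ordered tool names out."""
    with open(path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    passed = 0
    for case in cases:
        tool_calls = agent_fn(case["ticket"])
        first_tool = tool_calls[0] if tool_calls else None
        if first_tool == case["expected_first_tool"]:
            passed += 1
        else:
            print(f"FAIL {case['id']}: expected {case['expected_first_tool']}, got {first_tool}")
    accuracy = passed / len(cases)
    print(f"Unit tool-selection accuracy: {accuracy:.1%}")
    return accuracy
```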

Integration layer cases (80):

  • Each case is a multi-turn trajectory. Score the trajectory pattern, not the output.
  • Examples: 'Did the agent search KB before drafting a technical answer?' (yes/no), 'Did the agent verify account before quoting plan-specific info?' (yes/no).

Trajectory Evaluator

Programmatic checks (cheap, deterministic):

  • Tool-call count per ticket. Median + p95. Regression: median doubles or p95 triples.
  • Tool-call sequence pattern matching. Define expected sequences per category. Score % of trajectories matching ≥1 valid sequence.
  • KB-link extraction from drafts → verify links exist (404 check via cron). Hallucinated KB links are a critical regression.
  • Latency per trajectory. Target: p95 <60s, p99 <120s.
  • Cost per trajectory. Target: median <$0.04, p95 <$0.08.
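As an illustration of the programmatic checks above, here is a sketch of the sequence-pattern and tool-call-count signals. The trajectory record shape and the per-category valid sequences are assumptions; adapt them to whatever your logs actually contain.

```python
# Sketch of programmatic trajectory checks. Assumed trajectory record:
#   {"category": "bug", "tool_calls": ["search_past_tickets", "read_kb_article", ...]}
from statistics import median, quantiles

# Example valid sequences per category; derive these from your own routing logic.
VALID_SEQUENCES = {
    "bug": [["search_past_tickets", "read_kb_article", "send_draft_to_queue"]],
    "billing": [["lookup_customer_account", "send_draft_to_queue"]],
    "other": [[], ["send_draft_to_queue"]],
}

def matches_valid_sequence(trajectory: dict) -> bool:
    # A trajectory passes if its tool-call sequence equals at least one valid pattern.
    return trajectory["tool_calls"] in VALID_SEQUENCES.get(trajectory["category"], [])

def trajectory_signals(trajectories: list[dict]) -> dict:
    counts = [len(t["tool_calls"]) for t in trajectories]
    return {
        "pattern_match_rate": sum(map(matches_valid_sequence, trajectories)) / len(trajectories),
        "tool_calls_median": median(counts),
        "tool_calls_p95": quantiles(counts, n=100)[94],  # 95th percentile
    }
```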

LLM-as-judge checks (use Claude Opus, NOT same model as production Sonnet):

  • Trajectory reasonableness: 'Did the agent take a reasonable path to the answer?' Score 1-5 with rubric. Calibrated against 30 manually-scored trajectories.
  • Hallucination check: 'Does the draft contain any factual claim not supported by tool outputs?' Binary. This is the highest-stakes check.
  • Routing-correct check: agreement between agent's category and ground-truth category from human review.
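A minimal LLM-as-judge sketch using the Anthropic Python SDK follows. The judge model ID, rubric wording, and the way the trajectory is serialized into the prompt are all assumptions to replace with your own calibrated versions.

```python
# LLM-as-judge sketch for trajectory reasonableness + hallucination.
import json
import anthropic

JUDGE_MODEL = "claude-opus-4-20250514"  # assumption: use your actual Opus-class model ID

RUBRIC = """You are evaluating a support-triage agent's trajectory.
Given the ticket, the tool calls (with outputs), and the final draft, return JSON:
{"reasonableness": 1-5, "hallucination": true/false, "category": "<agent's category>"}.
Hallucination = any factual claim in the draft not supported by the tool outputs."""

def judge_trajectory(client: anthropic.Anthropic, ticket: str, trajectory: str, draft: str) -> dict:
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nTicket:\n{ticket}\n\nTrajectory:\n{trajectory}\n\nDraft:\n{draft}",
        }],
    )
    return json.loads(response.content[0].text)  # assumes the judge returns bare JSON
```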

Output Evaluator

Routing accuracy (programmatic): Compare agent's category to ground-truth from human-team feedback. Target ≥92%, alarm at <88%.

Draft quality (LLM-as-judge with Claude Opus):

Prompt: 'You are evaluating a customer-support triage draft. Rate 1-5 on: (a) addresses the customer's actual question, (b) tone-appropriate, (c) factually grounded in tool outputs, (d) actionable next step. Output JSON: {addresses: 1-5, tone: 1-5, factual: 1-5, actionable: 1-5, overall: 1-5, would_a_human_send_this_with_minor_edits: bool}.' Calibrate against 50 human-reviewed drafts.

Hallucination rate (programmatic + LLM):

  • Programmatic: extract any quoted numbers, account-IDs, KB-URLs from draft. Verify against tool outputs.
  • LLM: 'Does the draft contain any factual claim NOT in the tool outputs above?' Run on every eval trajectory.
  • Target: <2% hallucination rate. Alarm at >3%.

Empty-draft detection (programmatic): any draft under 30 characters, or one matching boilerplate templates like 'I'll get back to you', is flagged. Target: <0.5%.
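A sketch of the programmatic side of these checks, combining claim extraction for the hallucination check with empty-draft detection. The regexes, the 30-character threshold, and the boilerplate list are illustrative starting points, not tuned values.

```python
# Programmatic hallucination + empty-draft checks (illustrative patterns).
import re

URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")
BOILERPLATE = {"i'll get back to you", "thanks for reaching out."}

def hallucinated_spans(draft: str, tool_outputs: str) -> list[str]:
    """Return URLs and numbers in the draft that never appear in any tool output."""
    claims = URL_RE.findall(draft) + NUMBER_RE.findall(draft)
    return [c for c in claims if c not in tool_outputs]

def is_empty_draft(draft: str) -> bool:
    text = draft.strip().lower()
    return len(text) < 30 or text in BOILERPLATE
```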

Cost & Latency Tracking

Track per-trajectory:

  • Tokens in / tokens out / total cost
  • Wall-clock latency from ticket arrival to draft+route
  • Tool-call count
  • Number of LLM turns

Dashboard: weekly aggregates with p50/p95/p99. Alarm thresholds:

  • Median cost increases >20% week-over-week
  • p95 latency exceeds budget for >5% of trajectories
  • Tool-call count median increases >50% (signals tool-misselection-then-recovery patterns)
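A sketch of the weekly aggregation behind these thresholds, assuming each logged trajectory record carries cost and latency fields (field names here are an assumption):

```python
# Weekly cost/latency aggregation + the week-over-week cost alarm.
from statistics import median, quantiles

def weekly_aggregate(trajectories: list[dict]) -> dict:
    costs = [t["cost_usd"] for t in trajectories]
    latencies = [t["latency_s"] for t in trajectories]
    return {
        "cost_p50": median(costs),
        "cost_p95": quantiles(costs, n=100)[94],
        "latency_p95": quantiles(latencies, n=100)[94],
        "latency_p99": quantiles(latencies, n=100)[98],
    }

def cost_alarm(this_week: dict, last_week: dict, max_increase: float = 0.20) -> bool:
    """True if median cost grew more than 20% week-over-week."""
    return this_week["cost_p50"] > last_week["cost_p50"] * (1 + max_increase)
```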

Regression Alarm Setup

On every prompt edit, model change, or tool definition change: automated eval run on the 230-case golden set. Block deploy if:

  • Routing accuracy drops >2 percentage points vs baseline
  • Hallucination rate increases >1 percentage point
  • Median cost increases >20%
  • Trajectory-reasonableness score drops >0.3 points (on 5-point scale)
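A minimal sketch of the regression gate (the shape of regression_alarm.py) is below. It assumes baseline and candidate metrics are stored as flat JSON dicts; the metric names are assumptions to align with whatever your runners emit.

```python
# Compare a candidate eval run against the frozen baseline; non-zero exit blocks deploy.
BLOCKING_THRESHOLDS = {
    "routing_accuracy": -0.02,          # max allowed drop, absolute
    "hallucination_rate": +0.01,        # max allowed increase, absolute
    "median_cost_usd": +0.20,           # max allowed increase, relative
    "trajectory_reasonableness": -0.3,  # max allowed drop on the 5-point scale
}

def regressions(baseline: dict, candidate: dict) -> list[str]:
    failures = []
    for metric, limit in BLOCKING_THRESHOLDS.items():
        delta = candidate[metric] - baseline[metric]
        if metric == "median_cost_usd":
            delta = delta / baseline[metric]  # cost is judged as a relative change
        if (limit < 0 and delta < limit) or (limit > 0 and delta > limit):
            failures.append(f"{metric}: {baseline[metric]} -> {candidate[metric]}")
    return failures

if __name__ == "__main__":
    import json, sys
    base, cand = (json.load(open(p)) for p in sys.argv[1:3])
    failed = regressions(base, cand)
    print("\n".join(failed) or "no regressions")
    sys.exit(1 if failed else 0)
```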

Weekly hygiene run on a fresh sample of 200 production trajectories (sampled, anonymized). Catches drift not visible in fixed golden set.

Distinguishing noise from regression: all metrics tracked with 7-day rolling windows. A single bad day can be sampling variance; a 3-day-rolling drop of 2+ points is signal.

Eval Cadence

  • Pre-deploy of any prompt/tool/model change: full 230-case golden run + go/no-go criteria
  • Weekly: 200-case fresh production sample + drift dashboard
  • Monthly: review and refresh 10-20% of golden cases (rotate stale ones, add new failure modes)
  • Quarterly: full audit of eval harness itself — are you measuring what matters?

Implementation Skeleton

/evals
  /golden_cases.jsonl       (230 cases with ground truth)
  /unit_cases.jsonl         (50 tool-selection cases)
  /trajectory_cases.jsonl   (80 integration cases)
  /e2e_cases.jsonl          (100 end-to-end cases)
/runners
  run_unit_evals.py         (replays unit cases, scores tool selection)
  run_trajectory_evals.py   (replays integration cases, scores path pattern)
  run_e2e_evals.py          (full agent invocation, scores output via judge)
  run_shadow_eval.py        (samples live traffic, runs candidate agent in parallel, diffs)
/judges
  judge_routing.py          (programmatic: agent_category vs ground_truth)
  judge_draft.py            (Claude Opus LLM-as-judge for draft quality)
  judge_hallucination.py    (programmatic + LLM hallucination check)
  judge_trajectory.py       (Claude Opus trajectory reasonableness)
/dashboards
  weekly_drift.sql          (rolling 7-day metric query)
  regression_alarm.py       (post-deploy diff vs baseline)
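For reference, one possible shape of a golden-case record in e2e_cases.jsonl; the field names are a suggestion, not a required schema, so adapt them to whatever your runners expect.

```python
import json

# Illustrative golden-case record for evals/e2e_cases.jsonl.
example_case = {
    "id": "e2e-001",
    "source": "bad_case_sheet",  # vs. "production_sample"
    "ticket": "Hey, the export-to-CSV button gives me a 500 error every time.",
    "expected_category": "bug",
    "expected_trajectory": ["search_past_tickets", "read_kb_article", "send_draft_to_queue"],
    "checks": ["routing_correct", "kb_link_valid", "hallucination", "draft_quality"],
}

with open("evals/e2e_cases.jsonl", "a") as f:
    f.write(json.dumps(example_case) + "\n")
```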

Wire into CI:

  • GitHub Action triggered on PR: full golden eval run, post results as PR comment, block merge if regression thresholds tripped.
  • Weekly cron: hygiene run on the fresh 200-case production sample.
  • Webhook from production: shadow eval continuous run on 5% sampled traffic.

What This Eval Will and Won't Catch

Will catch:

  • Routing-accuracy regressions of ≥2 points
  • Hallucination-rate increases of ≥1 point
  • Cost regressions of ≥20%
  • Tool-misselection regressions on the 50 unit cases
  • Empty-draft regressions
  • Specific known failure modes from your bad-case Sheet

Will NOT catch (known blind spots):

  • Long-tail edge cases not in your 230 golden cases (hence weekly fresh-sample run)
  • Quality regressions that maintain routing accuracy but produce subtly worse drafts (LLM judge has known calibration limits)
  • Drift in customer expectations over time (your bar shifts; eval bar doesn't, until you refresh cases)
  • Performance issues caused by the 4 tools' downstream APIs (tool-side problems, not agent problems)
  • New failure modes introduced by major architecture changes

Migration Path

1. Week 1: Implement runners + judges. Run baseline eval on current production agent. Freeze numbers.

2. Week 2: Wire CI. Block any PR that regresses >2 points on golden set.

3. Week 3: Implement shadow eval. Run on 1% of traffic.

4. Week 4: Scale shadow to 5% of traffic. Set up dashboards.

5. Week 5: Replace your manual 20-trajectory/week QA with the new harness's weekly hygiene run + a smaller manual review (5/week) of eval-flagged cases.

6. Week 6+: Use the regression alarms to safely scale routing from 60% → 75% → 90% of tickets.

Do NOT increase production routing share until baseline is measured AND CI block is in place. Otherwise you're flying blind on the way up.

Key Takeaways

  • Your known failure modes each need a different eval layer: mis-routing → e2e routing accuracy, hallucinated account info → trajectory + output hallucination checks, incorrect KB links → programmatic link validation, empty drafts → programmatic length check.
  • Use Claude Opus as judge, not Sonnet (your production model). Same-model judging has blind spots on subtle quality regressions.
  • Source 60% of golden cases from real failures, not synthetic edge cases. Your bad-case Sheet is gold.
  • Implement shadow eval BEFORE scaling routing share. It's the only way to catch regressions that pass golden eval but break on real traffic distribution.
  • Cost is part of the eval. A correct-but-expensive agent is a regression; track p50/p95 cost from day one.
  • Refresh 10-20% of golden cases monthly. Static eval sets decay as the agent and your customer base evolve.

Common use cases

  • Engineer about to ship an agent to production and worried about silent regressions
  • Team that already has an agent in prod and gets occasional complaints — needs to add evals retroactively
  • Builder evaluating a model upgrade (Opus 4 → Opus 4.5) and needs a rigorous diff
  • Solo dev hitting 'works on my prompts but fails on real users' problem
  • PM writing a launch criteria doc for an agent and needs measurable success thresholds

Best AI model for this

Claude Opus 4. Eval design requires reasoning about success criteria, edge cases, and metric design — exactly Claude's strengths. GPT-5 in ChatGPT is the second-best choice.

Pro tips

  • Trajectory evals beat output evals. Two trajectories can produce the same final output but one is dangerously fragile. Score the path, not just the destination.
  • Golden test cases must include the failures you've ACTUALLY seen. Synthetic edge cases miss the real pain points.
  • Separate eval models from production models. If you use Claude Opus for both production and eval judging, you'll mark some failures as successes.
  • Run evals on every prompt edit, every tool change, every model bump. Cheap to run, expensive to skip.
  • Track three layers: unit (tool selection correctness), integration (multi-step flow correctness), end-to-end (final output quality). Different bugs surface at different layers.
  • Cost is part of the eval. An agent that produces correct output but costs 3× more is a regression. Always include cost-per-trajectory.
  • Baseline first. Run the eval suite on your current agent and freeze numbers. Future runs compare to this baseline.

Customization tips

  • Be honest about your failure modes. The eval harness is only as good as the failures you can name. If 'we don't really know what fails' — say that, and the harness will include exploratory eval.
  • Specify whether you can replay production trajectories. This dramatically changes what's possible (shadow eval, regression detection, etc.).
  • If pre-launch, ask for the Pre-Launch Mode variant — it adds launch-criteria thresholds rather than regression-from-baseline thresholds.
  • Calibrate the LLM-as-judge prompts on your real data. Run the judge on 30-50 manually-scored cases, check correlation with your scores, and iterate the rubric until correlation > 0.7 (a minimal check is sketched after these tips).
  • Don't skip the cost tracking. Cost regressions are the most common overlooked regression — the agent still works, but each call costs 3× more.
  • Re-run quarterly even if no code changes. Your customer base shifts, your bar shifts, eval cases go stale. The harness is a living artifact.
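For the judge-calibration tip above, a quick correlation check might look like the sketch below; it assumes parallel lists of 1-5 scores for the same manually reviewed cases and Python 3.10+ for statistics.correlation.

```python
# Compare human scores with judge scores on the same 30-50 calibration cases.
from statistics import correlation  # Pearson r; Python 3.10+

def judge_is_calibrated(human_scores: list[float], judge_scores: list[float],
                        threshold: float = 0.7) -> bool:
    r = correlation(human_scores, judge_scores)
    print(f"Human-vs-judge correlation: {r:.2f}")
    return r >= threshold
```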

Variants

Production Agent Mode

For agents already in production — adds shadow-eval setup that runs in parallel with prod traffic.

Pre-Launch Mode

For agents not yet shipped — adds launch-criteria thresholds and rollout decision rubric.

Model Upgrade Mode

Specifically for evaluating model swaps (Opus 4 → 4.5, Sonnet → Opus, etc.) — adds A/B trajectory comparison.

MCP Server Eval Mode

For evaluating an MCP server — adds tool-level conformance tests and permission-boundary tests.

Frequently asked questions

How do I use the Agent Eval Harness Builder prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Agent Eval Harness Builder?

Claude Opus 4. Eval design requires reasoning about success criteria, edge cases, and metric design — exactly Claude's strengths. GPT-5 in ChatGPT is the second-best choice.

Can I customize the Agent Eval Harness Builder prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: score trajectories, not just outputs (two trajectories can produce the same final output while one is dangerously fragile), and build golden test cases from the failures you've actually seen, since synthetic edge cases miss the real pain points.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals