⚡ Promptolis Original · AI Agents & Automation
🧪 Agent Eval Harness Builder
Designs an evaluation harness for your agent — with golden tests, regression alarms, and trajectory-level metrics that catch silent quality regressions before customers do.
Why this is epic
Most teams ship agents with no evals beyond 'we tried it and it worked.' Then the model upgrades, a tool description changes, or a prompt edit ships — and quality silently drops 15%. The first signal is a customer complaint.
This Original designs the eval harness — golden test cases, success rubrics, trajectory-level metrics, regression alarms. Not 'add more tests' generic advice; an actual designed harness for YOUR agent.
Calibrated to 2026 agent eval reality: trajectory evals (did the agent take a reasonable path?) matter more than pure output evals. Built for Claude Code, MCP agents, custom orchestrations.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<agent-purpose>Customer-support triage agent for a B2B SaaS (project management tool). Reads incoming support tickets, classifies into 6 categories (bug, feature-request, account, billing, integration, other), drafts a first-response, and routes to the right human team.</agent-purpose>
<agent-architecture>Single Claude Sonnet agent with 4 tools: read_kb_article, search_past_tickets, lookup_customer_account, send_draft_to_queue.</agent-architecture>
<known-failure-modes>1. Mis-routing 'bug' as 'feature-request' when customer phrases it constructively. 2. Hallucinating customer account info when lookup_customer_account fails silently. 3. Drafts that include incorrect KB article links. 4. Occasional empty draft when ticket is too short.</known-failure-modes>
<production-data-availability>Yes, we log all trajectories + we have human-team feedback ('this routing was wrong').</production-data-availability>
<scale>~600 tickets/day. Latency budget: 2 minutes from ticket arrival to draft + route. Cost budget: $0.05/ticket.</scale>
<launch-stage>In production at scale (60% of all tickets routed by agent, rest go to humans first). Want better evals before increasing to 90%.</launch-stage>
<success-criteria>Routing accuracy >92%. Draft quality (human-reviewer accept rate without major edits) >80%. Hallucination rate (incorrect facts in draft) <2%.</success-criteria>
<existing-evals>Internal QA team manually reviews 20 trajectories/week. We have a Sheet of 'bad cases' the team flagged but no automated eval pipeline.</existing-evals>
Eval Harness Design: Customer-Support Triage Agent
Eval Strategy Overview
Three-layer harness with shadow-mode regression detection:
- Unit layer: Tool-selection correctness on 50 cases (does the agent pick lookup_customer_account when account info is needed, etc.)
- Integration layer: Multi-step trajectory scoring on 80 cases (does the agent search KB before drafting, does it verify customer state before personalizing, etc.)
- End-to-end layer: Output quality on 100 golden cases (correct routing + draft accept rate + hallucination check)
- Shadow eval: Live system runs in parallel; eval pipeline diffs against current production for any prompt/model change.
Why this design: your known failure modes (mis-routing, hallucinated facts and KB links, empty drafts) each surface at a different layer, so a single-layer eval would miss most of them. With shadow eval added, you catch model-version regressions without waiting for the manual QA cycle.
Golden Test Set
Source distribution: 60% from your existing 'bad cases' Sheet (real failures), 40% from sampled production tickets (success cases for baseline). Total: 230 cases (50 unit + 80 integration + 100 e2e).
E2E sample (10 of 100, illustrative):
| # | Input ticket gist | Source | Expected category | Expected trajectory | Expected output | Scoring rubric |
|---|---|---|---|---|---|---|
| 1 | "Hey, the export-to-CSV button gives me a 500 error every time. Tried in Chrome and Firefox." | Bad case from Sheet | bug | search_past_tickets → read_kb_article('export-csv') → draft → route | Routes to bug team, draft references known issue if exists | Hallucination check on KB link, routing-correct binary, draft-quality 1-5 |
| 2 | "It would be great if the timeline view showed dependencies as arrows. We'd use this every day." | Bad case | feature-request | search_past_tickets (check if duplicate FR) → draft → route | Routes to product team, draft acknowledges timeline | Routing-correct, duplicate-detection signal, draft-tone-appropriate |
| 3 | "Can you delete my account please." | Production sample | account | lookup_customer_account → draft → route | Routes to account team, draft confirms identity verification step | Routing-correct, draft includes verification handoff (no hallucinated account info) |
| 4 | "Where's the option to integrate with Linear? I see Jira but not Linear." | Production sample | integration | search_past_tickets → read_kb_article('integrations') → draft → route | Routes to integrations team, draft includes Linear roadmap status if in KB | Hallucination check on integration availability claim, routing-correct |
| 5 | "Invoice for March was $1,200, expected $800. Help." | Production sample | billing | lookup_customer_account → draft → route | Routes to billing team, draft acknowledges discrepancy without quoting numbers from hallucination | Hallucination check (no invented invoice details), routing-correct, draft-empathy 1-5 |
| 6 | "This product is terrible! Cancel everything!" | Bad case | account | lookup_customer_account → search_past_tickets → draft → route | Routes to account team with high-priority flag, draft is empathetic + de-escalation | Empathy 1-5, routing-correct, priority-flag-set binary |
| 7 | "Hi" (one word) | Bad case (empty-draft case) | other | clarification draft, no routing | Draft asks clarifying question, ticket held in 'awaiting-clarification' | Non-empty-draft binary, clarification quality 1-5 |
| 8 | "The Slack integration randomly stops sending notifications about once a week." | Production | bug or integration | search_past_tickets → read_kb_article → draft → route | Routes to integrations team (judgment call: integration-specific bug), draft references reset steps if in KB | Routing-correct (allow either bug or integration as correct), KB-link-validity |
| 9 | "Can you send me an invoice from Q4 2024?" | Production | billing | lookup_customer_account → draft → route | Routes to billing, draft promises follow-up timeline | Routing-correct, no hallucinated invoice number, latency tracking |
| 10 | "Just wanted to say I love the new dashboard! 🎉" | Production | other | direct draft (no tools needed) | Routes to 'other'/feedback queue, draft thanks customer | Routing-correct, latency-low (should be <30s), tool-use-minimal binary |
Full set of 100 e2e cases mirrors your real category distribution: 30 bug, 25 feature-request, 15 account, 10 billing, 15 integration, 5 other.
Unit layer cases (50):
- Each case is a 1-2 turn excerpt where the agent should pick a specific tool. Score: did it pick the correct tool first?
- Examples: 'Customer asks about plan tier' → first tool should be lookup_customer_account, not search_past_tickets.
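A unit-layer scorer can be a few lines. The sketch below is illustrative: the case schema and field names (`expected_first_tool`, `trajectory`, `type`, `tool`) are assumptions, not the harness's actual format.

```python
def score_unit_case(expected_first_tool: str, trajectory: list[dict]) -> bool:
    """True if the FIRST tool the agent called matches the expected tool."""
    tool_calls = [s["tool"] for s in trajectory if s.get("type") == "tool_call"]
    return bool(tool_calls) and tool_calls[0] == expected_first_tool


def run_unit_evals(cases: list[dict]) -> float:
    """Fraction of unit cases where the agent chose the right tool first."""
    if not cases:
        return 0.0
    hits = sum(score_unit_case(c["expected_first_tool"], c["trajectory"]) for c in cases)
    return hits / len(cases)
```

The point of keeping this layer dumb and deterministic is that it runs in milliseconds on every PR, with no judge model in the loop.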
Integration layer cases (80):
- Each case is a multi-turn trajectory. Score the trajectory pattern, not the output.
- Examples: 'Did the agent search KB before drafting a technical answer?' (yes/no), 'Did the agent verify account before quoting plan-specific info?' (yes/no).
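One cheap way to score trajectory patterns is to join the tool-call sequence into a string and match it against per-category regexes. A minimal sketch, with illustrative patterns mirroring the expected trajectories in the table above (not a definitive pattern set):

```python
import re

# Per-category valid tool-call sequences; a trajectory passes if it
# matches at least one pattern for its category. Illustrative only.
VALID_PATTERNS = {
    "bug": [r"^search_past_tickets,read_kb_article(,\w+)*$"],
    "billing": [r"^lookup_customer_account(,\w+)*$"],
}


def trajectory_matches(category: str, tool_sequence: list[str]) -> bool:
    """True if the joined sequence matches any valid pattern for the category."""
    joined = ",".join(tool_sequence)
    return any(re.match(p, joined) for p in VALID_PATTERNS.get(category, []))
```

Regexes get unwieldy past a handful of tools; a small state machine per category is the next step up, but the regex version is enough to score "% of trajectories matching ≥1 valid sequence."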
Trajectory Evaluator
Programmatic checks (cheap, deterministic):
- Tool-call count per ticket. Median + p95. Regression: median doubles or p95 triples.
- Tool-call sequence pattern matching. Define expected sequences per category. Score % of trajectories matching ≥1 valid sequence.
- KB-link extraction from drafts → verify links exist (404 check via cron). Hallucinated KB links are a critical regression.
- Latency per trajectory. Target: p95 <60s, p99 <120s.
- Cost per trajectory. Target: median <$0.04, p95 <$0.08.
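The tool-call-count regression rule ("median doubles or p95 triples") is a one-function check. A sketch, using a simple nearest-rank percentile; threshold multipliers come straight from the rule above:

```python
import statistics


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[idx]


def tool_count_regressed(baseline: list[int], candidate: list[int]) -> bool:
    """Trip if candidate median doubles or candidate p95 triples vs baseline."""
    return (
        statistics.median(candidate) >= 2 * statistics.median(baseline)
        or percentile(candidate, 95) >= 3 * percentile(baseline, 95)
    )
```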
LLM-as-judge checks (use Claude Opus, NOT same model as production Sonnet):
- Trajectory reasonableness: 'Did the agent take a reasonable path to the answer?' Score 1-5 with rubric. Calibrated against 30 manually-scored trajectories.
- Hallucination check: 'Does the draft contain any factual claim not supported by tool outputs?' Binary. This is the highest-stakes check.
- Routing-correct check: agreement between agent's category and ground-truth category from human review.
Output Evaluator
Routing accuracy (programmatic): Compare agent's category to ground-truth from human-team feedback. Target ≥92%, alarm at <88%.
Draft quality (LLM-as-judge with Claude Opus):
Prompt: 'You are evaluating a customer-support triage draft. Rate 1-5 on: (a) addresses the customer's actual question, (b) tone-appropriate, (c) factually grounded in tool outputs, (d) actionable next step. Output JSON: {addresses: 1-5, tone: 1-5, factual: 1-5, actionable: 1-5, overall: 1-5, would_a_human_send_this_with_minor_edits: bool}.' Calibrate against 50 human-reviewed drafts.
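Judge responses need strict validation before they feed a dashboard; a malformed or out-of-range score should fail loudly, not average in as garbage. A sketch of the parsing side (the LLM call itself is omitted; the strictness rules here are our own assumptions, keyed to the JSON shape in the prompt above):

```python
import json

# Score fields the judge prompt asks for, each expected as an int in 1-5.
REQUIRED_SCORES = ("addresses", "tone", "factual", "actionable", "overall")


def parse_judge_response(raw: str) -> dict:
    """Parse and validate the judge's JSON; raise ValueError on any bad field."""
    data = json.loads(raw)
    for key in REQUIRED_SCORES:
        if not (isinstance(data.get(key), int) and 1 <= data[key] <= 5):
            raise ValueError(f"judge score {key!r} missing or out of range")
    if not isinstance(data.get("would_a_human_send_this_with_minor_edits"), bool):
        raise ValueError("missing accept/reject boolean")
    return data
```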
Hallucination rate (programmatic + LLM):
- Programmatic: extract any quoted numbers, account-IDs, KB-URLs from draft. Verify against tool outputs.
- LLM: 'Does the draft contain any factual claim NOT in the tool outputs above?' Run on every eval trajectory.
- Target: <2% hallucination rate. Alarm at >3%.
Empty-draft detection (programmatic): any draft under 30 characters, or one matching stall templates like 'I'll get back to you', is flagged. Target: <0.5%.
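The programmatic halves of the hallucination and empty-draft checks can be sketched together. The regexes and stall-template list below are illustrative assumptions; tune them against your real drafts:

```python
import re

# Specific claims worth verifying: URLs (KB links) and quoted figures.
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def hallucinated_specifics(draft: str, tool_outputs: str) -> list[str]:
    """URLs and numbers quoted in the draft that never appear in tool outputs."""
    claims = URL_RE.findall(draft) + NUMBER_RE.findall(draft)
    return [c for c in claims if c not in tool_outputs]


def is_empty_draft(draft: str, min_chars: int = 30) -> bool:
    """Flag drafts under the length floor or matching known stall templates."""
    stall_templates = ("i'll get back to you", "we will look into it")
    text = draft.strip().lower()
    return len(text) < min_chars or any(t in text for t in stall_templates)
```

Substring matching against raw tool output is crude but catches the worst case (a dollar figure or KB URL that exists nowhere in the tool results); the LLM check covers paraphrased claims.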
Cost & Latency Tracking
Track per-trajectory:
- Tokens in / tokens out / total cost
- Wall-clock latency from ticket arrival to draft+route
- Tool-call count
- Number of LLM turns
Dashboard: weekly aggregates with p50/p95/p99. Alarm thresholds:
- Median cost increases >20% week-over-week
- p95 latency exceeds budget for >5% of trajectories
- Tool-call count median increases >50% (signals tool-misselection-then-recovery patterns)
Regression Alarm Setup
On every prompt edit, model change, or tool definition change: automated eval run on the 230-case golden set. Block deploy if:
- Routing accuracy drops >2 percentage points vs baseline
- Hallucination rate increases >1 percentage point
- Median cost increases >20%
- Trajectory-reasonableness score drops >0.3 points (on 5-point scale)
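The four block criteria above reduce to one gate function. A sketch, with illustrative metric names; rates are fractions (0.92 = 92%) except the 1-5 trajectory score:

```python
def should_block_deploy(baseline: dict, candidate: dict) -> list[str]:
    """Return the tripped block criteria; deploy only if the list is empty."""
    reasons = []
    if baseline["routing_accuracy"] - candidate["routing_accuracy"] > 0.02:
        reasons.append("routing accuracy dropped >2 points")
    if candidate["hallucination_rate"] - baseline["hallucination_rate"] > 0.01:
        reasons.append("hallucination rate rose >1 point")
    if candidate["median_cost"] > 1.20 * baseline["median_cost"]:
        reasons.append("median cost rose >20%")
    if baseline["trajectory_score"] - candidate["trajectory_score"] > 0.3:
        reasons.append("trajectory reasonableness dropped >0.3")
    return reasons
```

Returning the full list of reasons (rather than a bare boolean) is what makes the CI comment on a blocked PR actually actionable.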
Weekly hygiene run on a fresh, anonymized sample of 200 production trajectories. Catches drift not visible in a fixed golden set.
Distinguishing noise from regression: all metrics are tracked over 7-day rolling windows. A single bad day can be sampling variance; a 2+ point drop sustained across a 3-day rolling window is signal.
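The noise-vs-signal rule can be sketched as a sustained-drop check: compare each day against the trailing 7-day mean and only alarm after the required consecutive-day streak. A minimal sketch under those assumptions:

```python
def sustained_drop(daily_accuracy: list[float], days: int = 3,
                   points: float = 0.02) -> bool:
    """True if accuracy sits >= `points` below the trailing 7-day mean for
    `days` consecutive days (signal); a one-day dip resets the streak (noise)."""
    streak = 0
    for i in range(7, len(daily_accuracy)):
        trailing_mean = sum(daily_accuracy[i - 7:i]) / 7
        if trailing_mean - daily_accuracy[i] >= points:
            streak += 1
            if streak >= days:
                return True
        else:
            streak = 0
    return False
```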
Eval Cadence
- Pre-deploy of any prompt/tool/model change: full 230-case golden run + go/no-go criteria
- Weekly: 200-case fresh production sample + drift dashboard
- Monthly: review and refresh 10-20% of golden cases (rotate stale ones, add new failure modes)
- Quarterly: full audit of eval harness itself — are you measuring what matters?
Implementation Skeleton
```
/evals
  golden_cases.jsonl         # 230 cases with ground truth, split as:
  unit_cases.jsonl           #   50 tool-selection cases
  trajectory_cases.jsonl     #   80 integration cases
  e2e_cases.jsonl            #   100 end-to-end cases
  /runners
    run_unit_evals.py        # replays unit cases, scores tool selection
    run_trajectory_evals.py  # replays integration cases, scores path pattern
    run_e2e_evals.py         # full agent invocation, scores output via judge
    run_shadow_eval.py       # samples live traffic, runs candidate agent in parallel, diffs
  /judges
    judge_routing.py         # programmatic: agent_category vs ground_truth
    judge_draft.py           # Claude Opus LLM-as-judge for draft quality
    judge_hallucination.py   # programmatic + LLM hallucination check
    judge_trajectory.py      # Claude Opus trajectory reasonableness
  /dashboards
    weekly_drift.sql         # rolling 7-day metric query
    regression_alarm.py      # post-deploy diff vs baseline
```
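A golden case in the JSONL files might look like the record below. The field names are assumptions about the schema, not a fixed format; `expected_category` is a list so judgment-call cases (like #8's "bug or integration") can accept either answer:

```python
import json

# Illustrative golden-case record for golden_cases.jsonl; one JSON object
# per line. Field names are hypothetical, not the harness's actual schema.
case = {
    "id": "e2e-001",
    "layer": "e2e",
    "input_ticket": "The export-to-CSV button gives me a 500 error every time.",
    "source": "bad_case_sheet",
    "expected_category": ["bug"],
    "expected_trajectory": ["search_past_tickets", "read_kb_article"],
    "rubric": ["routing_correct", "kb_link_validity", "draft_quality_1_5"],
}

line = json.dumps(case)      # serialize as one JSONL line
restored = json.loads(line)  # round-trips cleanly
```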
Wire into CI:
- GitHub Action triggered on PR: full golden eval run, post results as PR comment, block merge if regression thresholds tripped.
- Weekly cron: hygiene run on a fresh production sample.
- Webhook from production: shadow eval continuous run on 5% sampled traffic.
What This Eval Will and Won't Catch
Will catch:
- Routing-accuracy regressions of ≥2 points
- Hallucination-rate increases of ≥1 point
- Cost regressions of ≥20%
- Tool-misselection regressions on the 50 unit cases
- Empty-draft regressions
- Specific known failure modes from your bad-case Sheet
Will NOT catch (known blind spots):
- Long-tail edge cases not in your 230 golden cases (hence weekly fresh-sample run)
- Quality regressions that maintain routing accuracy but produce subtly worse drafts (LLM judge has known calibration limits)
- Drift in customer expectations over time (your bar shifts; eval bar doesn't, until you refresh cases)
- Performance issues caused by the 4 tools' downstream APIs (tool-side problems, not agent problems)
- New failure modes introduced by major architecture changes
Migration Path
1. Week 1: Implement runners + judges. Run baseline eval on current production agent. Freeze numbers.
2. Week 2: Wire CI. Block any PR that regresses >2 points on golden set.
3. Week 3: Implement shadow eval. Run on 1% of traffic.
4. Week 4: Scale shadow to 5% of traffic. Set up dashboards.
5. Week 5: Replace your manual 20-trajectory/week QA with the new harness's weekly hygiene run + a smaller manual review (5/week) of eval-flagged cases.
6. Week 6+: Use the regression alarms to safely scale routing from 60% → 75% → 90% of tickets.
Do NOT increase production routing share until baseline is measured AND CI block is in place. Otherwise you're flying blind on the way up.
Key Takeaways
- Your three known failure modes need three different eval layers (mis-routing → e2e routing-accuracy, hallucination → trajectory + output hallucination check, empty-draft → programmatic length check).
- Use Claude Opus as judge, not Sonnet (your production model). Same-model judging has blind spots on subtle quality regressions.
- Source 60% of golden cases from real failures, not synthetic edge cases. Your bad-case Sheet is gold.
- Implement shadow eval BEFORE scaling routing share. It's the only way to catch regressions that pass golden eval but break on real traffic distribution.
- Cost is part of the eval. A correct-but-expensive agent is a regression; track p50/p95 cost from day one.
- Refresh 10-20% of golden cases monthly. Static eval sets decay as the agent and your customer base evolve.
Common use cases
- Engineer about to ship an agent to production and worried about silent regressions
- Team that already has an agent in prod and gets occasional complaints — needs to add evals retroactively
- Builder evaluating a model upgrade (Opus 4 → Opus 4.5) and needs a rigorous diff
- Solo dev hitting 'works on my prompts but fails on real users' problem
- PM writing a launch criteria doc for an agent and needs measurable success thresholds
Best AI model for this
Claude Opus 4. Eval design requires reasoning about success criteria, edge cases, and metric design — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Pro tips
- Trajectory evals beat output evals. Two trajectories can produce the same final output but one is dangerously fragile. Score the path, not just the destination.
- Golden test cases must include the failures you've ACTUALLY seen. Synthetic edge cases miss the real pain points.
- Separate eval models from production models. If you use Claude Opus for both production and eval judging, you'll mark some failures as successes.
- Run evals on every prompt edit, every tool change, every model bump. Cheap to run, expensive to skip.
- Track three layers: unit (tool selection correctness), integration (multi-step flow correctness), end-to-end (final output quality). Different bugs surface at different layers.
- Cost is part of the eval. An agent that produces correct output but costs 3× more is a regression. Always include cost-per-trajectory.
- Baseline first. Run the eval suite on your current agent and freeze numbers. Future runs compare to this baseline.
Customization tips
- Be honest about your failure modes. The eval harness is only as good as the failures you can name. If 'we don't really know what fails' — say that, and the harness will include exploratory eval.
- Specify whether you can replay production trajectories. This dramatically changes what's possible (shadow eval, regression detection, etc.).
- If pre-launch, ask for the Pre-Launch Mode variant — it adds launch-criteria thresholds rather than regression-from-baseline thresholds.
- Calibrate the LLM-as-judge prompts on your real data. Run the judge on 30-50 manually-scored cases, check correlation with your scores, iterate the rubric until correlation > 0.7.
- Don't skip the cost tracking. Cost regressions are the most common overlooked regression — the agent still works, but each call costs 3× more.
- Re-run quarterly even if no code changes. Your customer base shifts, your bar shifts, eval cases stale. The harness is a living artifact.
Variants
Production Agent Mode
For agents already in production — adds shadow-eval setup that runs in parallel with prod traffic.
Pre-Launch Mode
For agents not yet shipped — adds launch-criteria thresholds and rollout decision rubric.
Model Upgrade Mode
Specifically for evaluating model swaps (Opus 4 → 4.5, Sonnet → Opus, etc.) — adds A/B trajectory comparison.
MCP Server Eval Mode
For evaluating an MCP server — adds tool-level conformance tests and permission-boundary tests.
Frequently asked questions
How do I use the Agent Eval Harness Builder prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Agent Eval Harness Builder?
Claude Opus 4. Eval design requires reasoning about success criteria, edge cases, and metric design — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Can I customize the Agent Eval Harness Builder prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: score the trajectory, not just the final output (two trajectories can produce the same output while one is dangerously fragile), and build golden test cases from the failures you've actually seen — synthetic edge cases miss the real pain points.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.