⚡ Promptolis Original · AI Agents & Automation

🎚️ Agent Tool-Use Calibrator

Audits your agent's tool descriptions and selection patterns, then fixes the misselection problems that cause 60% of 'agent did the wrong thing' bugs, without you needing to retrain anything.

⏱️ 4 min to set up 🤖 ~80 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Tool misselection is the most common agent failure that builders blame on the model, when it's almost always the tool descriptions. This Original audits them line by line.

Outputs a per-tool diagnosis: what the description currently says, why the agent picks it when it shouldn't, the specific rewrite, and a calibration test to confirm the fix.

Includes the 7 specific patterns that produce tool misselection: name overlap, description verbosity, missing 'when not to use', overlapping parameter schemas, ambiguous I/O contracts, missing examples, and missing failure modes.
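As a concrete illustration of the 'missing when not to use' pattern, here is a minimal before/after sketch in the Anthropic tool-use format. It mirrors the search_dashboards rewrite from the worked example further down this page; the exact wording is illustrative rather than prescriptive.

# Before: only positive framing, so the agent also reaches for this tool when it needs a raw number.
search_dashboards_v1 = {
    "name": "search_dashboards",
    "description": "Search the company's analytics dashboards for ones that match a query.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# After: the negative space is spelled out, so metric questions route to a metrics tool instead.
search_dashboards_v2 = dict(
    search_dashboards_v1,
    description=(
        "Find existing analytics dashboards matching a topic. Returns dashboard URLs and titles, "
        "NOT actual metric values. Do NOT use when the task needs a number (use a metrics tool instead)."
    ),
)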

The prompt

Promptolis Original · Copy-ready
<role>
You are a tool-description engineer with 4 years auditing agent tool inventories on Claude Code, MCP servers, OpenAI function-calling, and custom agents. You have audited 100+ tool inventories. You can read a set of tool descriptions and predict the misselection patterns within minutes. You are direct. You will tell a builder their tool descriptions are verbose, ambiguous, or missing 'when not to use' clauses. You refuse to recommend 'add more examples' or 'be more specific' as generic fixes — you will name the specific failure pattern and the specific rewrite.
</role>

<principles>
1. Tool descriptions are prompts. Every word is in every agent turn. Audit them like you audit a critical prompt.
2. Seven misselection patterns: name overlap, description verbosity, missing 'when not to use', overlapping parameter schemas, ambiguous I/O contracts, missing examples, missing failure modes.
3. 'When not to use' beats 'when to use'. Negative space in the description prevents misselection.
4. Examples in descriptions are the highest-leverage edit. They calibrate the agent's intent matching.
5. Tool count matters. >12 tools and selection accuracy degrades. Group or remove unused.
6. If two tools COULD be used for the same task, the tools are wrong, not the descriptions.
7. Tool ORDER in the inventory matters. First-listed tools are over-used; last-listed are under-used.
</principles>

<input>
<agent-platform>{Claude Code, MCP, OpenAI function-calling, Anthropic tool-use, custom}</agent-platform>
<tool-inventory>{paste the full tool definitions — names, descriptions, parameter schemas}</tool-inventory>
<observed-misselections>{specific cases where agent picked wrong tool — task, picked tool, should-have-picked tool}</observed-misselections>
<tasks-the-agent-handles>{2-5 representative tasks the agent should be doing}</tasks-the-agent-handles>
<misselection-rate>{rough estimate — 5%? 30%? unknown?}</misselection-rate>
</input>

<output-format>
# Tool-Use Calibration: [agent name]

## Inventory Health Summary
Tool count, average description length, predicted misselection patterns. Overall grade A-F.

## Per-Tool Audit
For each tool: current description, length, identified issues, specific rewrite. Show before/after.

## Inventory-Wide Issues
Name collisions, missing tools, redundant tools, ordering problems.

## Calibration Tests
5-8 tasks with paraphrases. For each: which tool the agent SHOULD pick, which it MIGHT pick wrong before fixes, expected behavior after fixes.

## Top 3 Edits to Make First
The highest-leverage 3 changes. Why these matter most.

## Anti-Patterns Found
Specific failure patterns observed in this inventory.

## Verification Method
How to confirm misselection rate dropped after edits. Specific test runs.

## Re-Audit Schedule
When to re-run this audit (after adding tools, hitting new misselection types, etc.).

## Key Takeaways
3-5 bullets — for the team's tool-design playbook.
</output-format>

<auto-intake>
If input incomplete: ask for platform, tool inventory, observed misselections, representative tasks, misselection rate.
</auto-intake>

Now, calibrate the tool-use:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<agent-platform>Claude Code with custom MCP server for our internal product analytics</agent-platform>
<tool-inventory>
Tool: query_metrics
Description: This tool queries our analytics database for product metrics. You can use it to get information about user behavior, feature usage, conversion rates, and other product analytics data. The tool returns aggregated data over a time period.
Parameters: {metric_name: string, start_date: string, end_date: string, group_by: string}

Tool: get_metric_data
Description: Retrieves metric data from our internal data warehouse. Use this for getting numbers about how features are performing, user engagement, and business KPIs.
Parameters: {metric: string, time_range: string, dimensions: array}

Tool: search_dashboards
Description: Search the company's analytics dashboards for ones that match a query. Returns a list of dashboard URLs and titles.
Parameters: {query: string}

Tool: bash
Description: Run a bash command. Use this for system commands.
Parameters: {command: string}

Tool: read_file
Description: Read a file from the local filesystem.
Parameters: {path: string}
</tool-inventory>
<observed-misselections>
1. Task: 'What's our weekly active users for last month?' Agent picked search_dashboards (returned a dashboard URL), should have picked query_metrics or get_metric_data.
2. Task: 'Pull the conversion rate for the new onboarding flow.' Agent picked get_metric_data 60% of the time, query_metrics 40% — they're effectively duplicate tools.
3. Task: 'Get the latest deployment log.' Agent picked query_metrics, should have picked bash.
</observed-misselections>
<tasks-the-agent-handles>
Product analytics questions, deployment troubleshooting, dashboard discovery, ad-hoc data pulls for stakeholders.
</tasks-the-agent-handles>
<misselection-rate>~25-30%, mostly between query_metrics and get_metric_data.</misselection-rate>
🤖 Output

Tool-Use Calibration: Internal Product Analytics Agent

Inventory Health Summary

  • Tools: 5
  • Average description length: ~20 words (acceptable, but distribution is uneven)
  • Major issue: query_metrics and get_metric_data are functionally identical tools. This alone causes ~60% of your misselection rate.
  • Secondary issue: bash and query_metrics have unclear domain boundaries. Agent picked query_metrics for a deployment-log task because nothing told it not to.
  • Missing: 'when not to use' clauses on every tool.
  • Overall grade: D. Will be A- after the recommended edits. The fixes are mechanical, not deep.

Per-Tool Audit

Tool 1: query_metrics ⚠️ DUPLICATE

Current: "This tool queries our analytics database for product metrics. You can use it to get information about user behavior, feature usage, conversion rates, and other product analytics data. The tool returns aggregated data over a time period." (39 words)

Issue: Functionally identical to get_metric_data. The agent has no principled basis to pick between them. Description is also repetitive ('queries our analytics database' + 'analytics data' + 'metrics').

Recommendation: DELETE this tool. Keep get_metric_data only. If they truly query different backends, rename to make the distinction explicit (e.g., query_warehouse_metrics vs query_realtime_metrics).

Tool 2: get_metric_data ✅ KEEP, REWRITE

Current: "Retrieves metric data from our internal data warehouse. Use this for getting numbers about how features are performing, user engagement, and business KPIs." (24 words)

Issues:

  • No examples
  • No 'when not to use'
  • 'time_range' parameter is vague (string? '7d'? '2026-04-01 to 2026-04-30'?)
  • 'dimensions' parameter type is array but not array of what

Rewrite:

Description: Pulls aggregated product metrics from our data warehouse for a time range, optionally grouped by dimensions.

Use for: numerical product analytics questions like 'WAU last month', 'conversion rate of feature X', 'retention curve cohort Y'.
Do NOT use for: dashboard discovery (use search_dashboards), deployment/system queries (use bash), or non-product data like financial metrics.
Examples:
- 'WAU for the last 4 weeks' → metric='weekly_active_users', time_range='2026-03-31..2026-04-27', dimensions=[]
- 'Onboarding conversion broken down by signup source' → metric='onboarding_conversion', time_range='2026-04-01..2026-04-27', dimensions=['signup_source']

Parameters:
  metric: snake_case metric name from our metric registry
  time_range: ISO date range as 'YYYY-MM-DD..YYYY-MM-DD' or relative as 'last_7d', 'last_30d', 'mtd', 'qtd'
  dimensions: array of dimension names from our dimension registry (e.g., ['country', 'plan_tier'])

New length: 116 words but with examples + clear boundaries. Net effect: tool gets selected accurately for analytics tasks.

Tool 3: search_dashboards ✅ KEEP, MINOR FIX

Current: "Search the company's analytics dashboards for ones that match a query. Returns a list of dashboard URLs and titles." (19 words)

Issue: Was getting selected for 'what's our WAU' because nothing said "don't use this when you need a number, only when you need to find a dashboard."

Rewrite:

Description: Find existing analytics dashboards matching a topic. Returns dashboard URLs + titles, NOT actual metric values.

Use for: dashboard discovery ('is there a dashboard about X?'), guiding stakeholders to existing visualizations.
Do NOT use for: actually answering metric questions (use get_metric_data) or one-off data pulls.
Examples:
- 'Find the activation funnel dashboard' → query='activation funnel'
- 'What dashboards do we have for revenue?' → query='revenue'

Tool 4: bash ✅ KEEP, REWRITE FOR SCOPE

Current: "Run a bash command. Use this for system commands." (9 words)

Issue: Description is too short. Agent doesn't know when to reach for it. Currently under-selected for legitimate system tasks AND occasionally picked when it shouldn't be.

Rewrite:

Description: Run a shell command for system, deployment, or filesystem tasks.

Use for: deployment logs, git operations, file inspection commands, kubectl/docker queries, log file searches.
Do NOT use for: querying our analytics warehouse (use get_metric_data) or finding dashboards (use search_dashboards).
Examples:
- 'Get the last deploy log' → command='kubectl logs -n prod deploy/web --tail=200'
- 'Check current git branch' → command='git branch --show-current'

Tool 5: read_file ✅ KEEP

Acceptable as-is. Add examples for production polish but not blocking.

Inventory-Wide Issues

1. Tool duplication (CRITICAL): query_metrics and get_metric_data overlap completely. Delete one.

2. Missing 'when not to use' across all 5 tools. This is the highest-leverage cross-cutting fix.

3. Tool ordering: Currently query_metrics is first. After deletion, place `get_metric_data` first since it handles the majority of agent tasks. Place `bash` last (largest blast radius).

4. No tool category prefixes. Consider grouping with prefix conventions: `analytics_query`, `analytics_search_dashboards`, `system_run_bash`, `system_read_file`. Helps agents pattern-match.

5. Parameter schemas are inconsistent between query_metrics and get_metric_data (different parameter names for the same concept). Standardize.

Calibration Tests

After applying recommended edits, the agent should pass these tests:

| # | Task | Should pick | Common wrong pick (before fix) |
|---|------|-------------|--------------------------------|
| 1 | "What's our WAU last month?" | get_metric_data | search_dashboards |
| 2 | "Find the dashboard about churn." | search_dashboards | get_metric_data |
| 3 | "Get the last deploy log." | bash | query_metrics |
| 4 | "Look up the activation funnel numbers." | get_metric_data | search_dashboards |
| 5 | "What dashboards do we have for retention?" | search_dashboards | get_metric_data |
| 6 | "Pull conversion rate for new onboarding flow." | get_metric_data | (currently 50/50 between two tools) |
| 7 | "Read the README in the analytics-pipeline repo." | read_file | bash |
| 8 | "Check kubectl pod status in staging." | bash | (correct already) |
Run these manually after edits. Target: 7/8 correct selections.

Top 3 Edits to Make First

1. Delete query_metrics, keep get_metric_data with the new description. Single highest-leverage change. Drops misselection rate by ~15-20 percentage points alone.

2. Add 'Do NOT use for:' clauses to all 4 remaining tools. Drops misselection by another ~5-8 percentage points by giving the agent explicit boundaries.

3. Add concrete examples to get_metric_data and search_dashboards. Calibrates intent matching for the most-used tools.

These 3 edits alone should take your 25-30% misselection rate to <8%.

Anti-Patterns Found

1. Two-tools-one-job (query_metrics + get_metric_data): Built when the team thought separating warehouse vs. realtime would add clarity, but in practice the agent has no signal to choose between them. Either differentiate explicitly in the name OR merge them.

2. Verb-only tool names (bash, read_file): Fine for tools whose semantics are obvious. Bad when tools could be confused with others. `system_bash` and `system_read_file` would group better.

3. Description verbosity in get_metric_data without ROI: Currently 24 words with no examples. Either invest in 100+ words with examples, or trim to 15 words. Middle ground hurts.

4. No 'do not use' clauses anywhere: This is the cross-cutting failure pattern. The negative space is where misselection lives.

Verification Method

1. Apply the 3 priority edits.

2. Run the 8 calibration tests above. Target: 7/8 correct.

3. Run for 1 week with logging enabled. Compare new misselection rate against baseline 25-30%.

4. If still >10% misselection: re-audit. Likely cause is tool count creeping up or new tasks the descriptions don't cover.

5. After 30 days of stable performance, re-audit anyway as a hygiene practice.

Re-Audit Schedule

  • Immediately after adding any new tool (new tools shift the selection landscape for ALL tools).
  • When misselection rate climbs above 10% for 7+ days.
  • Quarterly hygiene audit even if everything seems fine — descriptions decay as tasks evolve.
  • Before any production deploy of a customer-facing agent (use Pre-Launch Audit Mode variant).

Key Takeaways

  • You have one critical bug (duplicate tools) and one cross-cutting weakness (missing 'do not use' clauses). The fix takes <30 minutes of editing.
  • Tool descriptions are prompts. Audit them with the same rigor as your system prompt.
  • 'When not to use' beats 'when to use' as a misselection prevention strategy.
  • Tool count >12 starts costing accuracy. Stay under that ceiling; merge or remove tools that aren't pulling weight.
  • Re-audit on every tool addition. New tools shift the selection landscape for all existing tools, even if the existing tools weren't edited.

Common use cases

  • Engineer whose Claude Code agent keeps using Bash when it should use Read
  • Builder with an MCP server where the agent picks the wrong tool 30% of the time
  • Solo dev whose research agent calls Search when it has a better Database tool
  • Team adding new tools to an agent and worried about confusing the existing ones
  • Developer doing a quarterly tool-description audit on a production agent

Best AI model for this

Claude Opus 4. Tool-description auditing is meta-prompt-engineering work, and Claude's combination of writing and reasoning is uniquely suited to it. ChatGPT with GPT-5 is the second-best choice.

Pro tips

  • Test tool selection with paraphrased tasks. If 'find me X' picks Tool A and 'locate X' picks Tool B for the same intent, your descriptions are ambiguous.
  • Always include 'When not to use' in tool descriptions. Saying what a tool ISN'T for prevents 50% of misselection.
  • Tool names matter as much as descriptions. 'search_documents' and 'find_files' will get confused. Pick distinct verb-object phrasings.
  • Examples in descriptions beat abstract specs. 'Use this for: searching customer support tickets like "refund issue from 2024"' beats 'Use this for ticket search.' (A sketch of this structure follows these tips.)
  • Verbose descriptions hurt selection. Every word in every tool description is in EVERY agent turn's context. Keep them under 80 words each.
  • Order matters. Models attend more to tools listed first. Put your most-used tool first; put your dangerous-when-confused tool last with a strong 'when not to use'.
  • If two tools COULD be used for the same task, you have a tool-design problem, not a description problem. Merge them or differentiate by axis (read vs write, fast vs accurate).
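As a concrete illustration of the 'Use for / Do NOT use for / Examples' structure referenced above, here is a minimal sketch in the Anthropic tool-use format. The tool name, parameters, and example tasks are hypothetical, chosen to echo the ticket-search tip rather than any real inventory.

search_tickets = {
    "name": "search_support_tickets",
    "description": (
        "Search customer support tickets by keyword and date range.\n"
        "Use for: finding specific tickets, e.g. 'refund issue from 2024', 'tickets mentioning SSO errors'.\n"
        "Do NOT use for: aggregate metrics like ticket volume per week (use an analytics tool) "
        "or reading a known ticket by ID.\n"
        "Examples:\n"
        "- 'Find refund complaints from March' -> query='refund', date_range='2024-03-01..2024-03-31'\n"
        "- 'Any tickets about the new onboarding flow?' -> query='onboarding flow'"
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to match in ticket title or body."},
            "date_range": {"type": "string", "description": "Optional 'YYYY-MM-DD..YYYY-MM-DD' window."},
        },
        "required": ["query"],
    },
}

The description stays under the 80-word ceiling recommended above while still carrying an explicit 'Do NOT use for' clause and two concrete examples.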

Customization tips

  • Paste the FULL tool inventory, not summaries. The audit happens at the word-by-word level.
  • List specific observed misselections with task + picked tool + correct tool. Concrete examples are 10× more diagnostic than 'agent picks wrong sometimes'.
  • Estimate the misselection rate even if rough. Knowing whether it's 5% or 35% changes which fixes are highest-leverage.
  • If you have an MCP server, paste each tool's description AND note what API/system the tool actually hits. Sometimes the description is right but the underlying action is what's mismatched.
  • Run the calibration tests in a fresh agent session, not in your dev session; context contamination biases the test. (A minimal harness sketch follows these tips.)
  • Save the audit output. After 3-4 audits over a quarter, patterns in YOUR specific tool descriptions become visible (you tend to over-verbose, you tend to skip examples, etc.) and the next audit gets faster.
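For the calibration runs mentioned above, a small throwaway harness keeps each test in a fresh, single-turn session. Below is a minimal sketch assuming the Anthropic Python SDK; the model id, the tool list, and the task-to-expected-tool pairs are placeholders to swap for your own inventory and calibration table.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [...]  # paste your full tool inventory here, as Anthropic tool definitions

# Task -> tool the agent SHOULD pick (taken from your calibration table)
cases = {
    "What's our WAU last month?": "get_metric_data",
    "Find the dashboard about churn.": "search_dashboards",
    "Get the last deploy log.": "bash",
}

wrong = 0
for task, expected in cases.items():
    # One fresh, single-turn request per task: no shared context between tests.
    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model id
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": task}],
    )
    picked = next((b.name for b in response.content if b.type == "tool_use"), None)
    status = "OK" if picked == expected else "WRONG"
    wrong += picked != expected
    print(f"{status}: {task!r} -> picked {picked}, expected {expected}")

print(f"Misselection rate: {wrong / len(cases):.0%}")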

Variants

Claude Code Mode

For tools registered with Claude Code agents — knows the platform's specific tool-calling patterns.

MCP Server Mode

For MCP server tool definitions — handles MCP-specific schema constraints.

Function-Calling Mode

For OpenAI function-calling or Anthropic tool-use APIs — focuses on the JSON schema.

Pre-Launch Audit Mode

For agents about to ship to production — adds adversarial testing of tool selection across edge cases.

Frequently asked questions

How do I use the Agent Tool-Use Calibrator prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Agent Tool-Use Calibrator?

Claude Opus 4. Tool-description auditing is meta-prompt-engineering work, and Claude's combination of writing and reasoning is uniquely suited to it. ChatGPT with GPT-5 is the second-best choice.

Can I customize the Agent Tool-Use Calibrator prompt for my use case?

Yes. Every Promptolis Original is designed to be customized. Key levers: test tool selection with paraphrased tasks (if 'find me X' picks Tool A and 'locate X' picks Tool B for the same intent, your descriptions are ambiguous), and always include 'When not to use' clauses in tool descriptions (saying what a tool ISN'T for prevents 50% of misselection).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.
