⚡ Promptolis Original · AI Agents & Automation
🎚️ Agent Tool-Use Calibrator
Audits your agent's tool descriptions and selection patterns — fixes the misselection problems that cause 60% of 'agent did the wrong thing' bugs, without you needing to retrain anything.
Why this is epic
Tool misselection is the most common agent failure that builders blame on the model. It's almost always your tool descriptions. This Original audits them line-by-line.
Outputs a per-tool diagnosis: what the description currently says, why the agent picks it wrongly, the specific rewrite, and a calibration test to confirm the fix.
Includes the 7 specific patterns that produce tool-misselection: name overlap, description verbosity, missing 'when not to use', overlapping parameter schemas, ambiguous I/O contracts, missing examples, missing failure modes.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<agent-platform>Claude Code with custom MCP server for our internal product analytics</agent-platform>
<tool-inventory>
Tool: query_metrics
Description: This tool queries our analytics database for product metrics. You can use it to get information about user behavior, feature usage, conversion rates, and other product analytics data. The tool returns aggregated data over a time period.
Parameters: {metric_name: string, start_date: string, end_date: string, group_by: string}
Tool: get_metric_data
Description: Retrieves metric data from our internal data warehouse. Use this for getting numbers about how features are performing, user engagement, and business KPIs.
Parameters: {metric: string, time_range: string, dimensions: array}
Tool: search_dashboards
Description: Search the company's analytics dashboards for ones that match a query. Returns a list of dashboard URLs and titles.
Parameters: {query: string}
Tool: bash
Description: Run a bash command. Use this for system commands.
Parameters: {command: string}
Tool: read_file
Description: Read a file from the local filesystem.
Parameters: {path: string}
</tool-inventory>
<observed-misselections>
1. Task: 'What's our weekly active users for last month?' Agent picked search_dashboards (returned a dashboard URL), should have picked query_metrics or get_metric_data.
2. Task: 'Pull the conversion rate for the new onboarding flow.' Agent picked get_metric_data 60% of the time, query_metrics 40% — they're effectively duplicate tools.
3. Task: 'Get the latest deployment log.' Agent picked query_metrics, should have picked bash.
</observed-misselections>
<tasks-the-agent-handles>
Product analytics questions, deployment troubleshooting, dashboard discovery, ad-hoc data pulls for stakeholders.
</tasks-the-agent-handles>
<misselection-rate>~25-30%, mostly between query_metrics and get_metric_data.</misselection-rate>
Tool-Use Calibration: Internal Product Analytics Agent
Inventory Health Summary
- Tools: 5
- Average description length: ~19 words (acceptable, but distribution is uneven — from 39 words down to 6)
- Major issue: query_metrics and get_metric_data are functionally identical tools. This alone causes ~60% of your misselection rate.
- Secondary issue: bash and query_metrics have unclear domain boundaries. Agent picked query_metrics for a deployment-log task because nothing told it not to.
- Missing: 'when not to use' clauses on every tool.
- Overall grade: D. Will be A- after the recommended edits. The fixes are mechanical, not deep.
Per-Tool Audit
Tool 1: query_metrics ⚠️ DUPLICATE
Current: "This tool queries our analytics database for product metrics. You can use it to get information about user behavior, feature usage, conversion rates, and other product analytics data. The tool returns aggregated data over a time period." (39 words)
Issue: Functionally identical to get_metric_data. The agent has no principled basis to pick between them. Description is also repetitive ('queries our analytics database' + 'analytics data' + 'metrics').
Recommendation: DELETE this tool. Keep get_metric_data only. If they truly query different backends, rename to make the distinction explicit (e.g., query_warehouse_metrics vs query_realtime_metrics).
Tool 2: get_metric_data ✅ KEEP, REWRITE
Current: "Retrieves metric data from our internal data warehouse. Use this for getting numbers about how features are performing, user engagement, and business KPIs." (24 words)
Issues:
- No examples
- No 'when not to use'
- 'time_range' parameter is vague (string? '7d'? '2026-04-01 to 2026-04-30'?)
- 'dimensions' parameter type is array but not array of what
Rewrite:
Description: Pulls aggregated product metrics from our data warehouse for a time range, optionally grouped by dimensions.
Use for: numerical product analytics questions like 'WAU last month', 'conversion rate of feature X', 'retention curve cohort Y'.
Do NOT use for: dashboard discovery (use search_dashboards), deployment/system queries (use bash), or non-product data like financial metrics.
Examples:
- 'WAU for the last 4 weeks' → metric='weekly_active_users', time_range='2026-03-31..2026-04-27', dimensions=[]
- 'Onboarding conversion broken down by signup source' → metric='onboarding_conversion', time_range='2026-04-01..2026-04-27', dimensions=['signup_source']
Parameters:
metric: snake_case metric name from our metric registry
time_range: ISO date range as 'YYYY-MM-DD..YYYY-MM-DD' or relative as 'last_7d', 'last_30d', 'mtd', 'qtd'
dimensions: array of dimension names from our dimension registry (e.g., ['country', 'plan_tier'])
New length: 116 words but with examples + clear boundaries. Net effect: tool gets selected accurately for analytics tasks.
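Rendered as a JSON-schema tool definition (the common `name`/`description`/`input_schema` shape used by Anthropic tool-use and similar APIs), the rewritten tool might look like the sketch below. The metric registry, dimension names, and relative-range tokens are illustrative assumptions, not a real API:

```python
# Sketch of the rewritten get_metric_data as a JSON-schema tool definition.
# The metric names, dimension names, and relative-range tokens are illustrative.
get_metric_data = {
    "name": "get_metric_data",
    "description": (
        "Pulls aggregated product metrics from our data warehouse for a time "
        "range, optionally grouped by dimensions. Use for numerical product "
        "analytics questions ('WAU last month', 'conversion rate of feature X'). "
        "Do NOT use for dashboard discovery (use search_dashboards) or "
        "deployment/system queries (use bash)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "metric": {
                "type": "string",
                "description": "snake_case metric name from the metric registry",
            },
            "time_range": {
                "type": "string",
                "description": (
                    "'YYYY-MM-DD..YYYY-MM-DD' or a relative token: "
                    "'last_7d', 'last_30d', 'mtd', 'qtd'"
                ),
            },
            "dimensions": {
                "type": "array",
                "items": {"type": "string"},
                "description": "dimension names, e.g. ['country', 'plan_tier']",
            },
        },
        "required": ["metric", "time_range"],
    },
}
```

Note that the negative clauses ("Do NOT use for…") live inside the description string itself — that is the only channel the model sees at selection time.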
Tool 3: search_dashboards ✅ KEEP, MINOR FIX
Current: "Search the company's analytics dashboards for ones that match a query. Returns a list of dashboard URLs and titles." (19 words)
Issue: It was getting selected for 'what's our WAU' because nothing said "don't use this when you need a number, only when you need to find a dashboard."
Rewrite:
Description: Find existing analytics dashboards matching a topic. Returns dashboard URLs + titles, NOT actual metric values.
Use for: dashboard discovery ('is there a dashboard about X?'), guiding stakeholders to existing visualizations.
Do NOT use for: actually answering metric questions (use get_metric_data) or one-off data pulls.
Examples:
- 'Find the activation funnel dashboard' → query='activation funnel'
- 'What dashboards do we have for revenue?' → query='revenue'
Tool 4: bash ✅ KEEP, REWRITE FOR SCOPE
Current: "Run a bash command. Use this for system commands." (9 words)
Issue: Description is too short for the agent to know when to reach for it. It's currently under-selected for legitimate system tasks AND occasionally picked when it shouldn't be.
Rewrite:
Description: Run a shell command for system, deployment, or filesystem tasks.
Use for: deployment logs, git operations, file inspection commands, kubectl/docker queries, log file searches.
Do NOT use for: querying our analytics warehouse (use get_metric_data) or finding dashboards (use search_dashboards).
Examples:
- 'Get the last deploy log' → command='kubectl logs -n prod deploy/web --tail=200'
- 'Check current git branch' → command='git branch --show-current'
Tool 5: read_file ✅ KEEP
Acceptable as-is. Adding examples would give it production polish, but that's not blocking.
Inventory-Wide Issues
1. Tool duplication (CRITICAL): query_metrics and get_metric_data overlap completely. Delete one.
2. Missing 'when not to use' across all 5 tools. This is the highest-leverage cross-cutting fix.
3. Tool ordering: Currently query_metrics is first. After deletion, place `get_metric_data` first since it handles the majority of agent tasks. Place `bash` last, since it has the highest blast radius when misselected.
4. No tool category prefixes. Consider grouping with prefix conventions: `analytics_query`, `analytics_search_dashboards`, `system_run_bash`, `system_read_file`. Helps agents pattern-match.
5. Parameter schemas are inconsistent between query_metrics and get_metric_data (different parameter names for the same concept). Standardize.
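One way to make the cross-cutting fixes mechanical is to generate every description from a structured template, so a missing 'do not use' clause fails loudly at build time. A minimal sketch, assuming you assemble tool definitions in code (the field names and example tool here are illustrative):

```python
def render_description(summary, use_for, do_not_use_for, examples=()):
    """Render a tool description from structured fields.

    Raises if the 'do not use' clause is missing, so every tool in the
    inventory is forced to declare its negative space.
    """
    if not do_not_use_for:
        raise ValueError("every tool needs at least one 'do not use' clause")
    parts = [summary]
    parts.append("Use for: " + "; ".join(use_for) + ".")
    parts.append("Do NOT use for: " + "; ".join(do_not_use_for) + ".")
    if examples:
        parts.append("Examples: " + " | ".join(examples))
    return "\n".join(parts)

# Example: the search_dashboards rewrite expressed through the template.
desc = render_description(
    summary="Find existing analytics dashboards matching a topic. "
            "Returns dashboard URLs + titles, NOT actual metric values.",
    use_for=["dashboard discovery ('is there a dashboard about X?')"],
    do_not_use_for=["answering metric questions (use get_metric_data)"],
    examples=["'Find the activation funnel dashboard' -> query='activation funnel'"],
)
```

The point is not this exact template; it is that a schema-enforced structure turns "remember to add the clause" into "the build fails without it."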
Calibration Tests
After applying recommended edits, the agent should pass these tests:
| # | Task | Should pick | Common wrong pick (before fix) |
|---|---|---|---|
| 1 | "What's our WAU last month?" | get_metric_data | search_dashboards |
| 2 | "Find the dashboard about churn." | search_dashboards | get_metric_data |
| 3 | "Get the last deploy log." | bash | query_metrics |
| 4 | "Look up the activation funnel numbers." | get_metric_data | search_dashboards |
| 5 | "What dashboards do we have for retention?" | search_dashboards | get_metric_data |
| 6 | "Pull conversion rate for new onboarding flow." | get_metric_data | (currently 50/50 between two tools) |
| 7 | "Read the README in the analytics-pipeline repo." | read_file | bash |
| 8 | "Check kubectl pod status in staging." | bash | (correct already) |
Run these manually after edits. Target: 7/8 correct selections.
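The calibration run can also be scripted once you have a way to ask the agent which tool it would pick for a task. A sketch, where `select_tool` is a stand-in you'd implement yourself (e.g. a single-turn, fresh-session request to your agent that returns only the chosen tool name):

```python
# Calibration cases from the table above: (task, expected tool).
CALIBRATION_CASES = [
    ("What's our WAU last month?", "get_metric_data"),
    ("Find the dashboard about churn.", "search_dashboards"),
    ("Get the last deploy log.", "bash"),
    ("Look up the activation funnel numbers.", "get_metric_data"),
    ("What dashboards do we have for retention?", "search_dashboards"),
    ("Pull conversion rate for new onboarding flow.", "get_metric_data"),
    ("Read the README in the analytics-pipeline repo.", "read_file"),
    ("Check kubectl pod status in staging.", "bash"),
]

def run_calibration(select_tool, cases=CALIBRATION_CASES, target=7):
    """Score the agent's tool selection against the calibration cases.

    select_tool: callable(task_text) -> name of the tool the agent picked.
    Returns (passed, failures); passed is True when the score meets target.
    """
    failures = [
        (task, expected, picked)
        for task, expected in cases
        if (picked := select_tool(task)) != expected
    ]
    score = len(cases) - len(failures)
    return score >= target, failures
```

Run each case in a clean session so prior context doesn't bias selection; the `failures` list tells you exactly which description to re-audit.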
Top 3 Edits to Make First
1. Delete query_metrics, keep get_metric_data with the new description. Single highest-leverage change. Drops misselection rate by ~15-20 percentage points alone.
2. Add 'Do NOT use for:' clauses to all 4 remaining tools. Drops misselection by another ~5-8 percentage points by giving the agent explicit boundaries.
3. Add concrete examples to get_metric_data and search_dashboards. Calibrates intent matching for the most-used tools.
These 3 edits alone should take your 25-30% misselection rate to <8%.
Anti-Patterns Found
1. Two-tools-one-job (query_metrics + get_metric_data): Likely built because the team thought separating warehouse vs. realtime queries would add clarity, but in practice the agent has no signal to choose between them. Either differentiate explicitly in the name OR merge.
2. Verb-only tool names (bash, read_file): Fine for tools whose semantics are obvious. Bad when tools could be confused with others. `system_bash` and `system_read_file` would group better.
3. Middle-ground description length in get_metric_data: Currently 24 words with no examples. Either invest in 100+ words with examples, or trim to 15 words. The middle ground hurts: too long to be a crisp label, too short to calibrate intent.
4. No 'do not use' clauses anywhere: This is the cross-cutting failure pattern. The negative space is where misselection lives.
Verification Method
1. Apply the 3 priority edits.
2. Run the 8 calibration tests above. Target: 7/8 correct.
3. Run for 1 week with logging enabled. Compare new misselection rate against baseline 25-30%.
4. If still >10% misselection: re-audit. Likely cause is tool count creeping up or new tasks the descriptions don't cover.
5. After 30 days of stable performance, re-audit anyway as a hygiene practice.
Re-Audit Schedule
- Immediately after adding any new tool (new tools shift the selection landscape for ALL tools).
- When misselection rate climbs above 10% for 7+ days.
- Quarterly hygiene audit even if everything seems fine — descriptions decay as tasks evolve.
- Before any production deploy of a customer-facing agent (use Pre-Launch Audit Mode variant).
Key Takeaways
- You have one critical bug (duplicate tools) and one cross-cutting weakness (missing 'do not use' clauses). The fix takes <30 minutes of editing.
- Tool descriptions are prompts. Audit them with the same rigor as your system prompt.
- 'When not to use' beats 'when to use' as a misselection prevention strategy.
- Tool count >12 starts costing accuracy. Stay under that ceiling; merge or remove tools that aren't pulling weight.
- Re-audit on every tool addition. New tools shift the selection landscape for all existing tools, even if the existing tools weren't edited.
Common use cases
- Engineer whose Claude Code agent keeps using Bash when it should use Read
- Builder with an MCP server where the agent picks the wrong tool 30% of the time
- Solo dev whose research agent calls Search when it has a better Database tool
- Team adding new tools to an agent and worried about confusing the existing ones
- Developer doing a quarterly tool-description audit on a production agent
Best AI model for this
Claude Opus 4. Tool-description auditing is meta-prompt-engineering work, and Claude's combined writing-and-reasoning strength is uniquely suited to it. GPT-5 is the runner-up.
Pro tips
- Test tool selection with paraphrased tasks. If 'find me X' picks Tool A and 'locate X' picks Tool B for the same intent, your descriptions are ambiguous.
- Always include a 'When not to use' clause in tool descriptions. Saying what a tool ISN'T for prevents roughly half of misselections.
- Tool names matter as much as descriptions. 'search_documents' and 'find_files' will get confused. Pick distinct verb-object phrasings.
- Examples in descriptions beat abstract specs. 'Use this for: searching customer support tickets like "refund issue from 2024"' beats 'Use this for ticket search.'
- Verbose descriptions hurt selection. Every word of every tool description sits in EVERY agent turn's context. Keep most under 80 words; only your one or two workhorse tools earn a longer, example-rich description.
- Order matters. Models attend more to tools listed first. Put your most-used tool first; put your dangerous-when-confused tool last with a strong 'when not to use'.
- If two tools COULD be used for the same task, you have a tool-design problem, not a description problem. Merge them or differentiate by axis (read vs write, fast vs accurate).
Customization tips
- Paste the FULL tool inventory, not summaries. The audit happens at the word-by-word level.
- List specific observed misselections with task + picked tool + correct tool. Concrete examples are 10× more diagnostic than 'agent picks wrong sometimes'.
- Estimate the misselection rate even if rough. Knowing whether it's 5% or 35% changes which fixes are highest-leverage.
- If you have an MCP server, also describe what each tool actually does (the description) AND what API/system it hits. Sometimes the description is right but the underlying action is what's mismatched.
- Run the calibration tests in a fresh agent session, not in your dev session — context contamination biases the test.
- Save the audit output. After 3-4 audits over a quarter, patterns in YOUR specific tool descriptions become visible (you tend toward verbosity, you tend to skip examples, etc.) and the next audit gets faster.
Variants
Claude Code Mode
For tools registered with Claude Code agents — knows the platform's specific tool-calling patterns.
MCP Server Mode
For MCP server tool definitions — handles MCP-specific schema constraints.
Function-Calling Mode
For OpenAI function-calling or Anthropic tool-use APIs — focuses on the JSON schema.
Pre-Launch Audit Mode
For agents about to ship to production — adds adversarial testing of tool selection across edge cases.
Frequently asked questions
How do I use the Agent Tool-Use Calibrator prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Agent Tool-Use Calibrator?
Claude Opus 4. Tool-description auditing is meta-prompt-engineering work, and Claude's combined writing-and-reasoning strength is uniquely suited to it. GPT-5 is the runner-up.
Can I customize the Agent Tool-Use Calibrator prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: test tool selection with paraphrased tasks (if 'find me X' picks Tool A and 'locate X' picks Tool B for the same intent, your descriptions are ambiguous), and always include 'When not to use' clauses: saying what a tool ISN'T for prevents roughly half of misselections.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals