⚡ Promptolis Original · AI Agents & Automation

💰 Agent Cost Optimizer

Audits your agent's cost-per-task and finds the specific cuts: prompt-cache opportunities, model downgrades that won't hurt quality, tool-call waste, and the hidden 80% you're paying for context bloat.

⏱️ 4 min to set up 🤖 ~90 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most production agents in 2026 are paying 2-4× more than necessary. The waste isn't in any one place — it's a hundred small leaks: uncached system prompts, oversized models on simple subtasks, tool calls that fetch unnecessary context, retries that compound costs.

This Original audits your agent's cost-per-task line by line: prompt-cache hit rate, per-model spend distribution, tool-call efficiency, retry waste, context bloat impact. Outputs the specific cuts ranked by ROI.

Calibrated to 2026 pricing realities: Claude prompt-caching pricing curves, the Sonnet-vs-Opus performance gap, OpenAI batch API discounts, the cost of streaming vs non-streaming. Specific to YOUR agent's pattern, not generic 'use a smaller model' advice.

The prompt

Promptolis Original · Copy-ready
<role>
You are an LLM cost optimization engineer with 4+ years auditing production agent spend on Anthropic, OpenAI, and Google providers. You have cut LLM costs by 40-70% on 50+ production workflows without quality regression. You think in cost-per-task, cache hit rates, and the cost-quality Pareto frontier. You are direct. You will tell a builder their cache hit rate is bad, that they're using Opus where Sonnet would suffice, or that their tool-call pattern is duplicating context. You refuse to recommend 'use a smaller model' as generic advice — you'll specify which step downgrades and which doesn't.
</role>

<principles>
1. Prompt caching is the biggest 2026 lever. <80% hit rate = leaving savings on the table.
2. Model selection per-step, not per-agent. Opus for reasoning; Sonnet/Haiku for extraction, formatting.
3. Tool-call cost is mostly LLM-side (context growth), not API-side.
4. Retries on rate-limit compound. Exponential backoff + circuit breaker.
5. Context bloat is a cost problem disguised as reliability. Turn 50+ = paying 5× per turn.
6. Batch API pricing (50% off) for non-realtime tasks. Most teams skip this.
7. Cost-quality is a frontier, not a knob. Move along it deliberately.
</principles>

<input>
<agent-purpose>{what the agent does}</agent-purpose>
<provider-and-models>{Anthropic / OpenAI / etc + which models for which steps}</provider-and-models>
<current-cost-per-task>{average $ per task / per invocation}</current-cost-per-task>
<volume>{tasks/day, monthly spend if known}</volume>
<tools-and-call-patterns>{what tools the agent uses, average calls per task}</tools-and-call-patterns>
<system-prompt-size>{tokens in system prompt + tools definition}</system-prompt-size>
<cache-strategy>{what's cached today — system prompt? tools? nothing?}</cache-strategy>
<context-pattern>{single rolling context / subagents / fresh per task}</context-pattern>
<known-quality-bar>{what would constitute a quality regression}</known-quality-bar>
<latency-tolerance>{realtime / async / batch-eligible}</latency-tolerance>
</input>

<output-format>
# Agent Cost Audit: [agent name]
## Current Cost Profile
Breakdown of cost-per-task. Where the money goes. Compare to similar workloads benchmark.
## Top 5 Cost Cuts (Ranked by ROI)
For each: specific change, expected savings %, quality risk, implementation effort, suggested first/second/third.
## Cache Optimization
Current cache hit rate (estimated). What to cache. What's preventing higher hit rate. Specific implementation.
## Model Selection Per-Step
For each step in the agent: current model, recommended model, quality risk, savings.
## Tool-Call Efficiency
Observed tool-call patterns. Wasteful patterns. Recommendations.
## Retry & Error-Cost Audit
Cost of retries today. Recommendations for backoff + circuit breaker.
## Batch API Eligibility
Which tasks can use batch pricing (50% off). How to migrate.
## Context Bloat Cost Impact
If using rolling context: estimated cost from bloat. Subagent or summarization recommendation.
## Projected Cost After Cuts
New cost-per-task estimate. Confidence range. Total monthly savings.
## What NOT to Cut
Things that look like savings but are actually quality regressions. Avoid these.
## Re-Audit Schedule
When to run this audit again. Triggers (volume change, model release, etc.).
## Key Takeaways
4-6 bullets — for the cost-optimization playbook.
</output-format>

<auto-intake>
If input incomplete: ask for purpose, provider/models, current cost per task, volume, tools, system prompt size, cache strategy, context pattern, quality bar, latency tolerance.
</auto-intake>

Now, audit and optimize the agent's costs:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<agent-purpose>Customer support triage agent. Reads incoming ticket, classifies into 6 categories, drafts a first response, routes to right human team. Already shipped, in production.</agent-purpose>
<provider-and-models>Anthropic. Currently Claude Opus 4 for everything (single agent, all steps).</provider-and-models>
<current-cost-per-task>~$0.18/ticket on average.</current-cost-per-task>
<volume>600 tickets/day = 18,000/month. Current monthly spend: ~$3,200.</volume>
<tools-and-call-patterns>4 tools: read_kb_article, search_past_tickets, lookup_customer_account, send_draft_to_queue. Average 3-5 tool calls per ticket.</tools-and-call-patterns>
<system-prompt-size>~12,000 tokens (system prompt + tool definitions + example outputs).</system-prompt-size>
<cache-strategy>Nothing cached today. Each ticket fires a fresh request.</cache-strategy>
<context-pattern>Single context per ticket, no subagents. Fresh per ticket.</context-pattern>
<known-quality-bar>Routing accuracy >92%. Draft accept-rate >80%. Hallucination <2%.</known-quality-bar>
<latency-tolerance>Realtime — agent must reply within 2 minutes of ticket arrival.</latency-tolerance>
🤖 Output

Agent Cost Audit: Customer Support Triage

Current Cost Profile

Per-ticket breakdown (estimated):

  • System prompt + tools (12K tokens, sent every request): ~$0.09/ticket
  • Tool-call rounds + ticket content (3-5 tool calls × ~3K tokens of context growth): ~$0.06/ticket
  • Final draft generation: ~$0.03/ticket
  • Total: ~$0.18/ticket

Current monthly spend: ~$3,200.

Benchmark comparison: similar B2B SaaS triage agents we've audited cost $0.04-0.08/ticket. You are paying 2-4× the benchmark. The waste is highly concentrated; we can identify it specifically.

Top 5 Cost Cuts (Ranked by ROI)

Cut #1 (HIGHEST ROI): Enable prompt caching on the system prompt + tools

Change: Move the 12K-token system prompt + tool definitions into the cached portion of the request (Anthropic prompt caching).

Expected savings: ~$0.075/ticket → ~42% reduction. Roughly $1,350/month saved.

Quality risk: Zero. Prompt cache is functionally identical.

Effort: ~2 hours engineering. Single API config change.

Why this works: Cached input tokens are 90% cheaper after first hit. Your 12K system prompt becomes ~$0.009 instead of $0.09 per ticket.

Cut #2: Switch from Opus 4 to Sonnet 4 for the agent main loop

Change: Use Sonnet 4 for the entire workflow. Opus 4 is overkill for triage classification + KB-grounded drafting.

Expected savings: Sonnet is ~5× cheaper than Opus. After cache + Sonnet: ~$0.025/ticket → ~86% reduction from baseline.

Quality risk: Medium. Need eval to validate routing accuracy stays >92% and draft quality stays >80% accept-rate. Run A/B for 1 week before cutover.

Effort: ~4 hours: config change + eval suite run + A/B comparison.

Why this works: Triage agents are well within Sonnet's competence. On KB-grounded tasks, hallucination rates are statistically indistinguishable between Sonnet and Opus when the KB content is in context.

Cut #3: Trim system prompt + tool definitions to ~6K tokens

Change: Audit the 12K-token system prompt. In our experience, ~50% is repetition, examples that don't pull weight, or instructions duplicated across tools. Cut to ~6K tokens.

Expected savings: Half the per-cache-miss cost (which only matters on first request after cache eviction). Marginal: ~$0.005/ticket. ~$90/month.

Quality risk: Low if done carefully. Test on 50 cases before deploy.

Effort: ~6 hours engineering + eval.

Note: Lower priority than cuts 1+2 but improves cache fit.

Cut #4: Reduce tool calls from 3-5 to 2-3 per ticket

Change: Audit which tool calls are necessary. Common waste patterns: agent calls search_past_tickets when not needed, looks up customer account multiple times. Add 'when not to use' clauses.

Expected savings: ~$0.015/ticket from reducing turn count. ~$270/month.

Quality risk: Low. Fewer redundant calls = same quality.

Effort: ~3 hours auditing + tool-description rewrite.

Why this works: Each tool call grows the context. 4 tool calls vs 2 = 2× more context tokens billed in subsequent turns.

Cut #5: Use Claude Haiku 4 for the routing-classification step only

Change: Split the agent into two steps: (1) classify category (Haiku), (2) draft response with KB lookup (Sonnet). Routing classification is well within Haiku's capability.

Expected savings: Marginal after cuts 1+2 are in place: ~$0.003/ticket. ~$54/month. Diminishing returns.

Quality risk: Low if eval'd. Higher implementation complexity.

Effort: ~8 hours: split architecture + eval.

Recommendation: Not worth doing immediately. Revisit after cuts 1+2+4 stabilize.

Cache Optimization

Current cache hit rate: 0% (no caching enabled).

After Cut #1 implemented: expected ~95% (cache TTL is 5 minutes; at 600 tickets/day = ~25/hour, cache stays warm during business hours).

What to cache:

  • System prompt (12K tokens — biggest win)
  • Tool definitions (~2K tokens)
  • Static context like company name, support tone guide

What NOT to cache:

  • Per-ticket content (changes every request)
  • KB lookup results (vary)
  • Customer account data (varies)

Implementation:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model='claude-sonnet-4',  # substitute your provider's current Sonnet model ID
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            'type': 'text',
            'text': SYSTEM_PROMPT,  # 12K tokens of stable instructions
            'cache_control': {'type': 'ephemeral'}
        }
    ],
    tools=TOOLS,  # tool definitions are cached along with the marked system block
    messages=[{'role': 'user', 'content': ticket_content}]
)

Verify cache hits by checking response.usage.cache_read_input_tokens and response.usage.cache_creation_input_tokens on each response.

Off-hours cache eviction: if you have 8-hour quiet windows, the cache will evict. Acceptable — first ticket after eviction pays full price, subsequent tickets cache-hit. Net is still ~95% hit rate over the day.
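That verification step can be rolled into a running hit-rate metric. A minimal sketch, assuming each entry mirrors the `usage` block the Messages API returns; the sample data is hypothetical:

```python
# Minimal sketch: effective cache hit rate from per-response usage blocks.
# Field names follow the Anthropic Messages API usage object; the sample
# `usages` list below is hypothetical.

def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of cacheable input tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    cacheable = read + created
    return read / cacheable if cacheable else 0.0

# One cold request (cache write) followed by 24 warm requests (cache reads):
usages = [{"cache_creation_input_tokens": 12_000}] + \
         [{"cache_read_input_tokens": 12_000}] * 24
print(f"{cache_hit_rate(usages):.0%}")  # → 96%
```

Track this daily; a drop below ~80% usually means the system prompt changed mid-day or volume dipped below the cache TTL.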

Model Selection Per-Step

| Step | Current | Recommended | Quality risk | Savings |
| --- | --- | --- | --- | --- |
| Initial classification | Opus 4 | Sonnet 4 | Low (validate) | ~80% step cost |
| KB article retrieval | Opus 4 | Sonnet 4 | Low | ~80% step cost |
| Customer lookup decision | Opus 4 | Sonnet 4 | Low | ~80% step cost |
| Draft generation | Opus 4 | Sonnet 4 | Low-Medium (test draft quality) | ~80% step cost |
| Tone calibration / final pass | Opus 4 | Sonnet 4 | Low | ~80% step cost |

Recommendation: single-model switch to Sonnet 4 for the entire agent. After 30 days of stable performance, consider Haiku for the classification step only as marginal optimization (Cut #5).

Tool-Call Efficiency

Observed pattern: 3-5 tool calls per ticket, average 4.

Wasteful patterns to fix:

1. Duplicate customer lookups — the agent sometimes calls lookup_customer_account twice in the same ticket. The first result is already in context; instruct the agent not to re-fetch it.

2. search_past_tickets called speculatively — agent calls it on tickets where it's not needed. Add to tool description: 'Do NOT use for: simple billing or login questions where customer-specific history isn't relevant.'

3. read_kb_article without query refinement — agent occasionally reads multiple KB articles to find one that fits. Improve KB tool to do semantic search and return best match.

Target after fixes: 2-3 tool calls average, saving ~$0.015/ticket in context growth.
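One concrete fix for pattern #2 is to encode the "when not to use" guidance directly in the tool definition. A sketch using the Anthropic tool-use schema; the description wording and parameters are illustrative, not the agent's actual definition:

```python
# Illustrative revision of the search_past_tickets tool definition with an
# explicit "when NOT to use" clause in the description (Anthropic tool schema).
SEARCH_PAST_TICKETS = {
    "name": "search_past_tickets",
    "description": (
        "Search this customer's previous tickets for relevant history. "
        "Use ONLY when the issue likely recurs (repeated bug reports, "
        "ongoing disputes). Do NOT use for simple billing or login "
        "questions where customer-specific history isn't relevant."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search keywords"},
            "max_results": {
                "type": "integer",
                "description": "Cap results to limit context growth",
            },
        },
        "required": ["query"],
    },
}
```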

Retry & Error-Cost Audit

Estimated current retry cost: ~3% of tickets retry due to rate limits (you're hitting Anthropic limits at peak hours). Each retry approximately doubles that ticket's cost.

Recommendation:

  • Implement exponential backoff (most stacks default to immediate retry)
  • Add circuit breaker: if 3 consecutive rate limits, pause for 60s
  • Monitor rate-limit headers; pre-empt by slowing down before hitting limits

Expected savings: ~$30-50/month at current volume. Not huge, but important at scale.
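The backoff and circuit-breaker recommendations above can be combined in a small wrapper. A sketch, not production code: `RateLimitError` stands in for your SDK's rate-limit exception (e.g. `anthropic.RateLimitError`), and `send` is any zero-argument callable that issues the request:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's rate-limit exception (e.g. anthropic.RateLimitError)."""

def call_with_backoff(send, max_retries=5, breaker_threshold=3, pause_s=60):
    """Exponential backoff with jitter, plus a simple circuit breaker:
    after `breaker_threshold` consecutive rate limits, pause hard."""
    consecutive = 0
    for attempt in range(max_retries):
        try:
            return send()
        except RateLimitError:
            consecutive += 1
            if consecutive >= breaker_threshold:
                time.sleep(pause_s)  # circuit breaker: stop hammering the API
                consecutive = 0
            else:
                time.sleep(2 ** attempt + random.random())  # ~1-2s, 2-3s, 4-5s, ...
    raise RuntimeError("rate-limited on every attempt; giving up")
```

In production you would also watch the rate-limit response headers and slow down before the first 429 arrives.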

Batch API Eligibility

Latency tolerance is realtime (2-minute budget), so the batch API is NOT eligible for production triage.

However: any offline workflows you have (reporting, data labeling, evaluation) absolutely should use batch API for 50% discount. Examples:

  • Eval runs on 200 historical tickets (currently realtime)
  • Bulk KB-article re-summarization
  • Backfilling missing categorizations

These could use Anthropic's Message Batches API for ~50% discount with up to 24h SLA.
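Migration is mostly a matter of reshaping each realtime call into a batch request. A sketch of the request-building step, with hypothetical names (`historical_tickets`, `SYSTEM_PROMPT`); the actual submission call is shown commented because it requires an API key:

```python
# Sketch: reshape an offline eval run into Message Batches requests
# (~50% discount, up to 24h turnaround). Names are illustrative.
SYSTEM_PROMPT = "<triage system prompt goes here>"  # placeholder

def build_batch_requests(historical_tickets, model="claude-sonnet-4"):
    return [
        {
            "custom_id": ticket_id,  # used to match results back to tickets
            "params": {
                "model": model,
                "max_tokens": 1024,
                "system": SYSTEM_PROMPT,
                "messages": [{"role": "user", "content": ticket_text}],
            },
        }
        for ticket_id, ticket_text in historical_tickets
    ]

# import anthropic
# client = anthropic.Anthropic()
# batch = client.messages.batches.create(
#     requests=build_batch_requests(historical_tickets))
# Poll batch.processing_status until it reaches "ended", then fetch results.
```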

Context Bloat Cost Impact

Current context pattern: fresh per ticket, no rolling context. This is correct — you don't have a context-bloat problem.

If you ever introduce conversational continuity (multi-turn tickets with the same customer), reconsider the architecture: use subagents per ticket-turn rather than a single growing context. A ticket running past 5 turns on one rolling context would quickly exceed ~$0.50/ticket.
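The back-of-envelope arithmetic behind that warning: a rolling context re-bills all prior turns as input tokens on every new turn, so cumulative input cost grows roughly quadratically with turn count. The token count per turn below is illustrative:

```python
# Why rolling contexts get expensive: turn t re-sends all prior turns,
# so per-turn input cost grows linearly and cumulative cost quadratically.
# 1,000 tokens/turn is an illustrative figure.

def cumulative_input_tokens(turns: int, tokens_per_turn: int = 1_000) -> int:
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

for turns in (5, 10, 50):
    print(f"{turns:>2} turns -> {cumulative_input_tokens(turns):,} input tokens billed")
# Turn 50 alone re-bills 50x the per-turn content, i.e. ~5x what turn 10 costs.
```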

Projected Cost After Cuts

After Cut #1 (caching): ~$0.105/ticket → ~$1,890/month. Save ~$1,310/month.

After Cut #1 + Cut #2 (caching + Sonnet): ~$0.025/ticket → ~$450/month. Save $2,750/month (86% reduction).

After Cut #1 + Cut #2 + Cut #4 (also tool-call efficiency): ~$0.020/ticket → ~$360/month. Save $2,840/month.

Cut #3 + Cut #5 are smaller wins; do after stabilizing the big three.

Confidence range: $400-550/month after cuts 1+2+4. The range covers cache-eviction patterns and quality-tuning iterations.

What NOT to Cut

1. Don't cut the KB lookup tool to save tool-call costs. That's where your hallucination defense lives; cutting it would push your hallucination rate above the 2% bar.

2. Don't cut down to Haiku for draft generation. Draft quality measurably drops. Sonnet is the right fit.

3. Don't disable retries entirely to save retry cost. Better backoff yes; no retry will spike your error rate.

4. Don't aggressively trim the system prompt of safety/tone guidance. Your draft-quality bar depends on consistent tone enforcement.

5. Don't move to a non-Anthropic provider for a small cost difference. Ecosystem fit (your eval pipelines, your team's familiarity) matters more than a ~10% cost delta.

Re-Audit Schedule

Immediately after model release: when Anthropic releases Sonnet 4.5 or Opus 5, re-evaluate the cost-quality frontier. New models often shift the optimal choice.

Monthly: check cache hit rate. If trending down, investigate why (system prompt churn? volume drop?).

Quarterly: full re-audit. Tool-call patterns drift; system prompt accumulates cruft.

Volume change >2×: re-audit immediately. Cost-optimization changes meaningfully at different scales (batch API becomes viable, dedicated capacity reservations become viable).

Key Takeaways

  • You're paying 2-4× the benchmark cost. The waste is concentrated, not spread. Cuts 1+2 alone reduce cost by 86%.
  • Prompt caching first. Always. It's free quality-wise and saves ~42% immediately. 2 hours of eng work for $1,350/month.
  • Switch to Sonnet 4 for triage workloads. Run a week-long A/B against current Opus to validate routing accuracy + draft quality. The savings vastly outweigh the implementation effort.
  • Tool-call efficiency is real money at scale. 4 calls vs 2 doubles your context growth. Audit tool descriptions for 'when not to use' clauses.
  • Don't optimize what isn't measured. Wire up cost-per-task tracking before/after each cut. Numbers move; conviction comes from data.
  • Re-audit on every model release. A new Sonnet release in 6 months may shift your optimal again.
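The "don't optimize what isn't measured" bullet is easy to operationalize from the `usage` block each response returns. A minimal sketch; the prices are hypothetical placeholders in $/million tokens, so substitute your provider's current rate card:

```python
# Minimal cost-per-task tracker from a response's usage block.
# PRICES are hypothetical placeholders ($/Mtok); use your real rate card.
PRICES = {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30}

def cost_per_task(usage: dict) -> float:
    return (
        usage.get("input_tokens", 0) * PRICES["input"]
        + usage.get("output_tokens", 0) * PRICES["output"]
        + usage.get("cache_creation_input_tokens", 0) * PRICES["cache_write"]
        + usage.get("cache_read_input_tokens", 0) * PRICES["cache_read"]
    ) / 1_000_000

usage = {"input_tokens": 3_000, "output_tokens": 400,
         "cache_read_input_tokens": 12_000}
print(f"${cost_per_task(usage):.4f}")  # a cached ticket lands around two cents
```

Log this per ticket and compare the distribution before and after each cut.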

Common use cases

  • Engineer whose agent spend just hit the founder's notice ('why is our LLM bill 2× last month')
  • Builder scaling an agent from 1K to 100K invocations/day and worried about cost
  • Solo dev where every $0.05 saved per call matters at their volume
  • Team that hasn't audited their agent's costs in 3+ months
  • Builder evaluating switching providers (Anthropic ↔ OpenAI ↔ Gemini) — wants honest analysis

Best AI model for this

Claude Opus 4. Cost optimization requires reasoning about cost/quality tradeoffs across many small decisions — exactly Claude's strength. ChatGPT GPT-5 second-best.

Pro tips

  • Prompt caching is the biggest lever in 2026. If your cache hit rate is <80%, you're leaving 50%+ savings on the table.
  • Model selection should be per-step, not per-agent. Use Opus for genuine reasoning steps; Sonnet or Haiku for extraction, classification, formatting.
  • Streaming costs the same as non-streaming. Don't switch to streaming for cost reasons; switch for user experience.
  • Tool-call cost is mostly the LLM cost, not the API call. Each tool call adds context tokens to the next turn.
  • Retries on rate-limit are expensive. Use exponential backoff + circuit breaker, not retry-immediately.
  • Context bloat is a cost problem disguised as a reliability problem. By turn 50, you're paying 5× more per turn for inference.
  • Batch API pricing is 50% off for non-realtime workflows. If your task can wait 24h, batch it. Most teams don't even check.

Customization tips

  • List your provider + models specifically. Cost optimization differs significantly between Anthropic, OpenAI, and Google due to different pricing curves and feature sets (prompt caching, batch API).
  • Estimate your current cost-per-task as accurately as possible. The audit's recommendations are calibrated to YOUR current spend distribution.
  • Specify volume — daily and monthly. Some optimizations (batch API, dedicated capacity) only make sense above certain volumes.
  • Be honest about your quality bar. 'It needs to work' is not a bar; '92% routing accuracy, 80% draft accept-rate' is. The audit recommends cuts that protect explicit quality bars.
  • List your tools and average tool-call count. Tool-call efficiency is one of the top 3 cost levers and depends on knowing the current pattern.
  • Use the Pre-Launch Cost Estimate Mode if you haven't shipped yet — it identifies design choices that lock in cost waste before you hit volume.

Variants

Pre-Launch Cost Estimate Mode

For agents not yet shipped — projects expected cost-per-task and identifies design choices that lock in waste.

Production Audit Mode

For agents already running — audits real cost data and identifies the highest-ROI cuts.

Provider-Switch Mode

For teams considering switching LLM providers — apples-to-apples cost + quality comparison for the specific workload.

Volume-Scale Mode

For agents scaling 10× or 100× — projects cost at scale and identifies architecture changes needed before scaling.

Frequently asked questions

How do I use the Agent Cost Optimizer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Agent Cost Optimizer?

Claude Opus 4. Cost optimization requires reasoning about cost/quality tradeoffs across many small decisions — exactly Claude's strength. ChatGPT GPT-5 second-best.

Can I customize the Agent Cost Optimizer prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: prompt caching (a cache hit rate under 80% leaves 50%+ savings on the table) and per-step model selection (Opus for genuine reasoning steps; Sonnet or Haiku for extraction, classification, and formatting).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals