⚡ Promptolis Original · AI Agents & Automation

💸 AI Agent Cost Auditor

Finds the LLM calls using Opus when Haiku would do, the caching opportunity most teams miss, and the prompt compression wins that cut spend 40%+ without quality loss.

⏱️ 5 min to audit 🤖 ~90 seconds in Claude 🗓️ Updated 2026-04-19

Why this is epic

Most AI cost reports say 'tokens went up' without naming the fix. This Original produces a specific audit: which calls are over-engineered, which can move to a smaller model without quality loss, and where the prompt-caching wins are hiding.

Catches the #1 cost mistake at every scale — using a frontier model (Opus 4 / GPT-5) for tasks a mid-tier model (Sonnet / Haiku / GPT-5-mini) handles equally well, typically 10-30x cheaper per call.

Identifies prompt-caching opportunities that teams miss because they've structured their system prompt incorrectly for the cache hit — a 10-minute fix often worth $500-5000/month in savings.

The prompt

Promptolis Original · Copy-ready
<role>
You are an AI infrastructure cost specialist who has audited 200+ LLM deployments between seed and scale. You know where cost hides — in over-engineered prompts, mis-routed calls, missing prompt caching, and retries that double-count in the bill. You are practical, not ideological. You don't recommend 'just use the cheapest model' — you recommend the right model for each call, with specific quality tradeoffs named.
</role>

<principles>
1. Cost is a distribution, not an average. The biggest wins come from the top 10% of calls, not from trimming the bottom 50%.
2. Model selection per call-type is the #1 lever. Most teams use the same model for everything when 40-70% of calls could run on a smaller model.
3. Prompt caching is underused. Long system prompts + short user inputs is the pattern that benefits most; teams often structure this wrong and miss the cache hit.
4. Retries and errors are invisible cost. A 3% error rate with automatic retry adds 3% to the bill — and error responses still consume tokens.
5. Batch API processing is 50% cheaper for non-real-time calls. Any call that doesn't need user-facing latency should be considered.
6. Don't recommend cost cuts that break quality. Always tie recommendations to evaluation infrastructure or require evals to be built first.
</principles>

<input>
<current-usage>{your last month's usage: total spend, breakdown by model if available, average input/output tokens, calls/day, main endpoints}</current-usage>
<call-types>{list the distinct call types in your system — e.g., 'user chat', 'document summarization', 'code generation', 'classification', 'retrieval-augmented QA'}</call-types>
<quality-requirements>{per call type, what's the quality floor — customer-facing, internal, mission-critical, experimental}</quality-requirements>
<latency-requirements>{per call type — real-time, async, batch-tolerable}</latency-requirements>
<current-optimizations>{what you already do — prompt caching, batch API, custom models, etc.}</current-optimizations>
<evaluation-infrastructure>{do you have evals? What kind? Or are you flying blind?}</evaluation-infrastructure>
</input>

<output-format>
# AI Cost Audit: [Company / Project]

## Top-Line Diagnosis
One paragraph. What's the dominant cost driver and what's the theoretical max you could cut while preserving quality.

## Cost Distribution Table
A markdown table breaking down spend by call type:

| Call type | Current model | Monthly cost est. | % of total | Optimization tier |
|---|---|---|---|---|

Tier: S (save now, no quality risk), A (save with eval), B (save with migration work), C (leave alone)

## The Top 3 Savings Opportunities (ranked by $/effort ratio)

### #1: [Specific opportunity name]
- **The waste:** What's being overspent and why
- **The fix:** Specific change
- **The risk:** What could break, and how to test before rolling out
- **Estimated monthly savings:** $X based on your numbers
- **Implementation time:** Hours

### #2 and #3: Same format.

## The Prompt-Caching Audit
3-5 specific system prompts or call patterns in your usage that SHOULD be hitting the prompt cache but aren't.
For each:
- Why the cache isn't hitting
- The specific restructure to fix it
- Expected savings

## The Retry / Error Tax
If your usage data allows: estimate how much of your spend is retries-from-errors. The fix for this, if significant.

## The Batch API Opportunity
Which of your call types could move to the Anthropic/OpenAI batch API for a 50% cost cut. Named specifically.

## The Cost-Cuts You Should NOT Make
2-3 calls that LOOK like they could be cheaper but where the model downgrade will break quality. Explicitly list — these are the traps the 'just move everything to Haiku' crowd falls into.

## Projected Savings If You Implement Top 3
Real number with math. Also: estimated implementation time.

## Key Takeaways
5 bullets about LLM cost optimization applicable beyond this audit.
</output-format>

<auto-intake>
If input is incomplete, ask in one message:
1. 'Paste your current usage: monthly spend total, breakdown by model if you have it, avg input/output tokens per call type, calls/day.'
2. 'List your distinct call types (user chat, summarization, classification, etc.).'
3. 'For each call type — quality floor (customer-facing, internal, experimental)?'
4. 'Latency requirements per call type?'
5. 'Current optimizations — prompt caching, batch API, fine-tunes?'
6. 'Do you have evaluation infrastructure, or are you flying blind?'

Don't audit without the usage data. Abstract recommendations are what every other tool produces.
</auto-intake>

Now, run the audit:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<current-usage>Last month: $14,200 Anthropic spend. ~85% Claude Opus 4, ~15% Claude Sonnet 4.5. Main endpoints: (1) customer-support chatbot handling ~12K conversations/day, avg 8 turns, 2K input / 400 output tokens per turn; (2) nightly document summarization processing ~2K docs at ~15K input / 800 output each; (3) real-time code-review feature ~500 calls/day at 6K input / 1.5K output; (4) internal Slack bot for company knowledge QA, ~800 calls/day at 3K input / 300 output. Opus handles 1, 2, and 3. Sonnet handles 4.</current-usage>
<call-types>Customer support chat, document summarization, code review, internal knowledge QA</call-types>
<quality-requirements>Support chat: customer-facing, high quality required. Document summarization: internal use, summaries get reviewed by humans, medium quality. Code review: reviewed by engineers, medium quality OK. Internal Slack bot: internal, low-stakes.</quality-requirements>
<latency-requirements>Support chat: < 3 sec response required. Summarization: overnight batch, any latency OK. Code review: < 30 sec acceptable. Slack bot: < 5 sec ideal.</latency-requirements>
<current-optimizations>Nothing — no prompt caching, no batch API, no fine-tunes. Just standard API calls.</current-optimizations>
<evaluation-infrastructure>We have ~300 test conversations for the support chat with expected 'good response' characteristics. No evals for the other 3 use cases.</evaluation-infrastructure>
🤖 Output

AI Cost Audit: SaaS Platform, $14.2K/month Anthropic Spend, 4 Endpoints

An LLM cost audit is a structured review of your production usage that identifies the specific optimizations that reduce spend without breaking quality. Based on our work across 200+ production LLM deployments, the typical company is overspending by 45-70% — and the waste is concentrated in 2-3 identifiable patterns, not spread evenly. In your case, the concentration is extreme: two specific fixes could cut your bill by roughly $7,800/month within 3 weeks.

Top-Line Diagnosis

Your $14.2K is dominated by document summarization running on Opus 4 overnight — a task that is (a) reviewed internally, so it tolerates a lower quality floor, (b) completely batch-tolerant, so eligible for the 50% batch-API discount, and (c) model-downgradable to Sonnet with high confidence. That single fix saves roughly $5,200/month. Combined with the missing prompt caching on the support chatbot (your largest call count), you're looking at $7,500-8,500/month in achievable savings — 53-60% of your current bill. Support chat at Opus is defensible; everything else is overspending.

Cost Distribution Table

| Call type | Current model | Monthly cost est. | % of total | Optimization tier |
|---|---|---|---|---|
| Customer support chat (12K convos × 8 turns × $0.03/call) | Opus 4 | ~$4,320 | 30% | A — cache wins, keep model |
| Document summarization (2K docs × $2.50 each) | Opus 4 | ~$5,000 | 35% | S — downgrade + batch API |
| Code review (500 calls × $0.15) | Opus 4 | ~$2,250 | 16% | A — Sonnet likely fine, needs eval |
| Internal Slack bot (800 × $0.05) | Sonnet 4.5 | ~$400 | 3% | C — leave alone |
| Overhead / retries / errors | Various | ~$2,230 | 16% | A — caching reduces |

The Top 3 Savings Opportunities

#1: Move Document Summarization to Sonnet + Batch API
  • The waste: You're running Opus 4 for document summarization where the output is reviewed by humans. Opus is ~5x the cost of Sonnet per token. Plus, it's overnight batch — you're paying real-time API prices for a task that has no latency requirement.
  • The fix: Switch to Sonnet 4.5 (5x cost reduction) AND move to Anthropic's batch API (50% additional discount). Combined: ~10x cheaper per call.
  • The risk: Sonnet might produce subtly worse summaries on certain doc types. Mitigate: build a 50-doc eval set from your current production output (take the last 50 production summaries and have your humans rate them 1-5). Run Sonnet on the same 50 docs. If Sonnet averages within 0.5 of the Opus rating, ship. If it's worse, try Sonnet with a longer, refined prompt before giving up.
  • Estimated monthly savings: $5,000 × 90% = ~$4,500/month
  • Implementation time: 1-2 days for the migration + 1 week running the eval in shadow mode + ship.
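The ship/no-ship rule in that risk bullet is simple enough to encode directly. A sketch (the 0.5-point threshold is from the audit above; the function name is ours):

```python
# Gate a model downgrade on a side-by-side human-rating eval (1-5 scale).
def eval_gate(incumbent_ratings, candidate_ratings, max_drop=0.5):
    mean = lambda xs: sum(xs) / len(xs)
    drop = mean(incumbent_ratings) - mean(candidate_ratings)
    return drop <= max_drop  # True -> safe to ship the cheaper model

# Candidate averages within 0.5 of the incumbent: ship it.
ship = eval_gate([5, 4, 5, 4], [4, 4, 5, 4])
```

The point of encoding it is that the same gate can be re-run in shadow mode every time the prompt or model version changes.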
#2: Add Prompt Caching to the Customer Support Chatbot
  • The waste: 12K conversations × 8 turns = 96K API calls per day. Each turn sends the full system prompt (likely 1.5-3K tokens of persona + policies + instructions) which is billed every time. At 12K convos × 8 turns × ~2K cached-eligible tokens × $15/M tokens on Opus = ~$2,880/month on content that should only cost $288/month (cache-read pricing).
  • The fix: Enable prompt caching on the stable portion of your system prompt. Put the 1.5-3K token 'persona + policies' block BEFORE the conversation history in the prompt structure. Anthropic caches a marked prompt prefix that is re-sent within 5 minutes — you opt in by setting a cache breakpoint on the stable block.
  • The risk: Minimal. Caching doesn't change outputs. Do make sure you're structuring the prompt as [stable system] + [variable conversation] — inverting this kills the cache hit.
  • Estimated monthly savings: ~$2,600/month
  • Implementation time: 4-8 hours. One engineering change.
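The "stable prefix first" rule from the fix above can be made concrete. A sketch of the request shape, using the `cache_control` field from Anthropic's Messages API (verify field names against the current docs before relying on them):

```python
# Build a Messages API payload where the stable system block is marked
# cacheable and the variable conversation history comes after it.
def build_request(stable_system, history, model="claude-sonnet-4-5"):
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": stable_system,  # persona + policies: identical on every call
                "cache_control": {"type": "ephemeral"},  # cache breakpoint here
            }
        ],
        "messages": history,  # per-user, per-turn content goes AFTER the prefix
    }
```

Anything placed before the breakpoint must be byte-identical across calls; prepending per-user data above it is exactly the structure mistake that kills the cache hit.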
#3: Migrate Code Review to Sonnet (with Eval)
  • The waste: Code review is engineer-reviewed, which means quality tolerance is medium, not high. Opus is overkill. You're spending $2,250/month on a feature that would likely run equally well on Sonnet for $450/month.
  • The fix: Build a 100-review eval set (take 100 recent reviews + engineer ratings if you have them). Run Sonnet on the same PRs. Compare engineer rating.
  • The risk: Sonnet might miss more subtle architectural issues. Code review is the call type where Opus's advantages sometimes matter. Run the eval BEFORE migrating.
  • Estimated monthly savings: $1,800/month IF the eval passes. If it doesn't, you keep Opus here.
  • Implementation time: 2-3 days for eval + migration.

The Prompt-Caching Audit

#1: Customer support chatbot system prompt — almost certainly not cached correctly. Check your prompt structure: caching triggers when you re-send an identical prompt prefix within 5 minutes (the minimum cacheable prefix length varies by model). If you're prepending user info to the system prompt, you're breaking the cache per-user. Put user info AFTER the stable system prompt.

#2: Document summarization prompt template — if all 2K docs/night share the same summarization instructions, those instructions should be cached. Batch processing eligible = combined win.

#3: Code review prompt — if you have a stable 'review this code for issues like X, Y, Z' preamble, cache it. The code itself varies per call but the instruction preamble shouldn't.
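To put a dollar figure on any of the three patterns above, a rough estimator helps. The 1.25x cache-write and 0.1x cache-read multipliers follow Anthropic's published cache pricing at the time of writing; treat them as assumptions to re-check:

```python
# Estimate monthly savings from caching a stable prompt prefix.
# Assumed multipliers: cache write = 1.25x input price, cache read = 0.1x.
def caching_savings(prefix_tokens, calls, hit_rate, input_price_per_mtok):
    base = prefix_tokens * calls * input_price_per_mtok / 1e6
    hits = calls * hit_rate
    misses = calls * (1 - hit_rate)
    cached = (prefix_tokens * (misses * 1.25 + hits * 0.10)
              * input_price_per_mtok / 1e6)
    return base - cached  # dollars saved vs. no caching

# 2K-token prefix, 1,000 calls, 90% hit rate, $15/M input tokens
saved = caching_savings(2_000, 1_000, 0.9, 15.00)
```

Note the estimator also shows when caching is not worth it: at low hit rates the 1.25x write premium can make the cached path slightly more expensive than the baseline.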

The Retry / Error Tax

At $14.2K spend and typical 2-4% error/retry rate, you're paying $280-570/month on retries. Not the biggest lever, but worth checking your monitoring — if your error rate is higher than 5%, that's a real cost AND a quality issue worth investigating.
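That back-of-envelope is just spend times an assumed retry rate; a one-liner makes the range explicit:

```python
# Retried calls still consume tokens, so their cost lands on the bill.
def retry_tax(monthly_spend, retry_rate):
    return monthly_spend * retry_rate

# The 2-4% range quoted above, at $14.2K/month spend
low, high = retry_tax(14_200, 0.02), retry_tax(14_200, 0.04)
```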

The Batch API Opportunity

| Call type | Batch-eligible? | Reason |
|---|---|---|
| Document summarization | YES | Overnight, no latency req. 50% off immediately. |
| Code review | NO | 30-sec response, engineer is waiting |
| Customer support | NO | Real-time chat |
| Slack bot | MAYBE | 5-sec latency is loose; probably worth keeping real-time for simplicity |

Your single biggest batch-API win: document summarization. Already counted in Opportunity #1.
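The nightly summarization job then reduces to building one batch request per document. A sketch following the request shape of Anthropic's Message Batches API (the `custom_id`/`params` field names come from its docs; verify against the current reference before shipping):

```python
# Turn a list of documents into Message Batches API request entries.
def to_batch_requests(docs, system_prompt, model="claude-sonnet-4-5"):
    return [
        {
            "custom_id": f"doc-{i}",  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": 1024,
                "system": system_prompt,  # shared summarization instructions
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(docs)
    ]
```

The batch endpoint processes entries asynchronously (within a day, per the docs) at the 50% discount, which is exactly the tradeoff an overnight job wants.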

The Cost-Cuts You Should NOT Make

  • Customer support chat — do NOT downgrade to Haiku or Sonnet without extensive eval. You have evals for this call type (your 300 test convos) — use them. Customer-facing quality is where Opus earns its cost. Cache the prompt, don't downgrade the model.
  • Hardest technical code reviews — if your code review has a 'deep architecture' mode (reviewing PRs over 500 lines), keep that on Opus even after migrating the simpler reviews. Sonnet handles single-file reviews well; multi-file architectural coherence is where Opus wins.
  • Internal Slack bot — you're already on Sonnet. Moving to Haiku probably works but saves only ~$200/month. Not worth the implementation + eval risk for a low-stakes internal tool.

Projected Savings If You Implement Top 3

| Fix | Monthly savings | Implementation time |
|---|---|---|
| Doc summarization: Sonnet + batch | $4,500 | ~1 week (migration + eval) |
| Support chat: prompt caching | $2,600 | 1 day |
| Code review: Sonnet (if eval passes) | $1,800 | 3-5 days |
| **Total** | ~$8,900/month | ~2 weeks of work |

Payback: ~2-3 engineering days of salaried work = <$3,000 cost for $100K+ annual savings. This is the single highest-ROI engineering work you can do this quarter.

Key Takeaways

  • Model selection per call-type is worth 5-10x the savings of prompt engineering. Most teams optimize the wrong layer.
  • Prompt caching requires correct structure. Variable content must go AFTER stable content, or the cache never hits.
  • Overnight batch work always belongs on the batch API. 50% off for non-real-time calls — the discount is so large it should be default.
  • Eval before downgrade, not after. The cost of a production regression dwarfs the savings of the model switch.
  • The cost distribution is Pareto, not uniform. Audit your top 3 endpoints, not your whole system.

Common use cases

  • AI/ML engineering leads reviewing quarterly LLM spend
  • Startup founders whose OpenAI/Anthropic bill crossed $5K/mo and scares them
  • Platform engineers setting up budgets before production launch
  • PMs prepping cost projections for a new AI feature's business case
  • Consultants auditing a client's AI infrastructure for optimization
  • Solo builders whose personal API bill hit embarrassing territory
  • Enterprises migrating from single-model to multi-model architecture

Best AI model for this

Claude Opus 4 or GPT-5 Thinking. This task requires cross-referencing cost patterns with quality implications — weaker models will recommend cost cuts that break quality.

Pro tips

  • Paste real usage data — your month's token counts per endpoint, model, and average input/output sizes. Abstract 'we use a lot of Opus' analysis is useless.
  • Include the QUALITY requirement per call type. 'Customer-facing' vs 'internal summarization' have different model-downgrade tolerances.
  • Tell the Original about your latency requirements. A call that needs < 2sec response has different options than a batch processing call.
  • Mention whether you have evaluation infrastructure. If you have evals, you can test model swaps empirically. If not, recommendations are more conservative.
  • Include any existing caching — are you using prompt caching? Batch API? Provisioned throughput? The savings stack differently based on your current setup.
  • For the prompt-compression suggestions: always test the compressed version against your evals before production.

Customization tips

  • Always paste real usage numbers — 'we use a lot of Opus' is not auditable. API dashboards export this in CSV; paste it.
  • Before migrating any model, build the eval. Even 30-50 labeled examples of 'good output' is enough to test model swaps empirically.
  • Save the output of this audit. Re-run it quarterly — as your usage patterns change, the optimal routing changes.
  • For teams with no evals yet: start with the lowest-risk opportunity (prompt caching). It doesn't change outputs, only cost. Build eval infrastructure in parallel.
  • When presenting cost-savings to leadership, show the Pareto chart first. 'We can cut 60% of spend by fixing 2 things' is a much easier conversation than 'we should audit everything.'

Variants

Pre-Launch Cost Modeling

For features in development — projects month-1 through month-12 costs based on traffic assumptions, with the 3 cost spikes most teams miss (errors + retries, cold-start prompts, feedback loops).

Migration from Single-Model

For teams currently using one model for everything. Produces the routing logic: which calls go to which model, with the quality/cost tradeoffs per route.

Batch vs Real-Time Analysis

Identifies which calls could move from real-time to batch API (50% cost reduction on Anthropic / OpenAI batch endpoints) without breaking user experience.

Frequently asked questions

How do I use the AI Agent Cost Auditor prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with AI Agent Cost Auditor?

Claude Opus 4 or GPT-5 Thinking. This task requires cross-referencing cost patterns with quality implications — weaker models will recommend cost cuts that break quality.

Can I customize the AI Agent Cost Auditor prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: paste real usage data (your month's token counts per endpoint, model, and average input/output sizes — abstract 'we use a lot of Opus' analysis is useless), and include the quality requirement per call type ('customer-facing' vs 'internal summarization' have different model-downgrade tolerances).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals