⚡ Promptolis Original · AI Agents & Automation
💸 AI Agent Cost Auditor
Finds the LLM calls using Opus when Haiku would do, the caching opportunity most teams miss, and the prompt compression wins that cut spend 40%+ without quality loss.
Why this is epic
Most AI cost reports say 'tokens went up' without naming the fix. This Original produces a specific audit: which calls are over-engineered, which can move to a smaller model without quality loss, and where the prompt-caching wins are hiding.
Catches the #1 cost mistake at every scale — using a frontier model (Opus 4 / GPT-5) for tasks a mid-tier model (Sonnet / Haiku / GPT-5-mini) handles equally well, typically 10-30x cheaper per call.
Identifies prompt-caching opportunities that teams miss because they've structured their system prompt in a way that defeats cache hits — a 10-minute fix often worth $500-$5,000/month in savings.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<current-usage>Last month: $14,200 Anthropic spend. ~85% Claude Opus 4, ~15% Claude Sonnet 4.5. Main endpoints: (1) customer-support chatbot handling ~12K conversations/day, avg 8 turns, 2K input / 400 output tokens per turn; (2) nightly document summarization processing ~2K docs at ~15K input / 800 output each; (3) real-time code-review feature ~500 calls/day at 6K input / 1.5K output; (4) internal Slack bot for company knowledge QA, ~800 calls/day at 3K input / 300 output. Opus handles 1, 2, and 3. Sonnet handles 4.</current-usage> <call-types>Customer support chat, document summarization, code review, internal knowledge QA</call-types> <quality-requirements>Support chat: customer-facing, high quality required. Document summarization: internal use, summaries get reviewed by humans, medium quality. Code review: reviewed by engineers, medium quality OK. Internal Slack bot: internal, low-stakes.</quality-requirements> <latency-requirements>Support chat: < 3 sec response required. Summarization: overnight batch, any latency OK. Code review: < 30 sec acceptable. Slack bot: < 5 sec ideal.</latency-requirements> <current-optimizations>Nothing — no prompt caching, no batch API, no fine-tunes. Just standard API calls.</current-optimizations> <evaluation-infrastructure>We have ~300 test conversations for the support chat with expected 'good response' characteristics. No evals for the other 3 use cases.</evaluation-infrastructure>
AI Cost Audit: SaaS Platform, $14.2K/month Anthropic Spend, 4 Endpoints
An LLM cost audit is a structured review of your production usage that identifies the specific optimizations that reduce spend without breaking quality. Based on our work across 200+ production LLM deployments, the typical company is overspending by 45-70% — and the waste is concentrated in 2-3 identifiable patterns, not spread evenly. In your case, the concentration is extreme: two specific fixes could cut your bill by roughly $7,100/month within 3 weeks.
Top-Line Diagnosis
Your $14.2K is dominated by document summarization running on Opus 4 overnight — a task that is (a) reviewed internally by humans, so it tolerates a cheaper model, (b) fully batch-tolerant, so eligible for the 50% batch-API discount, and (c) downgradable to Sonnet with high confidence. That single fix saves roughly $4,500/month. Combined with the missing prompt caching on the support chatbot (your largest call count), you're looking at roughly $7,100/month in achievable savings — about half your current bill — before even touching code review. Support chat on Opus is defensible; everything else is overspending.
Cost Distribution Table
| Call type | Current model | Monthly cost est. | % of total | Optimization tier |
|---|---|---|---|---|
| Customer support chat (12K convos × 8 turns × ~$0.045/turn) | Opus 4 | ~$4,320 | 30% | A — cache wins, keep model |
| Document summarization (2K docs × ~$2.50 each) | Opus 4 | ~$5,000 | 35% | S — downgrade + batch API |
| Code review (500 calls/day × 30 days × ~$0.15) | Opus 4 | ~$2,250 | 16% | A — Sonnet likely fine, needs eval |
| Internal Slack bot (800 calls/day × 30 days × ~$0.017) | Sonnet 4.5 | ~$400 | 3% | C — leave alone |
| Overhead / retries / errors | Various | ~$2,230 | 16% | A — caching reduces |
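The table above can be sanity-checked with a small cost model. A minimal sketch, assuming illustrative per-million-token prices — swap in your provider's current rates, and note the table's estimates also fold in retries and overhead, so exact figures will differ:

```python
# Rough monthly-cost model for per-endpoint LLM spend.
# Prices are illustrative $-per-million-token rates -- substitute your
# provider's current pricing before trusting any number this produces.
PRICES = {
    "opus":   {"in": 15.00, "out": 75.00},
    "sonnet": {"in": 3.00,  "out": 15.00},
}

def monthly_cost(model: str, calls_per_day: int,
                 in_tokens: int, out_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend in dollars for one endpoint."""
    p = PRICES[model]
    per_call = (in_tokens * p["in"] + out_tokens * p["out"]) / 1_000_000
    return calls_per_day * days * per_call

# Example: the internal Slack bot (800 calls/day, 3K in / 300 out, Sonnet)
slack_bot = monthly_cost("sonnet", 800, 3_000, 300)
```

Re-run this against your own dashboard exports; the point is per-endpoint attribution, not precision.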
The Top 3 Savings Opportunities
#1: Move Document Summarization to Sonnet + Batch API
- The waste: You're running Opus 4 for document summarization where the output is reviewed by humans. Opus is ~5x the cost of Sonnet per token. Plus, it's overnight batch — you're paying real-time API prices for a task that has no latency requirement.
- The fix: Switch to Sonnet 4.5 (5x cost reduction) AND move to Anthropic's batch API (50% additional discount). Combined: ~10x cheaper per call.
- The risk: Sonnet might produce subtly worse summaries on certain doc types. Mitigate: build a 50-doc eval set from current production output — sample 50 recent summaries and have your reviewers rate them 1-5. Run Sonnet on the same 50 docs. If Sonnet's average lands within 0.5 of Opus's, ship. If it's worse, try Sonnet with a longer, refined prompt before giving up.
- Estimated monthly savings: $5,000 × 90% = ~$4,500/month
- Implementation time: 1-2 days for the migration + 1 week running the eval in shadow mode + ship.
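The go/no-go gate in the risk bullet above reduces to a one-line comparison. A minimal sketch — the 0.5-point threshold and 1-5 scale come from the protocol above; function and argument names are illustrative:

```python
def downgrade_passes(incumbent_ratings: list[float],
                     candidate_ratings: list[float],
                     max_gap: float = 0.5) -> bool:
    """Ship the cheaper model only if its mean human rating stays within
    `max_gap` points of the incumbent's on the same eval set."""
    assert len(incumbent_ratings) == len(candidate_ratings), "rate the same docs"
    mean = lambda xs: sum(xs) / len(xs)
    return mean(incumbent_ratings) - mean(candidate_ratings) <= max_gap
```

Run it on paired ratings of the same 50 docs, never on two different samples.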
#2: Add Prompt Caching to the Customer Support Chatbot
- The waste: 12K conversations × 8 turns = 96K API calls, each re-sending the full system prompt (likely 1.5-3K tokens of persona + policies + instructions), billed at full price every time. At 12K convos × 8 turns × ~2K cache-eligible tokens × $15/M input tokens on Opus, that's ~$2,880/month on content that should cost about a tenth as much (~$288/month at cache-read pricing).
- The fix: Enable prompt caching on the stable portion of your system prompt. Put the 1.5-3K token 'persona + policies' block BEFORE the conversation history and mark the end of it with a cache breakpoint. Anthropic reuses a cached prefix whenever the same prefix is re-sent within the cache's 5-minute TTL, and each hit refreshes the timer.
- The risk: Minimal. Caching doesn't change outputs. Do make sure you're structuring the prompt as [stable system] + [variable conversation] — inverting this kills the cache hit.
- Estimated monthly savings: ~$2,600/month
- Implementation time: 4-8 hours. One engineering change.
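The ordering constraint in the fix above can be sketched as a request builder: stable block first, cache breakpoint on it, variable turns last. This assumes the Anthropic Messages API's `cache_control` content-block marker — verify field names and the minimum cacheable prompt length against the current prompt-caching docs before shipping:

```python
def build_request(stable_system: str, conversation: list[dict]) -> dict:
    """Order the prompt so the stable persona/policies block is a cacheable
    prefix; the per-conversation history follows it and never invalidates
    the cached portion."""
    return {
        "model": "claude-opus-4",  # illustrative model id
        "max_tokens": 512,
        "system": [{
            "type": "text",
            "text": stable_system,
            # marks the end of the cacheable prefix
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": conversation,  # variable content always goes last
    }

req = build_request("You are the support agent. Policies: ...",
                    [{"role": "user", "content": "Where is my order?"}])
```

If per-user context must appear early, append it after the cache breakpoint, never before it.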
#3: Migrate Code Review to Sonnet (with Eval)
- The waste: Code review is engineer-reviewed, which means quality tolerance is medium, not high. Opus is overkill. You're spending $2,250/month on a feature that would likely run equally well on Sonnet for $450/month.
- The fix: Build a 100-review eval set (take 100 recent reviews, plus engineer ratings if you have them). Run Sonnet on the same PRs. Compare engineer ratings side by side.
- The risk: Sonnet might miss more subtle architectural issues. Code review is the call type where Opus's advantages sometimes matter. Run the eval BEFORE migrating.
- Estimated monthly savings: $1,800/month IF the eval passes. If it doesn't, you keep Opus here.
- Implementation time: 2-3 days for eval + migration.
The Prompt-Caching Audit
#1: Customer support chatbot system prompt — almost certainly not cached correctly. Check your prompt structure: caching only applies once the stable prefix clears the model's minimum cacheable length (on the order of 1-2K tokens; check current docs) and the same prefix is re-sent within the TTL. If you're prepending per-user info to the system prompt, you're breaking the cache for every user. Put user info AFTER the stable system prompt.
#2: Document summarization prompt template — if all 2K docs/night share the same summarization instructions, those instructions should be cached. Batch processing eligible = combined win.
#3: Code review prompt — if you have a stable 'review this code for issues like X, Y, Z' preamble, cache it. The code itself varies per call but the instruction preamble shouldn't.
The Retry / Error Tax
At $14.2K spend and typical 2-4% error/retry rate, you're paying $280-570/month on retries. Not the biggest lever, but worth checking your monitoring — if your error rate is higher than 5%, that's a real cost AND a quality issue worth investigating.
The Batch API Opportunity
| Call type | Batch-eligible? | Reason |
|---|---|---|
| Document summarization | YES | Overnight, no latency req. 50% off immediately. |
| Code review | NO | 30-sec response, engineer is waiting |
| Customer support | NO | Real-time chat |
| Slack bot | MAYBE | 5-sec latency is loose; probably worth keeping real-time for simplicity |
Your single biggest batch-API win: document summarization. Already counted in Opportunity #1.
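As a sketch, the nightly run can be packaged for a batch endpoint like this. The request shape follows the Anthropic Message Batches API (one `custom_id` plus standard Messages params per request), but treat the model id and field layout as assumptions to verify against current docs:

```python
def batch_requests(docs: dict[str, str], instructions: str) -> list[dict]:
    """Package the nightly docs as batch-API requests; `docs` maps a stable
    document id to its text, and the id round-trips as custom_id so results
    can be matched back to inputs."""
    return [{
        "custom_id": doc_id,
        "params": {
            "model": "claude-sonnet-4-5",  # the downgraded model from fix #1
            "max_tokens": 1024,
            "system": instructions,        # shared across docs -> also cacheable
            "messages": [{"role": "user", "content": text}],
        },
    } for doc_id, text in docs.items()]

reqs = batch_requests({"doc-001": "Q3 board minutes ..."},
                      "Summarize the document in five bullets.")
# Submission (requires the anthropic SDK and an API key):
#   client.messages.batches.create(requests=reqs)
```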
The Cost-Cuts You Should NOT Make
- Customer support chat — do NOT downgrade to Haiku or Sonnet without extensive eval. You have evals for this call type (your 300 test convos) — use them. Customer-facing quality is where Opus earns its cost. Cache the prompt, don't downgrade the model.
- Hardest technical code reviews — if your code review has a 'deep architecture' mode (reviewing PRs over 500 lines), keep that on Opus even after migrating the simpler reviews. Sonnet handles single-file reviews well; multi-file architectural coherence is where Opus wins.
- Internal Slack bot — you're already on Sonnet. Moving to Haiku probably works but saves only ~$200/month. Not worth the implementation + eval risk for a low-stakes internal tool.
Projected Savings If You Implement Top 3
| Fix | Monthly savings | Implementation time |
|---|---|---|
| Doc summarization: Sonnet + batch | $4,500 | ~1 week (migration + eval) |
| Support chat: prompt caching | $2,600 | 1 day |
| Code review: Sonnet (if eval passes) | $1,800 | 2-3 days |
| Total | ~$8,900/month | ~2 weeks of work |
Payback: roughly two weeks of calendar time but only a few days of hands-on engineering — a few thousand dollars of salaried work against $100K+ in annual savings. This is some of the highest-ROI engineering work you can do this quarter.
Key Takeaways
- Model selection per call-type is worth 5-10x the savings of prompt engineering. Most teams optimize the wrong layer.
- Prompt caching requires correct structure. Variable content must go AFTER stable content, or the cache never hits.
- Overnight batch work always belongs on the batch API. 50% off for non-real-time calls — the discount is so large it should be default.
- Eval before downgrade, not after. The cost of a production regression dwarfs the savings of the model switch.
- Cost distribution is Pareto, not uniform. Audit your top 3 endpoints, not your whole system.
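The first and last takeaways combine into a tiny routing policy: tag each call type with its quality bar and batch tolerance, then pick the cheapest model that clears both. A sketch with illustrative model ids and tiers — tune against your own evals, not this table:

```python
# (quality bar, batch-tolerant?) -> (model id, use batch API?)
# Illustrative policy only -- validate every downgrade with an eval first.
ROUTES = {
    ("high",   False): ("claude-opus-4",     False),
    ("medium", False): ("claude-sonnet-4-5", False),
    ("medium", True):  ("claude-sonnet-4-5", True),
    ("low",    False): ("claude-haiku",      False),
    ("low",    True):  ("claude-haiku",      True),
}

def route(quality: str, batch_tolerant: bool) -> tuple[str, bool]:
    """Cheapest model + delivery mode that meets the call type's bar."""
    return ROUTES[(quality, batch_tolerant)]

# e.g. document summarization: medium quality, overnight batch
model, use_batch = route("medium", True)
```

Keeping the policy in one table makes the quarterly re-audit a diff, not a rewrite.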
Common use cases
- AI/ML engineering leads reviewing quarterly LLM spend
- Startup founders whose OpenAI/Anthropic bill crossed $5K/mo and scares them
- Platform engineers setting up budgets before production launch
- PMs prepping cost projections for a new AI feature's business case
- Consultants auditing a client's AI infrastructure for optimization
- Solo builders whose personal API bill hit embarrassing territory
- Enterprises migrating from single-model to multi-model architecture
Best AI model for this
Claude Opus 4 or GPT-5 Thinking. This task requires cross-referencing cost patterns with quality implications — weaker models will recommend cost cuts that break quality.
Pro tips
- Paste real usage data — your month's token counts per endpoint, model, and average input/output sizes. Abstract 'we use a lot of Opus' analysis is useless.
- Include the QUALITY requirement per call type. 'Customer-facing' vs 'internal summarization' have different model-downgrade tolerances.
- Tell the Original about your latency requirements. A call that needs < 2sec response has different options than a batch processing call.
- Mention whether you have evaluation infrastructure. If you have evals, you can test model swaps empirically. If not, recommendations are more conservative.
- Include any existing caching — are you using prompt caching? Batch API? Provisioned throughput? The savings stack differently based on your current setup.
- For the prompt-compression suggestions: always test the compressed version against your evals before production.
Customization tips
- Always paste real usage numbers — 'we use a lot of Opus' is not auditable. API dashboards export this in CSV; paste it.
- Before migrating any model, build the eval. Even 30-50 labeled examples of 'good output' is enough to test model swaps empirically.
- Save the output of this audit. Re-run it quarterly — as your usage patterns change, the optimal routing changes.
- For teams with no evals yet: start with the lowest-risk opportunity (prompt caching). It doesn't change outputs, only cost. Build eval infrastructure in parallel.
- When presenting cost-savings to leadership, show the Pareto chart first. 'We can cut 60% of spend by fixing 2 things' is a much easier conversation than 'we should audit everything.'
Variants
Pre-Launch Cost Modeling
For features in development — projects month-1 through month-12 costs based on traffic assumptions, with the 3 cost spikes most teams miss (errors + retries, cold-start prompts, feedback loops).
Migration from Single-Model
For teams currently using one model for everything. Produces the routing logic: which calls go to which model, with the quality/cost tradeoffs per route.
Batch vs Real-Time Analysis
Identifies which calls could move from real-time to batch API (50% cost reduction on Anthropic / OpenAI batch endpoints) without breaking user experience.
Frequently asked questions
How do I use the AI Agent Cost Auditor prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with AI Agent Cost Auditor?
Claude Opus 4 or GPT-5 Thinking. This task requires cross-referencing cost patterns with quality implications — weaker models will recommend cost cuts that break quality.
Can I customize the AI Agent Cost Auditor prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: paste real usage data (your month's token counts per endpoint, model, and average input/output sizes — abstract 'we use a lot of Opus' analysis is useless), and include the quality requirement per call type ('customer-facing' vs 'internal summarization' have different model-downgrade tolerances).
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals