⚡ Promptolis Original · AI Agents & Automation
📊 AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents
The structured observability architecture for production Claude agents — covering trace instrumentation, structured logging, cost attribution, eval pipelines, and alert thresholds, with the full Langfuse/Datadog/Grafana stack map that turns opaque agents into debuggable systems.
Why this is epic
Most production AI agents are black boxes. When they misbehave, teams debug by reading CloudWatch logs and guessing. This Original produces the complete observability architecture: traces (OpenTelemetry), structured logs (with LLM-specific fields), cost attribution (per-user + per-tool), an evals pipeline (regression + drift detection), and alerts (latency + cost + error rate). Based on patterns from teams running Claude agents in production at scale.
Names the 6 observability layers every production agent needs — invocation trace / tool-call spans / prompt-response logs / cost attribution / eval pipeline / alerting — and the specific tool choices for each (Langfuse, OpenTelemetry, Datadog, Grafana, Axiom, custom eval frameworks).
Produces the complete stack map with specific SDK integrations (Anthropic SDK spans, MCP server traces, LangSmith for prompt versioning if using), sample dashboards, alert thresholds, and the on-call runbook. Based on production deployments handling 1M+ agent invocations/month.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<input> <agent-context>B2B SaaS AI assistant for marketing teams. Users ask the agent to do things like 'analyze my Q3 campaign performance' or 'draft 5 subject line variants for this email.' Agent has access to: their connected HubSpot + Google Analytics + their own CMS. Uses Claude Sonnet 4.5 + tool calls. Deployed on our own infra (AWS ECS Fargate).</agent-context> <scale>~8,000 invocations/day currently (across ~500 paying customers). Growing 25%/month. Expected 40K invocations/day in 6 months.</scale> <current-observability>Basic CloudWatch logs. We can see errors but not much else. No cost visibility per user. No evals. No tracing. When users report bad output, we have no ability to reproduce or debug beyond reading raw logs.</current-observability> <tech-stack>Python 3.11 backend (FastAPI). Node.js frontend. AWS infra (ECS, RDS Postgres, S3, CloudWatch). No current LLM-specific tooling.</tech-stack> <compliance-requirements>SOC2 Type II (achieved), GDPR (EU customers). Need audit trail of prompts + responses for 90 days. PII in customer data (marketing contacts — emails, names).</compliance-requirements> <cost-model>Need per-user cost attribution. We want to identify top 10% spenders, investigate cost outliers, and potentially introduce usage-based pricing tiers.</cost-model> <critical-metrics>Critical: output quality (when users complain, can we debug?), cost per user (are we unit-economics-positive?), latency P99 (users churn if slow). Important: error rate, tool-use correctness, hallucination rate.</critical-metrics> <budget>Can afford paid tools. Currently 0 budget on observability — getting new line item for this. Probably $2-5K/month is approvable for right tooling.</budget> </input>
Observability Architecture: Marketing AI Assistant
An AI agent observability architecture is a 6-layer system (invocation trace / tool-call spans / prompt-response logs / cost attribution / eval pipeline / alerting) that transforms opaque agents into debuggable systems. Based on patterns from teams running 1M+ agent invocations/month, the typical failure pattern is 'observability bolted on after scale pain' — retrofitting costs 5x more than building it in from the start. Your situation (8K→40K invocations/day projected, SOC2 achieved, GDPR concerns, need for per-user cost attribution, $2-5K/mo budget) is ideal for the mid-tier paid stack: Langfuse (hosted) + OpenTelemetry + Datadog (existing) + custom eval pipeline. This architecture produces full observability in 4 weeks of implementation at a ~$550/mo new-tooling run-rate, and positions you for the coming 5x growth.
Current State + Gap Analysis
Current state: CloudWatch logs only. Effectively zero LLM-specific observability.
Gaps (critical):
- No distributed tracing — can't reproduce failures
- No prompt/response logging — can't debug quality complaints
- No cost attribution — can't identify expensive users
- No eval pipeline — prompt changes ship without validation
- No LLM-aware alerting — only infrastructure alerts
- No audit trail for compliance
Gaps (important but not critical):
- No prompt versioning — hard to correlate behavior changes with prompt changes
- No hallucination detection — quality issues discovered by users
Verdict: Current state is below the bar for a B2B SaaS processing customer data at this scale. Urgency: start this within 2 weeks.
Layer 1: Invocation Trace
Tool: OpenTelemetry + Langfuse
Instrument every agent invocation with a root span:
agent.invocation (root span)
- user_id
- organization_id
- feature (which agent — analyze / draft / query)
- input_type (text / voice / structured)
- total_latency_ms
- total_cost_usd
- success/failure
- output_length
Python SDK pattern:
from opentelemetry import trace
from langfuse.decorators import observe

@observe(name='agent.invocation')
async def run_agent(user_id: str, query: str):
    span = trace.get_current_span()
    span.set_attribute('user_id', user_id)
    span.set_attribute('feature', detect_feature(query))
    # ... agent logic
    return response
Langfuse integrates natively with OpenTelemetry and the Anthropic SDK. Traces from the Anthropic SDK (prompt, response, tokens) flow automatically into Langfuse.
Layer 2: Tool-Call Spans
Tool: OpenTelemetry child spans within invocation trace
Every tool call gets a child span:
tool.hubspot_query (child of agent.invocation)
- tool_name
- input_params
- latency_ms
- status (success/error)
- error_message (if applicable)
- output_size_bytes
This lets you reconstruct the full graph of a failed invocation:
agent.invocation (10.2s, failed)
├── llm.claude.call (2.1s, ok)
├── tool.hubspot_query (6.8s, ok)
├── tool.ga4_query (0.9s, ok)
└── llm.claude.call (0.4s, failed — context window exceeded)
Immediately visible: context window exceeded after tool calls returned too much data.
Layer 3: Prompt-Response Logs
Tool: Langfuse (hosted) for structured LLM logs + S3 for long-term archive
Every LLM call logs:
- prompt (full text, PII-redacted)
- response (full text, PII-redacted)
- model (claude-sonnet-4-5-20250929 or similar)
- prompt_version (if versioned)
- input_tokens
- output_tokens
- cost_usd (computed from tokens × rates)
- latency_ms
- tool_calls_requested (names + params)
- tool_calls_executed (names + params + results)
- user_id (linked to invocation trace)
PII redaction:
- Pre-write middleware scrubs emails, phone numbers, credit cards via regex
- For customer data (marketing contacts), hash PII into opaque IDs before logging
- Keep original data in Postgres (encrypted) if debugging requires it
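A minimal sketch of the pre-write scrubber, assuming regex detection is acceptable for a first pass (production redaction needs more patterns plus a recall check). Hashing PII into a stable opaque ID keeps redacted logs joinable for debugging:

```python
import hashlib
import re

# Illustrative patterns only; extend for credit cards, addresses, etc.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def _pseudonymize(match):
    # Same input always yields the same opaque ID, so traces stay correlatable.
    digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:12]
    return f"<pii:{digest}>"

def redact(text: str) -> str:
    """Scrub phone numbers and emails before the log line is written."""
    text = PHONE_RE.sub(_pseudonymize, text)
    text = EMAIL_RE.sub(_pseudonymize, text)
    return text
```

Run this as middleware on the write path, so unredacted text never reaches Langfuse or S3.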
Retention: 90 days in Langfuse (hot, queryable), 2 years in S3 (cold, compliance).
Layer 4: Cost Attribution
Your critical need. Per-user cost attribution for pricing tier decisions.
Implementation:
- Every LLM call includes user_id in Langfuse metadata
- Every tool call has cost attribution (HubSpot and GA4 API calls are free, but attribute compute time where it is material)
- Aggregate hourly into a user_costs Postgres table
Schema:
CREATE TABLE user_costs (
    user_id TEXT,
    organization_id TEXT,
    date DATE,
    hour INT,
    invocation_count INT,
    total_input_tokens INT,
    total_output_tokens INT,
    total_cost_usd DECIMAL(10,4),
    by_model JSONB,   -- {claude-sonnet: $0.45, claude-opus: $1.20}
    by_feature JSONB  -- {analyze: $0.80, draft: $0.85}
);
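The hourly aggregation job can be sketched in Python. The per-million-token rates below are placeholders (verify against current Anthropic pricing), and the bucket carries only two of the table's columns to keep the sketch short:

```python
from collections import defaultdict
from decimal import Decimal

# Placeholder per-million-token rates; check current pricing before use.
RATES = {"claude-sonnet": {"input": Decimal("3.00"), "output": Decimal("15.00")}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one LLM call from its token counts, using Decimal for exact money math."""
    r = RATES[model]
    return (r["input"] * input_tokens + r["output"] * output_tokens) / 1_000_000

def aggregate_hourly(calls):
    """Roll per-call records into (user_id, date, hour) buckets that map
    onto rows of the user_costs table."""
    buckets = defaultdict(lambda: {"invocation_count": 0, "total_cost_usd": Decimal(0)})
    for c in calls:
        b = buckets[(c["user_id"], c["ts"].date(), c["ts"].hour)]
        b["invocation_count"] += 1
        b["total_cost_usd"] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return buckets
```

Each bucket then becomes an upsert into `user_costs`, keyed on (user_id, date, hour).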
Dashboards:
- Top 10% spenders (daily/weekly/monthly)
- Cost-per-user distribution histogram
- Unit economics: cost-per-user vs. plan-revenue-per-user
- Feature cost breakdown (which features are expensive?)
Weekly cost-attribution report:
- Emailed to product + exec
- Identifies: top spenders, cost anomalies, features driving growth in spend
Layer 5: Eval Pipeline
Tool: Custom pipeline using Langfuse datasets + Claude as judge
Regression eval (runs on every prompt change):
- Test set: 50 curated real user queries with known-good outputs
- Run each through agent with new prompt
- Use Claude Opus as judge: rates output vs. expected on 5-point scale
- Flag regressions: any query scoring <3 when expected ≥4
- Block deploy if >2 regressions
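The gating rule in the last two bullets reduces to a small pure function. `regression_gate` and its field names are illustrative, not part of any framework:

```python
def regression_gate(results, regression_threshold=2):
    """Flag regressions (scored <3 where >=4 was expected) and decide
    whether the deploy should be blocked (more than regression_threshold)."""
    regressions = [
        r for r in results
        if r["expected_score"] >= 4 and r["score"] < 3
    ]
    return {
        "regressions": regressions,
        "block_deploy": len(regressions) > regression_threshold,
    }
```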
Drift eval (runs daily):
- Sample 100 real production invocations
- Have Claude Opus score them on 5-point scale for quality
- Track rolling 7-day average score
- Alert if score drops >0.3 below baseline
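The drift rule is equally simple to state in code. This sketch assumes you already have one aggregate judge score per day and a fixed baseline:

```python
from statistics import mean

def drift_alert(daily_scores, baseline, window=7, max_drop=0.3):
    """True when the rolling window-day mean score has fallen more than
    max_drop below the baseline."""
    if len(daily_scores) < window:
        return False  # not enough history yet
    return (baseline - mean(daily_scores[-window:])) > max_drop
```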
Eval implementation:
async def run_eval(dataset_id: str, agent_version: str):
    dataset = langfuse.get_dataset(dataset_id)
    results = []
    for item in dataset.items:
        response = await run_agent(item.input, agent_version=agent_version)
        score = await judge_with_claude(item.input, response, item.expected_output)
        results.append({'item_id': item.id, 'score': score, 'response': response})
        langfuse.log_eval_result(dataset_id, item.id, score)
    return summarize(results)
CI integration: GitHub Actions runs eval on every PR that touches prompts.
Layer 6: Alerting
Tool: Datadog (existing) + PagerDuty (existing)
Alert categories:
1. Availability alerts (P0):
- Agent invocation error rate > 5% over 10 min → PagerDuty
- Agent invocation p99 latency > 30s → PagerDuty
- Full service down (zero invocations in 5 min during business hours) → PagerDuty
2. Cost alerts (P1):
- Hourly cost > $500 (≈ 10x normal) → Slack #ai-ops
- Daily cost > $3,000 (≈ 5x normal) → Slack + PagerDuty
- Single user cost > $50/day → Slack (investigate possible abuse)
- Weekly cost spike detection (3σ above 7-day rolling mean) → Slack
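The 3σ spike rule from the last bullet, as a sketch over hourly cost totals (the trailing window you feed in and the sigma threshold are tuning choices):

```python
from statistics import mean, stdev

def cost_spike(hourly_costs, current, sigmas=3.0):
    """True when the current hour's spend sits more than sigmas standard
    deviations above the mean of the trailing window."""
    if len(hourly_costs) < 2:
        return False  # stdev needs at least two samples
    return current > mean(hourly_costs) + sigmas * stdev(hourly_costs)
```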
3. Quality alerts (P2):
- Daily eval score drops >0.3 → Slack #ai-eng
- User-reported quality complaints spike (>3/hour) → Slack
- Hallucination detection (user reports 'wrong' feedback) > 5% over 1h → Slack
4. Tool alerts (P2):
- Tool call error rate (HubSpot/GA4 integration) > 10% over 15 min → Slack
- Tool call latency p99 > 10s → Slack
Stack Map
Your specific stack:
| Layer | Tool | Cost/month | Why |
|---|---|---|---|
| Distributed tracing | OpenTelemetry (open-source) | $0 | Industry standard |
| LLM traces + logs | Langfuse (hosted) | $499/mo (Team plan) | Best-in-class LLM observability, integrates with Anthropic SDK |
| Metrics + dashboards | Datadog (existing) | Existing line item | Already deployed, can add LLM metrics |
| Log archive (compliance) | S3 + Athena | ~$50/mo | Cold storage for 2-year retention |
| Eval pipeline | Custom Python + Langfuse datasets | $0 (dev time) | Flexible, no vendor lock-in |
| Cost attribution DB | Postgres (existing RDS) | $0 (existing) | Join with user_id for attribution |
| Alerting | Datadog + PagerDuty (existing) | Existing | Already deployed |
| PII redaction | Custom middleware | $0 (dev time) | Compliance requirement |
Total new spend: ~$550/month. Well within $2-5K budget.
Dashboards
Dashboard 1: Agent Health (primary on-call view)
- Invocations/minute
- Error rate
- Latency P50 / P95 / P99
- Tool call success rate (by tool)
- Active user count
Dashboard 2: Cost Attribution
- Cost today / this week / this month
- Top 10 spenders today
- Cost distribution histogram
- Cost-per-user vs revenue-per-user (unit economics)
- Feature cost breakdown
Dashboard 3: Quality
- Daily eval score (rolling 7-day avg)
- User thumbs-up/down rate
- Hallucination report rate
- Regression eval pass rate (CI)
Dashboard 4: Tool Performance
- Tool latency (HubSpot / GA4 / CMS)
- Tool error rate (per tool)
- Tool call volume
Runbook Integration
When an alert fires at 3am:
1. On-call goes to Agent Health dashboard first — confirms scope (single-user vs systemic)
2. If systemic: check Dashboard 4 (Tool Performance) — often issue is upstream
3. If quality alert: check Langfuse traces for sample failures — reproduce + investigate
4. If cost spike: check Cost Attribution dashboard — identify spiking user, rate-limit if abuse
5. If eval regression: roll back last prompt change, investigate in AM
Every alert includes a runbook link that directs the on-call engineer straight to the right dashboard.
Implementation Roadmap
Week 1: Foundation
- Deploy OpenTelemetry in Python backend
- Integrate Langfuse with Anthropic SDK
- Structured logging with LLM-specific fields
- PII redaction middleware
Week 2: Tracing + Cost
- Tool-call spans
- user_costs Postgres table + hourly aggregation job
- Dashboard 1 (Agent Health) live in Datadog
- Dashboard 2 (Cost Attribution) live in Datadog
Week 3: Evals
- Build test dataset (50 curated queries, known-good outputs)
- Implement regression eval
- CI integration (GitHub Actions)
- Start collecting user feedback (thumbs up/down)
Week 4: Alerts + Polish
- All alerts wired in Datadog + PagerDuty
- Runbook documentation
- Team training session
- 2-year S3 archive for compliance
Post-week 4: drift eval + hallucination detection + prompt versioning (roadmap items).
Key Takeaways
- 6 observability layers: invocation trace, tool-call spans, prompt-response logs, cost attribution, eval pipeline, alerting. All essential for production.
- Your stack: OpenTelemetry + Langfuse + Datadog (existing) + S3 (compliance archive). ~$550/mo new spend.
- Per-user cost attribution is critical for your B2B model. Top 10% users likely drive 80% of cost. Dashboard surfaces this for pricing decisions.
- Eval pipeline on every prompt change prevents silent regressions. 50-query test set + Claude Opus as judge + CI integration blocks bad deploys.
- Implementation: 4 weeks. Start within 2 weeks — retrofitting observability at 40K invocations/day (your 6-month projection) is 5x harder than building now at 8K/day.
Common use cases
- AI engineering teams taking agents from prototype to production
- Platform teams building agent infrastructure at scale
- DevOps/SRE teams supporting LLM-powered products
- Startups preparing for AI product launch with observability day-1
- Enterprise teams meeting SOC2/ISO compliance for AI systems
- Agentic product teams (Claude Code-style, Cursor-style tools) needing visibility
- Cost/FinOps teams attributing LLM spend across teams or customers
- Eval/quality teams building regression suites for agent behavior
- Security teams auditing LLM usage patterns
Best AI model for this
Claude Opus 4 or Sonnet 4.5. Observability architecture requires reasoning about distributed systems, LLM-specific concerns, cost economics, and operational practices. Top-tier reasoning matters.
Pro tips
- Instrument FIRST, optimize later. A production agent without traces is undebuggable. Add observability on day 1, even if the stack is simple (just structured logs is better than nothing).
- Trace every LLM call with: model, tokens-in, tokens-out, cost, latency, prompt-version (if you version), tool-calls-requested, tool-calls-executed, final-output-length. These 8 fields cover 90% of debugging needs.
- Cost attribution per USER is essential for B2B SaaS. 10% of your users probably drive 80% of costs. Without per-user attribution, you can't identify or charge them appropriately.
- Eval pipelines should run on EVERY prompt change. Even 'small' prompt tweaks can degrade performance silently. Automated evals catch regressions before users do.
- LLM latency has a long tail. P50 might be 2 seconds, P99 might be 20 seconds. Monitor P99 more than P50 — it's what your frustrated user experiences.
- Log the PROMPT, not just the response. Debugging requires seeing what Claude actually received. Redact sensitive data at log-write time, not at query time.
- For tool-using agents, log the tool-call GRAPH (what was called, what returned, in what order). Reconstructing from individual log lines is painful; structured graph data is debugger-friendly.
- Alert on COST spikes, not just cost thresholds. Absolute threshold alerts fire too late. Spike detection (e.g., 3σ above rolling 7-day mean) catches problems earlier.
Customization tips
- Start with LOGGING before TRACING. A single week of good structured logs gives you 70% of the debugging value. Traces are additive once logs work.
- Don't skip PII redaction. GDPR + SOC2 both require it for user data. Retrofit is painful — build it in from first log line.
- Cost attribution via Langfuse is built-in but the aggregation to your Postgres is custom work. Plan for 2-3 days of implementation for clean per-user rollups.
- Your eval test set should grow over time. Every user-reported bug becomes a test case. Within 6 months, aim for 200+ test cases covering edge cases + common patterns.
- For regulated industries (HIPAA, FINRA), add eval for compliance-specific concerns (e.g., 'did agent say anything that could be interpreted as medical/financial advice'). Compliance eval is its own pipeline.
Variants
Startup/MVP Mode
For teams shipping first production agent. Minimal stack — structured logs + cost tracking + basic alerts. Can upgrade later.
Scale-Up Mode
For teams at 100K+ agent invocations/month. Full stack — OpenTelemetry + Langfuse + Datadog + evals pipeline.
Enterprise Compliance Mode
For regulated industries. Emphasizes audit logs, data retention, PII handling, and compliance-ready monitoring.
Cost-Attribution Mode
For teams focused on FinOps. Emphasizes per-user + per-tool cost attribution, chargeback data, cost optimization dashboards.
Frequently asked questions
How do I use the AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents?
Claude Opus 4 or Sonnet 4.5. Observability architecture requires reasoning about distributed systems, LLM-specific concerns, cost economics, and operational practices. Top-tier reasoning matters.
Can I customize the AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: instrument first and optimize later (a production agent without traces is undebuggable, so add observability on day 1 even if the stack is just structured logs), and trace every LLM call with model, tokens in/out, cost, latency, prompt version, tool calls requested/executed, and final output length; these fields cover 90% of debugging needs.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals