⚡ Promptolis Original · AI Agents & Automation

📊 AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents

The structured observability architecture for production Claude agents — covering trace instrumentation, structured logging, cost attribution, eval pipelines, and alert thresholds, with the full Langfuse/Datadog/Grafana stack map that turns opaque agents into debuggable systems.

⏱️ 12 min to design + 1-2 days to implement 🤖 ~2 min in Claude 🗓️ Updated 2026-04-20

Why this is epic

Most production AI agents are black boxes. When they misbehave, teams debug by reading raw CloudWatch logs and guessing. This Original produces the complete observability architecture: traces (OpenTelemetry), structured logs (with LLM-specific fields), cost attribution (per-user + per-tool), evals pipeline (regression + drift detection), and alerts (latency + cost + error rate). Based on patterns from teams running Claude agents in production at scale.

Names the 6 observability layers every production agent needs — invocation trace / tool-call spans / prompt-response logs / cost attribution / eval pipeline / alerting — and the specific tool choices for each (Langfuse, OpenTelemetry, Datadog, Grafana, Axiom, custom eval frameworks).

Produces the complete stack map with specific SDK integrations (Anthropic SDK spans, MCP server traces, LangSmith for prompt versioning if using), sample dashboards, alert thresholds, and the on-call runbook. Based on production deployments handling 1M+ agent invocations/month.

The prompt

Promptolis Original · Copy-ready
<role> You are an AI systems observability architect with 5+ years of experience taking production LLM applications from prototype to scale. You've designed observability stacks for agents at startups, mid-stage companies, and enterprises — handling up to 10M agent invocations/month. You know the SDK integration patterns (Anthropic, OpenAI, LangSmith, Langfuse), infrastructure tools (Datadog, Grafana, Axiom, Honeycomb, OpenTelemetry), and the LLM-specific observability concerns (prompt versioning, cost attribution, eval pipelines). You are direct. You will name when an observability stack is over-engineered, when it's missing essential pieces, when cost attribution is broken, and when eval pipelines are absent. </role> <principles> 1. 6 observability layers: invocation trace, tool-call spans, prompt-response logs, cost attribution, eval pipeline, alerting. 2. Instrument day-1. Retrofitting observability later is 5x harder. 3. Every LLM call: model, tokens, cost, latency, prompt-version, tool graph, output length. 4. Per-user cost attribution for B2B. 5. Eval pipeline on every prompt change. 6. Monitor P99, not just P50. 7. Log prompts (redacted for PII). 8. Alert on spikes (3σ), not just thresholds. </principles> <input> <agent-context>{what the agent does + environment}</agent-context> <scale>{invocations per day/month, concurrent users}</scale> <current-observability>{what you have now — nothing / basic logs / full stack}</current-observability> <tech-stack>{Python/TS/etc, existing monitoring tools}</tech-stack> <compliance-requirements>{SOC2, HIPAA, GDPR, none}</compliance-requirements> <cost-model>{per-user cost attribution needed / org-wide is fine}</cost-model> <critical-metrics>{what matters most — latency, cost, error rate, quality}</critical-metrics> <budget>{can you afford paid tools or need open-source}</budget> </input> <output-format> # Observability Architecture: [Agent name] ## Current State + Gap Analysis What you have vs. what you need. 
## Layer 1: Invocation Trace Distributed tracing with OpenTelemetry. ## Layer 2: Tool-Call Spans Granular tool-use tracing. ## Layer 3: Prompt-Response Logs Structured logging with LLM-specific fields. ## Layer 4: Cost Attribution Per-user / per-tool / per-feature cost tracking. ## Layer 5: Eval Pipeline Regression + drift detection. ## Layer 6: Alerting Thresholds + spike detection. ## Stack Map Specific tools + integrations. ## Dashboards Which dashboards, what's on them. ## Runbook Integration How this helps on-call. ## Implementation Roadmap Week-by-week rollout. ## Key Takeaways 5 bullets. </output-format> <auto-intake> If input incomplete: ask for agent context, scale, current observability, tech stack, compliance, cost model, critical metrics, budget. </auto-intake> Now, design:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<agent-context>B2B SaaS AI assistant for marketing teams. Users ask the agent to do things like 'analyze my Q3 campaign performance' or 'draft 5 subject line variants for this email.' Agent has access to: their connected HubSpot + Google Analytics + their own CMS. Uses Claude Sonnet 4.5 + tool calls. Deployed on our own infra (AWS ECS Fargate).</agent-context>
<scale>~8,000 invocations/day currently (across ~500 paying customers). Growing 25%/month. Expected 40K invocations/day in 6 months.</scale>
<current-observability>Basic CloudWatch logs. We can see errors but not much else. No cost visibility per user. No evals. No tracing. When users report bad output, we have no ability to reproduce or debug beyond reading raw logs.</current-observability>
<tech-stack>Python 3.11 backend (FastAPI). Node.js frontend. AWS infra (ECS, RDS Postgres, S3, CloudWatch). No current LLM-specific tooling.</tech-stack>
<compliance-requirements>SOC2 Type II (achieved), GDPR (EU customers). Need audit trail of prompts + responses for 90 days. PII in customer data (marketing contacts — emails, names).</compliance-requirements>
<cost-model>Need per-user cost attribution. We want to identify top 10% spenders, investigate cost outliers, and potentially introduce usage-based pricing tiers.</cost-model>
<critical-metrics>Critical: output quality (when users complain, can we debug?), cost per user (are we unit-economics-positive?), latency P99 (users churn if slow). Important: error rate, tool-use correctness, hallucination rate.</critical-metrics>
<budget>Can afford paid tools. Currently 0 budget on observability — getting new line item for this. Probably $2-5K/month is approvable for right tooling.</budget>
</input>
🤖 Output

Observability Architecture: Marketing AI Assistant

An AI agent observability architecture is a 6-layer system (invocation trace / tool-call spans / prompt-response logs / cost attribution / eval pipeline / alerting) that transforms opaque agents into debuggable systems. Based on patterns from teams running 1M+ agent invocations/month, the typical failure pattern is 'observability bolted on after scale pain' — retrofitting costs 5x more than building it in from the start. Your situation (8K→40K invocations/day projected, SOC2 achieved, GDPR concerns, need per-user cost attribution, $2-5K/mo budget) is ideal for the mid-tier paid stack: Langfuse (hosted) + OpenTelemetry + Datadog (existing) + custom eval pipeline. This architecture produces full observability in 4 weeks of implementation, ~$1,800/mo run-rate, and positions you for the 5x growth coming.

Current State + Gap Analysis

Current state: CloudWatch logs only. Effectively zero LLM-specific observability.

Gaps (critical):

  • No distributed tracing — can't reproduce failures
  • No prompt/response logging — can't debug quality complaints
  • No cost attribution — can't identify expensive users
  • No eval pipeline — prompt changes ship without validation
  • No LLM-aware alerting — only infrastructure alerts
  • No audit trail for compliance

Gaps (important but not critical):

  • No prompt versioning — hard to correlate behavior changes with prompt changes
  • No hallucination detection — quality issues discovered by users

Verdict: Current state is below the bar for a B2B SaaS processing customer data at this scale. Urgency: start this within 2 weeks.

Layer 1: Invocation Trace

Tool: OpenTelemetry + Langfuse

Instrument every agent invocation with a root span:

  • agent.invocation (root span), with attributes:
    - user_id
    - organization_id
    - feature (which agent — analyze / draft / query)
    - input_type (text / voice / structured)
    - total_latency_ms
    - total_cost_usd
    - success/failure
    - output_length

Python SDK pattern:

from opentelemetry import trace
from langfuse.decorators import observe

@observe(name='agent.invocation')  # creates the Langfuse trace for this invocation
async def run_agent(user_id: str, query: str):
    # Attach business attributes to the active OTEL span so they land on the trace
    span = trace.get_current_span()
    span.set_attribute('user_id', user_id)
    span.set_attribute('feature', detect_feature(query))
    # ... agent logic (LLM calls and tool calls become child spans)
    return response
Langfuse integrates natively with OTEL and Anthropic SDK. Traces from Anthropic SDK (prompt, response, tokens) flow automatically into Langfuse.

Layer 2: Tool-Call Spans

Tool: OpenTelemetry child spans within invocation trace

Every tool call gets a child span:

  • tool.hubspot_query (child of agent.invocation), with attributes:
    - tool_name
    - input_params
    - latency_ms
    - status (success/error)
    - error_message (if applicable)
    - output_size_bytes

This lets you reconstruct the full graph of a failed invocation:

agent.invocation (10.2s, failed)
├── llm.claude.call (2.1s, ok)
├── tool.hubspot_query (6.8s, ok)
├── tool.ga4_query (0.9s, ok)
└── llm.claude.call (0.4s, failed — context window exceeded)

Immediately visible: context window exceeded after tool calls returned too much data.
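A minimal sketch of walking that graph programmatically, assuming spans are exported as flat records with a parent pointer (the record shape and function names here are illustrative, not a specific SDK's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanRecord:
    name: str
    parent: Optional[str]  # None for the root span
    latency_ms: int
    status: str            # 'ok' or 'failed'

def first_failed_child(spans: list) -> Optional[SpanRecord]:
    """Return the first non-root span that failed, usually the root cause."""
    return next(
        (s for s in spans if s.parent is not None and s.status == "failed"),
        None,
    )
```

For the trace above, this returns the second llm.claude.call span (0.4s, failed), pointing straight at the context-window error.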

Layer 3: Prompt-Response Logs

Tool: Langfuse (hosted) for structured LLM logs + S3 for long-term archive

Every LLM call logs:

  • prompt (full text, PII-redacted)
  • response (full text, PII-redacted)
  • model (claude-sonnet-4-5-20250929 or similar)
  • prompt_version (if versioned)
  • input_tokens
  • output_tokens
  • cost_usd (computed from tokens × rates)
  • latency_ms
  • tool_calls_requested (names + params)
  • tool_calls_executed (names + params + results)
  • user_id (linked to invocation trace)
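The cost_usd field above is derived from token counts; a minimal sketch, where the rate table is an assumption to verify against Anthropic's current pricing:

```python
# Assumed USD rates per million tokens; verify against current Anthropic pricing.
RATES = {
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute cost_usd for one LLM call from token counts and per-model rates."""
    r = RATES[model]
    cost = (input_tokens / 1_000_000) * r["input"] + (output_tokens / 1_000_000) * r["output"]
    return round(cost, 6)
```

Computing cost at log-write time (rather than joining against rates later) means dashboards stay correct even after a pricing change.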

PII redaction:

  • Pre-write middleware scrubs emails, phone numbers, credit cards via regex
  • For customer data (marketing contacts), hash PII into opaque IDs before logging
  • Keep original data in Postgres (encrypted) if debugging requires it
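A sketch of the regex scrubbing step, with deliberately minimal patterns (production redaction needs broader coverage, e.g. credit cards and names):

```python
import re

# Minimal illustrative patterns; extend for production coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Scrub obvious PII before the log line is written, never at query time."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Running this as pre-write middleware keeps raw PII out of Langfuse and S3 entirely, which is what GDPR auditors want to see.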

Retention: 90 days in Langfuse (hot, queryable), 2 years in S3 (cold, compliance).

Layer 4: Cost Attribution

Your critical need. Per-user cost attribution for pricing tier decisions.

Implementation:

  • Every LLM call includes user_id in Langfuse metadata
  • Every tool call has cost attribution (HubSpot and GA4 API calls are free; compute cost is small but nonzero, so attribute it too)
  • Aggregate hourly into a user_costs Postgres table

Schema:

CREATE TABLE user_costs (
  user_id TEXT,
  organization_id TEXT,
  date DATE,
  hour INT,
  invocation_count INT,
  total_input_tokens INT,
  total_output_tokens INT,
  total_cost_usd DECIMAL(10,4),
  by_model JSONB,  -- {claude-sonnet: $0.45, claude-opus: $1.20}
  by_feature JSONB  -- {analyze: $0.80, draft: $0.85}
);
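A sketch of the hourly aggregation job that fills this table, done here in pure Python over per-call records (the record field names are illustrative; the real job would read from Langfuse and upsert into Postgres):

```python
from collections import defaultdict

def rollup_hourly(calls: list) -> dict:
    """Aggregate per-call cost records into (user_id, date, hour) buckets
    mirroring the user_costs schema."""
    buckets = defaultdict(lambda: {
        "invocation_count": 0,
        "total_input_tokens": 0,
        "total_output_tokens": 0,
        "total_cost_usd": 0.0,
    })
    for c in calls:
        # ts is an ISO-8601 string, e.g. '2026-04-20T14:03:11Z'
        key = (c["user_id"], c["ts"][:10], int(c["ts"][11:13]))
        b = buckets[key]
        b["invocation_count"] += 1
        b["total_input_tokens"] += c["input_tokens"]
        b["total_output_tokens"] += c["output_tokens"]
        b["total_cost_usd"] += c["cost_usd"]
    return dict(buckets)
```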

Dashboards:

  • Top 10% spenders (daily/weekly/monthly)
  • Cost-per-user distribution histogram
  • Unit economics: cost-per-user vs. plan-revenue-per-user
  • Feature cost breakdown (which features are expensive?)

Weekly cost-attribution report:

  • Emailed to product + exec
  • Identifies: top spenders, cost anomalies, features driving growth in spend

Layer 5: Eval Pipeline

Tool: Custom pipeline using Langfuse datasets + Claude as judge

Regression eval (runs on every prompt change):

  • Test set: 50 curated real user queries with known-good outputs
  • Run each through agent with new prompt
  • Use Claude Opus as judge: rates output vs. expected on 5-point scale
  • Flag regressions: any query scoring <3 when expected ≥4
  • Block deploy if >2 regressions

Drift eval (runs daily):

  • Sample 100 real production invocations
  • Have Claude Opus score them on 5-point scale for quality
  • Track rolling 7-day average score
  • Alert if score drops >0.3 below baseline

Eval implementation:

async def run_eval(dataset_id: str, agent_version: str):
    dataset = langfuse.get_dataset(dataset_id)
    results = []
    for item in dataset.items:
        # Re-run the agent on each curated query with the candidate prompt version
        response = await run_agent(item.input, agent_version=agent_version)
        # Claude-as-judge scores the output against the known-good answer (1-5)
        score = await judge_with_claude(item.input, response, item.expected_output)
        results.append({'item_id': item.id, 'score': score, 'response': response})
        langfuse.log_eval_result(dataset_id, item.id, score)
    return summarize(results)

CI integration: GitHub Actions runs eval on every PR that touches prompts.
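The deploy gate ("flag any query scoring &lt;3 when expected ≥4, block if more than 2 regressions") reduces to a few lines; a sketch, with the result shape (an `expected` field per item) assumed rather than taken from a specific eval framework:

```python
def deploy_gate(results: list, max_regressions: int = 2) -> bool:
    """A regression is any query scoring <3 where the expected score is >=4.
    Returns True if the prompt change may ship."""
    regressions = [r for r in results if r["score"] < 3 and r["expected"] >= 4]
    return len(regressions) <= max_regressions
```

In CI, a False return exits nonzero and fails the GitHub Actions check.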

Layer 6: Alerting

Tool: Datadog (existing) + PagerDuty (existing)

Alert categories:

1. Availability alerts (P0):
   - Agent invocation error rate > 5% over 10 min → PagerDuty
   - Agent invocation P99 latency > 30s → PagerDuty
   - Full service down (zero invocations in 5 min during business hours) → PagerDuty

2. Cost alerts (P1):
   - Hourly cost > $500 (≈ 10x normal) → Slack #ai-ops
   - Daily cost > $3,000 (≈ 5x normal) → Slack + PagerDuty
   - Single user cost > $50/day → Slack (investigate possible abuse)
   - Weekly cost spike detection (3σ above 7-day rolling mean) → Slack

3. Quality alerts (P2):
   - Daily eval score drops >0.3 → Slack #ai-eng
   - User-reported quality complaints spike (>3/hour) → Slack
   - Hallucination detection (user reports 'wrong' feedback) > 5% over 1h → Slack

4. Tool alerts (P2):
   - Tool call error rate (HubSpot/GA4 integration) > 10% over 15 min → Slack
   - Tool call latency P99 > 10s → Slack
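The 3σ spike detection used in the cost alerts can be sketched as a stateless check over recent samples (a simplification; in practice this would live in a Datadog anomaly monitor or a scheduled job):

```python
import statistics

def is_spike(history: list, current: float, sigmas: float = 3.0) -> bool:
    """Flag `current` if it exceeds mean + sigmas * stdev of recent samples,
    e.g. hourly costs over the rolling 7-day window."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return current > mean + sigmas * stdev
```

Unlike a fixed threshold, this fires on a 2x jump for a normally quiet metric while ignoring the same absolute value on a naturally noisy one.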

Stack Map

Your specific stack:

| Layer | Tool | Cost/month | Why |
|---|---|---|---|
| Distributed tracing | OpenTelemetry (open-source) | $0 | Industry standard |
| LLM traces + logs | Langfuse (hosted) | $499/mo (Team plan) | Best-in-class LLM observability, integrates with Anthropic SDK |
| Metrics + dashboards | Datadog (existing) | Existing line item | Already deployed, can add LLM metrics |
| Log archive (compliance) | S3 + Athena | ~$50/mo | Cold storage for 2-year retention |
| Eval pipeline | Custom Python + Langfuse datasets | $0 (dev time) | Flexible, no vendor lock-in |
| Cost attribution DB | Postgres (existing RDS) | $0 (existing) | Join with user_id for attribution |
| Alerting | Datadog + PagerDuty (existing) | Existing | Already deployed |
| PII redaction | Custom middleware | $0 (dev time) | Compliance requirement |

Total new spend: ~$550/month. Well within $2-5K budget.

Dashboards

Dashboard 1: Agent Health (primary on-call view)

  • Invocations/minute
  • Error rate
  • Latency P50 / P95 / P99
  • Tool call success rate (by tool)
  • Active user count

Dashboard 2: Cost Attribution

  • Cost today / this week / this month
  • Top 10 spenders today
  • Cost distribution histogram
  • Cost-per-user vs revenue-per-user (unit economics)
  • Feature cost breakdown

Dashboard 3: Quality

  • Daily eval score (rolling 7-day avg)
  • User thumbs-up/down rate
  • Hallucination report rate
  • Regression eval pass rate (CI)

Dashboard 4: Tool Performance

  • Tool latency (HubSpot / GA4 / CMS)
  • Tool error rate (per tool)
  • Tool call volume

Runbook Integration

When an alert fires at 3am:

1. On-call goes to Agent Health dashboard first — confirms scope (single-user vs systemic)

2. If systemic: check Dashboard 4 (Tool Performance) — often issue is upstream

3. If quality alert: check Langfuse traces for sample failures — reproduce + investigate

4. If cost spike: check Cost Attribution dashboard — identify spiking user, rate-limit if abuse

5. If eval regression: roll back last prompt change, investigate in AM

Runbook links in every alert to direct on-call to right dashboard.

Implementation Roadmap

Week 1: Foundation

  • Deploy OpenTelemetry in Python backend
  • Integrate Langfuse with Anthropic SDK
  • Structured logging with LLM-specific fields
  • PII redaction middleware

Week 2: Tracing + Cost

  • Tool-call spans
  • user_costs Postgres table + hourly aggregation job
  • Dashboard 1 (Agent Health) live in Datadog
  • Dashboard 2 (Cost Attribution) live in Datadog

Week 3: Evals

  • Build test dataset (50 curated queries, known-good outputs)
  • Implement regression eval
  • CI integration (GitHub Actions)
  • Start collecting user feedback (thumbs up/down)

Week 4: Alerts + Polish

  • All alerts wired in Datadog + PagerDuty
  • Runbook documentation
  • Team training session
  • 2-year S3 archive for compliance

Post-week 4: drift eval + hallucination detection + prompt versioning (roadmap items).

Key Takeaways

  • 6 observability layers: invocation trace, tool-call spans, prompt-response logs, cost attribution, eval pipeline, alerting. All essential for production.
  • Your stack: OpenTelemetry + Langfuse + Datadog (existing) + S3 (compliance archive). ~$550/mo new spend.
  • Per-user cost attribution is critical for your B2B model. Top 10% users likely drive 80% of cost. Dashboard surfaces this for pricing decisions.
  • Eval pipeline on every prompt change prevents silent regressions. 50-query test set + Claude Opus as judge + CI integration blocks bad deploys.
  • Implementation: 4 weeks. Start within 2 weeks — retrofitting observability at 40K invocations/day (your 6-month projection) is 5x harder than building now at 8K/day.

Common use cases

  • AI engineering teams taking agents from prototype to production
  • Platform teams building agent infrastructure at scale
  • DevOps/SRE teams supporting LLM-powered products
  • Startups preparing for AI product launch with observability day-1
  • Enterprise teams meeting SOC2/ISO compliance for AI systems
  • Agentic product teams (Claude Code-style, Cursor-style tools) needing visibility
  • Cost/FinOps teams attributing LLM spend across teams or customers
  • Eval/quality teams building regression suites for agent behavior
  • Security teams auditing LLM usage patterns

Best AI model for this

Claude Opus 4 or Sonnet 4.5. Observability architecture requires reasoning about distributed systems, LLM-specific concerns, cost economics, and operational practices. Top-tier reasoning matters.

Pro tips

  • Instrument FIRST, optimize later. A production agent without traces is undebuggable. Add observability on day 1, even if the stack is simple (just structured logs is better than nothing).
  • Trace every LLM call with: model, tokens-in, tokens-out, cost, latency, prompt-version (if you version), tool-calls-requested, tool-calls-executed, final-output-length. These 8 fields cover 90% of debugging needs.
  • Cost attribution per USER is essential for B2B SaaS. 10% of your users probably drive 80% of costs. Without per-user attribution, you can't identify or charge them appropriately.
  • Eval pipelines should run on EVERY prompt change. Even 'small' prompt tweaks can degrade performance silently. Automated evals catch regressions before users do.
  • LLM latency has a long tail. P50 might be 2 seconds, P99 might be 20 seconds. Monitor P99 more than P50 — it's what your frustrated user experiences.
  • Log the PROMPT, not just the response. Debugging requires seeing what Claude actually received. Redact sensitive data at log-write time, not at query time.
  • For tool-using agents, log the tool-call GRAPH (what was called, what returned, in what order). Reconstructing from individual log lines is painful; structured graph data is debugger-friendly.
  • Alert on COST spikes, not just cost thresholds. Absolute threshold alerts fire too late. Spike detection (e.g., 3σ above rolling 7-day mean) catches problems earlier.

Customization tips

  • Start with LOGGING before TRACING. A single week of good structured logs gives you 70% of the debugging value. Traces are additive once logs work.
  • Don't skip PII redaction. GDPR + SOC2 both require it for user data. Retrofit is painful — build it in from first log line.
  • Cost attribution via Langfuse is built-in but the aggregation to your Postgres is custom work. Plan for 2-3 days of implementation for clean per-user rollups.
  • Your eval test set should grow over time. Every user-reported bug becomes a test case. Within 6 months, aim for 200+ test cases covering edge cases + common patterns.
  • For regulated industries (HIPAA, FINRA), add eval for compliance-specific concerns (e.g., 'did agent say anything that could be interpreted as medical/financial advice'). Compliance eval is its own pipeline.

Variants

Startup/MVP Mode

For teams shipping first production agent. Minimal stack — structured logs + cost tracking + basic alerts. Can upgrade later.

Scale-Up Mode

For teams at 100K+ agent invocations/month. Full stack — OpenTelemetry + Langfuse + Datadog + evals pipeline.

Enterprise Compliance Mode

For regulated industries. Emphasizes audit logs, data retention, PII handling, and compliance-ready monitoring.

Cost-Attribution Mode

For teams focused on FinOps. Emphasizes per-user + per-tool cost attribution, chargeback data, cost optimization dashboards.

Frequently asked questions

How do I use the AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents?

Claude Opus 4 or Sonnet 4.5. Observability architecture requires reasoning about distributed systems, LLM-specific concerns, cost economics, and operational practices. Top-tier reasoning matters.

Can I customize the AI Agent Observability Stack — Logs, Traces, Evals, Cost For Production Agents prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: instrument first and optimize later (a production agent without traces is undebuggable, so add observability on day 1 even if the stack is just structured logs), and trace every LLM call with the eight core fields: model, tokens in/out, cost, latency, prompt version, tool calls requested/executed, and final output length. These cover 90% of debugging needs.
