⚡ Promptolis Original · Coding & Development
🔭 Logging & Observability Instrumentor
Designs your structured logging + metrics + tracing strategy: what to log at what level, the 7 dimensions every event needs, and the dashboards that matter when production is on fire.
Why this is epic
Most logging is `console.log('here')` and `logger.info('user did thing')`. Then production breaks at 3am + you can't reconstruct what happened. This Original designs structured logging, metrics, and tracing that actually help you debug.
Outputs the complete strategy: log levels with explicit policies, structured fields (the 7 dimensions every event needs), metric instrumentation (RED method + USE method), distributed tracing setup, dashboard architecture, and the alert rules that wake you only when needed.
Calibrated to the 2026 observability stack: OpenTelemetry as the standard, Datadog/New Relic/Honeycomb as the consumers, and the tradeoff between vendor-hosted platforms and the cost control of self-hosted Grafana. Picks the right depth for your scale.
Includes the 'what NOT to log' section. Most observability cost is logs nobody reads. Log volume isn't quality; right-fields-at-right-events is.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<service-architecture>3 services: API (Node/Fastify), Worker (Node/BullMQ), Frontend (Next.js). All in Kubernetes on AWS EKS.</service-architecture>
<stack>Node.js 20 + TypeScript. Postgres (RDS), Redis. Multi-region: us-east + eu-west.</stack>
<scale>API: 200 req/sec average, 800 peak. Worker: 10K jobs/day. 4 nodes per service per region (8 total per service).</scale>
<current-observability>Datadog APM + Logs. Set up 18 months ago by a dev who left. ~1500 metrics + 50 dashboards, but team only uses ~5 dashboards. Logs are mostly stringy: `logger.info("User did thing")`. Trace IDs don't propagate to Worker (broken). Alerts page 4-5 times/week, 80% are noise.</current-observability>
<observability-stack>Datadog (locked in via AWS marketplace contract). Considering OTel for vendor flexibility.</observability-stack>
<biggest-pain>(1) Datadog bill is $4500/month. Up 60% YoY despite minor traffic growth. (2) On-call burnout from noisy alerts. (3) Last major incident: couldn't trace user request from API → Worker → Postgres because trace propagation is broken.</biggest-pain>
<budget-tolerance>Want to cut DD bill in half ($2K range) without losing critical signal.</budget-tolerance>
<sla-targets>99.9% availability, p95 latency <500ms.</sla-targets>
<team-size>4 backend engineers, 1 frontend, 1 DevOps. On-call rotation among the 4 backend.</team-size>
<integrations>Stripe, Twilio, OpenAI, Postmark. Internal service mesh (Istio).</integrations>
Observability Strategy: 3-Service Node Kubernetes Stack
Diagnosis
Three concrete issues:
1. Stringy logs make debugging slow. Can't query 'show me all errors for user_id=123 across services.'
2. Trace propagation broken between API → Worker. Distributed debugging impossible.
3. Datadog cost runaway: $4500/month is high for your scale (~200 req/sec). Likely because you log every request body + sample 100%.
Highest-leverage fixes:
1. Migrate to structured logging. Pino (Node) → Datadog. Foundation for queryable logs.
2. Fix trace propagation. OpenTelemetry SDK in API + Worker. Unblocks distributed debugging.
3. Implement sampling. 1% of successful requests, 100% of errors. Cuts log volume 80%+ → cuts DD bill ~50%.
4. Audit + delete unused dashboards. 50 → 5 active dashboards. Cuts noise.
5. Tune alerts to <1/day. 4-5 pages/week is burnout territory.
Estimated outcome: DD bill drops to ~$2K/month. Distributed traces work. Alerts page once or twice/week, all actionable.
The 7 Dimensions of Every Event
Every log line, span, and metric event includes:
{
// 1. Identity
service: 'api', // or 'worker', 'frontend'
environment: 'production', // or 'staging', 'development'
region: 'us-east-1', // for multi-region debugging
// 2. Trace correlation
trace_id: 'abc123def456', // OTel-generated, propagated
span_id: 'xyz789', // current span
request_id: 'req_456', // your app-level request ID
// 3. User context (when applicable)
user_id: 'user_789', // never email/PII
workspace_id: 'ws_456',
// 4. Event
event: 'http.request' | 'job.processed' | 'db.query' | 'external.api.call',
// 5. Level
level: 'info' | 'warn' | 'error' | 'debug',
// 6. Timing
timestamp: '2026-04-28T14:30:00.123Z',
duration_ms: 47, // for completed events
// 7. Outcome / context
status: 'success' | 'error' | 'partial',
error: { class: 'RetriableError', message: '...', stack: '...' },
context: { /* event-specific data */ }
}
These 7 dimensions let you filter by service + region, follow a request across services via trace_id, find all events for a user, group by event type, and slice by error status.
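If you want this schema enforced at compile time, a minimal TypeScript sketch of the same shape (the type name and optionality choices are illustrative, not part of any library):
// Hypothetical shared type mirroring the 7 dimensions above; adjust names to your conventions.
type StructuredEvent = {
  // 1. Identity
  service: 'api' | 'worker' | 'frontend';
  environment: 'production' | 'staging' | 'development';
  region: string;
  // 2. Trace correlation
  trace_id?: string;
  span_id?: string;
  request_id?: string;
  // 3. User context (when applicable)
  user_id?: string;          // never email/PII
  workspace_id?: string;
  // 4. Event
  event: string;             // e.g. 'http.request', 'job.processed'
  // 5. Level
  level: 'debug' | 'info' | 'warn' | 'error';
  // 6. Timing
  timestamp: string;         // ISO 8601
  duration_ms?: number;      // for completed events
  // 7. Outcome / context
  status?: 'success' | 'error' | 'partial';
  error?: { class: string; message: string; stack?: string };
  context?: Record<string, unknown>;
};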
Log Levels Policy
- error → CriticalError, unexpected throws, integrity violations.
  ALWAYS sent to Datadog. Sentry-grade events. Triggers alerts above thresholds.
- warn → RetriableError attempts, retry failures, circuit breaker state changes, slow queries (>1s), elevated error rates within tolerable range.
  ALWAYS sent to Datadog. Rolled into the daily digest.
- info → Standard operational events: HTTP request received/completed, job started/completed, business events (subscription created, etc.).
  SAMPLED in production (1% rate); 100% in staging. Used for activity feeds + business metrics.
- debug → Verbose internal state: variable values, branch decisions.
  NOT in production by default. Enable per-request via header for targeted debugging (sketch below).
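A minimal sketch of that per-request debug switch, assuming Fastify and the Pino logger defined in /lib/logger.ts below; the x-debug header name and DEBUG_HEADER_SECRET env var are illustrative choices, not a standard:
// Sketch: per-request debug logging via an opt-in header (header name + secret are hypothetical).
app.addHook('onRequest', async (request) => {
  if (request.headers['x-debug'] === process.env.DEBUG_HEADER_SECRET) {
    // request.log is the per-request Pino child logger; only this request becomes verbose
    request.log.level = 'debug';
  }
});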
Structured Logging Implementation (Node.js + Pino)
// /lib/logger.ts
import pino from 'pino';
import { context, trace } from '@opentelemetry/api';
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: process.env.SERVICE_NAME,
environment: process.env.NODE_ENV,
region: process.env.AWS_REGION,
version: process.env.GIT_SHA,
},
formatters: {
level: (label) => ({ level: label }),
},
mixin() {
// Auto-attach trace context to every log line
const span = trace.getActiveSpan();
if (span) {
const ctx = span.spanContext();
return { trace_id: ctx.traceId, span_id: ctx.spanId };
}
return {};
},
redact: {
paths: ['password', '*.password', 'token', '*.token', 'authorization', 'cookie', 'creditCard'],
censor: '[REDACTED]',
},
});
// Usage:
logger.info({ event: 'user.signup', user_id: 'user_123', source: 'organic' }, 'User signed up');
logger.warn({ event: 'rate_limit.hit', user_id: 'user_456', endpoint: '/api/search' });
logger.error({ event: 'job.failed', job_id: 'job_789', error: serializeError(e), error_class: e.constructor.name });
Critical patterns:
- Object first, then optional message string
- Always include an `event` field — categorical, queryable
- Always include relevant IDs
- NEVER log passwords, tokens, full credit cards (Pino `redact` handles this)
- Auto-attach trace context via the mixin (so every log line is correlatable)
Metrics Instrumentation (RED + USE)
RED Method (per service endpoint/handler)
// Rate, Errors, Duration — track for every endpoint
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('api');
// Histogram for duration (gives you p50, p95, p99)
const requestDuration = meter.createHistogram('http.server.duration', {
description: 'HTTP request duration',
unit: 'ms',
});
// Counter for rate (total + by status)
const requestCount = meter.createCounter('http.server.requests', {
description: 'Total HTTP requests',
});
// In your Fastify hooks: stamp the start time on request, record metrics on response
app.addHook('onRequest', async (request) => {
  request.startTime = Date.now();
});
app.addHook('onResponse', async (request, reply) => {
  const duration = Date.now() - request.startTime;
  const labels = {
    method: request.method,
    route: request.routerPath,
    status_code: reply.statusCode,
    status_class: `${Math.floor(reply.statusCode / 100)}xx`,
  };
  requestDuration.record(duration, labels);
  requestCount.add(1, labels);
});
Metrics naming: <domain>.<thing>.<unit>. Examples:
- http.server.duration (histogram, ms)
- http.server.requests (counter)
- db.query.duration (histogram, ms)
- queue.job.duration (histogram, ms)
- external.api.duration (histogram, ms, with a service label)
USE Method (per resource)
For each resource type:
- Utilization: % busy. CPU%, Memory%, connection-pool used / max.
- Saturation: queue depth. Pending requests, queued jobs, blocked threads.
- Errors: errors per resource. Connection errors, timeouts.
Datadog's APM auto-instruments many of these (CPU, memory, k8s pod metrics). Add custom metrics for the rest (sketch below):
- Database connection pool: db.pool.active, db.pool.waiting
- BullMQ queue depth: queue.size per queue name
- Redis connection state
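A sketch of those gauges using OpenTelemetry observable instruments. The pgPool and emailQueue imports are hypothetical stand-ins for your pg Pool and BullMQ Queue instances; the accessors used (totalCount, idleCount, waitingCount, getWaitingCount) are the real pg and BullMQ APIs:
// Sketch: USE-style resource gauges (assumes the OTel MeterProvider from the setup section).
import { metrics } from '@opentelemetry/api';
import { pgPool } from './db';          // hypothetical module exporting your pg.Pool
import { emailQueue } from './queues';  // hypothetical module exporting a BullMQ Queue

const meter = metrics.getMeter('resources');

meter.createObservableGauge('db.pool.active').addCallback((result) => {
  result.observe(pgPool.totalCount - pgPool.idleCount);
});
meter.createObservableGauge('db.pool.waiting').addCallback((result) => {
  result.observe(pgPool.waitingCount);
});
meter.createObservableGauge('queue.size').addCallback(async (result) => {
  result.observe(await emailQueue.getWaitingCount(), { queue: emailQueue.name });
});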
Distributed Tracing
Fix the API → Worker propagation:
// API service: when enqueueing a job
import { context, propagation, trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
const carrier = {};
propagation.inject(context.active(), carrier);
await queue.add('process-thing', {
data: jobData,
traceContext: carrier, // Pass trace context as job metadata
});
// Worker: when picking up a job. In BullMQ the processor is registered via the Worker constructor.
import { Worker } from 'bullmq';
import { context, propagation, trace } from '@opentelemetry/api';
const worker = new Worker(queueName, async (job) => { // queueName = the same queue the API enqueues to
const parentContext = propagation.extract(context.active(), job.data.traceContext);
return context.with(parentContext, async () => {
const span = trace.getTracer('worker').startSpan('worker.process_job');
try {
// ... process job, all logs auto-attach trace_id
span.end();
} catch (e) {
span.recordException(e);
span.end();
throw e;
}
});
});
Now when you query Datadog for trace_id=abc123, you see API span → enqueue → Worker span → DB query → result. Reconstructable end-to-end.
What to trace:
- Every HTTP request (auto by OTel HTTP instrumentation)
- Every DB query (auto by OTel pg instrumentation)
- Every external API call (Stripe, OpenAI, etc.) — add custom spans (see the wrapper sketch after this list)
- Every queue job — manual span as above
- Every cache get/set (auto if you use OTel Redis instrumentation)
What NOT to trace:
- Trivial helper function calls (parsing, validation) — adds noise
- Internal loops — span per iteration explodes trace size
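One way to add those custom spans for external calls is a small wrapper; withExternalSpan is a hypothetical helper name, sketched here with the standard OTel API:
// Sketch: wrap every outbound Stripe/OpenAI/Twilio/Postmark call in its own span.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('external');

export async function withExternalSpan<T>(service: string, op: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(`external.${service}.${op}`, async (span) => {
    span.setAttribute('peer.service', service);
    try {
      return await fn();
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      span.end();
    }
  });
}

// Usage:
// const charge = await withExternalSpan('stripe', 'create_charge', () => stripe.charges.create(params));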
Sampling Strategy
Tail sampling via OTel collector:
# otel-collector-config.yaml
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 1000
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow
type: latency
latency: { threshold_ms: 500 }
- name: probabilistic
type: probabilistic
probabilistic: { sampling_percentage: 1 }
Rules:
- 100% of error traces (status=ERROR)
- 100% of slow traces (>500ms)
- 1% of normal-success traces
Log sampling:
- error level: 100%
- warn level: 100%
- info level: 1% (for high-volume events like every HTTP request); 100% for important business events (subscription_created, payment_processed)
- debug level: 0% in production
At 200 req/sec × 86400s × 1% sampling = ~170K info logs/day instead of 17M. ~99% cost reduction on logs.
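Pino has no built-in sampler, so info-level sampling has to live in application code (or in the log pipeline). A minimal sketch assuming the logger from /lib/logger.ts; the infoSampled name and the allowlist are illustrative:
// Sketch: keep 1% of info events, but always keep the important business events.
import { logger } from './lib/logger'; // path assumed from the structured-logging section

const INFO_SAMPLE_RATE = 0.01;
const ALWAYS_KEEP = new Set(['subscription_created', 'payment_processed']);

export function infoSampled(fields: { event: string; [key: string]: unknown }, msg?: string) {
  if (ALWAYS_KEEP.has(fields.event) || Math.random() < INFO_SAMPLE_RATE) {
    logger.info(fields, msg);
  }
}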
Dashboard Architecture
Delete the 50 existing dashboards. Build these 5:
1. Service Health (default open during incidents)
- Per-service: rate, error rate, p50/p95/p99 latency
- Resource health: CPU, memory per pod
- DB: connection pool, query latency p95
- Redis: connection count, latency
- External APIs: error rate per service (Stripe, OpenAI, Twilio)
- Active alerts panel
2. Business Metrics (daily review)
- Signups/hour
- Subscriptions created/hour
- Revenue/hour
- Active users
- Conversion funnel (signup → activated → paid)
3. SLO Dashboard (weekly review)
- Availability vs 99.9% SLO
- Latency SLO (p95 <500ms)
- Error budget remaining
- Burn rate (are we trending toward breach?)
4. Worker / Queue Health
- Job rate per queue
- Job failure rate
- DLQ depth
- Job duration p95
- Queue saturation
5. Cost Dashboard (monthly)
- Datadog cost trend
- Volume of logs / metrics / traces sent
- Top expensive metrics (which to drop)
Deletion criteria for existing 50 dashboards: if not opened by anyone in 60 days → delete. Datadog has 'last viewed' metadata.
Alert Rules
Page on-call (PagerDuty):
- Service availability <99.5% for 5 min (SLO breach risk)
- p95 latency >2× SLO for 5 min
- Error rate >2% for 5 min
- Database connection pool exhausted
- Queue depth >5x baseline for 10 min
- Critical service totally down (zero requests for 2 min on a service expecting traffic)
Slack #alerts (no page):
- Error rate >0.5% for 5 min
- p95 latency >SLO for 10 min
- DLQ growth >5 jobs in 5 min
- External API error rate spike (Stripe, OpenAI)
- New error class appearing
Daily digest:
- Top 5 error types by volume
- Latency p95 trend week-over-week
- Cost trend
- Slow queries from DB
Tuning rules:
- If an alert fires >3 times in a week without action being taken → either fix root cause OR raise threshold
- Page-rate target: <1 page per on-call shift on average
- Action rate target: 95%+ of pages should result in action (else they're noise)
What NOT to Log
1. Full request/response bodies on every request. This is likely your DD bill killer. Log the request shape (method, route, status, duration), not the body, unless the request errored (sketch after this list).
2. Heartbeats / health checks. Pollute logs without value.
3. Successful auth tokens or session IDs. Log user_id only.
4. Verbose ORM / library internal logs. Set those libraries to warn-level.
5. Frontend page-load events. Use front-end RUM (Real User Monitoring) sparingly; not every page-load.
6. Repeated identical errors within 10s. Aggregate; don't flood.
7. Logs from automated tests in production environments. Should never happen but does.
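A sketch of that shape-not-body pattern for Fastify, reusing the onResponse hook and Pino request logger from earlier sections (startTime is assumed to be stamped in the onRequest hook shown in the RED section):
// Sketch: log the request shape on every response; include the body only when the request failed.
app.addHook('onResponse', async (request, reply) => {
  const base = {
    event: 'http.request',
    method: request.method,
    route: request.routerPath,
    status_code: reply.statusCode,
    status: reply.statusCode < 500 ? 'success' : 'error',
    duration_ms: Date.now() - request.startTime,
  };
  if (reply.statusCode >= 500) {
    // Body only on server errors; this is what keeps log bytes (and the DD bill) down
    request.log.error({ ...base, context: { body: request.body } });
  } else {
    request.log.info(base);
  }
});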
Cost Estimation
Datadog pricing (rough, US East 2026):
- Logs: $0.10 / GB ingested, plus per-event indexing charges that scale with retention
- Custom metrics: $0.05 / metric / month
- APM: $31 / host / month + spans-per-month tier
- Synthetics: $5 / 10k tests / month
Your projected post-redesign:
- Logs: ~5GB/day after sampling; with ingestion plus indexing/retention, budget roughly $450/month (was likely $1500+)
- Custom metrics: ~80 metrics × $0.05 × 8 hosts = $32/month
- APM: 8 hosts × $31 = $248/month + reasonable span volume = ~$500/month
- Total estimate: ~$1200/month (cut from $4500)
If still over budget after sampling, next levers: log retention reduction (default 15-day → 7-day), metric reduction (drop low-value metrics), APM host count optimization.
OpenTelemetry Setup
// /otel/setup.ts (loaded at process startup)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME,
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA,
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
  // If you pass url explicitly, include the signal path (e.g. .../v1/traces);
  // or omit url and let the exporter read OTEL_EXPORTER_OTLP_ENDPOINT itself.
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
}),
// NodeSDK takes a metricReader (not a bare exporter) for metrics
metricReader: new PeriodicExportingMetricReader({
  exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instrument: HTTP, Express/Fastify, pg, Redis, AWS SDK
'@opentelemetry/instrumentation-fs': { enabled: false }, // skip file ops
}),
],
});
sdk.start();
For Datadog: route OTel collector → Datadog. Or use Datadog's OTel-compatible endpoint directly.
Why OTel even on Datadog: vendor flexibility. If you outgrow Datadog or want to evaluate alternatives in 2 years, your instrumentation stays the same — only the export endpoint changes.
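A sketch of that collector-side wiring, assuming the OpenTelemetry Collector contrib distribution (which ships the datadog exporter); merge it with the tail_sampling config above:
# otel-collector-config.yaml (continued): ship sampled telemetry to Datadog
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
exporters:
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}   # env expansion supported by recent collector versions
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]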
Implementation Order
Week 1: Foundation
- Pino structured logging in API + Worker (replace console.log + stringy logger calls)
- OpenTelemetry SDK in API service
- Pino → Datadog log collector
- Update logger calls in 30 highest-traffic code paths
Week 2: Trace propagation
- OTel SDK in Worker
- Trace context propagation API → Worker
- Add custom spans for external API calls (Stripe, OpenAI, Twilio)
- Verify end-to-end trace works
Week 3: Sampling + cost reduction
- OTel collector with tail sampling
- Log-level sampling rules
- Drop high-volume non-error logs to 1%
- Drop request body logging
Week 4: Dashboards + alerts
- Build the 5 new dashboards
- Delete 45 old dashboards
- Tune alert thresholds (drop alerts that fire >3/week without action)
- Document the 5 dashboards in team wiki
Week 5: Tune
- Watch new alert volume
- Verify cost reduction in DD billing
- Address any blind spots discovered when alerts fire
What This Strategy Won't Solve
- Won't fix bugs. Observability tells you where + when bugs happen; you still need to fix them.
- Won't replace runbooks. When alerts fire, on-call needs documented response procedures.
- Won't scale linearly without continued tuning. As traffic grows, sampling rates + retention need re-evaluation.
- Won't help if alerts are ignored. Culture > tooling. Train team to investigate every page.
- Won't reduce cost forever. Observability cost grows with scale; expect ongoing cost-management as a discipline.
Maintenance Cadence
Weekly (on-call rotation):
- Review alerts that fired. Any noise? Tune.
- Check error budget burn rate.
Monthly (RevOps + DevOps):
- Datadog cost review. Top spending sources?
- Dashboard usage audit. Any unused?
- Sampling rate appropriate for current volume?
Quarterly:
- Full alert tuning review
- New dashboards needed for emerging operational areas?
- Vendor evaluation: still right tool? OTel makes switching cheaper.
Annually:
- Full observability strategy review
- Cost forecast vs business growth
- Tooling consolidation/expansion decisions
Key Takeaways
- Structured logging (Pino + 7 dimensions) is the foundation. Without it, queryable debugging is impossible.
- Fix trace propagation API → Worker. OpenTelemetry SDK + carrier in job metadata. Distributed debugging works again.
- Sample 1% of success, 100% of errors. Cuts log volume 99%, DD bill ~50%.
- 5 dashboards, not 50. Service Health (incident default), Business, SLO, Worker/Queue, Cost.
- Alert tuning target: <1 page per on-call shift, 95%+ actionable. Page-rate >2/shift = retune.
- OpenTelemetry baseline. Vendor SDKs only when necessary. Future-proofs your instrumentation.
Common use cases
- Engineer building a new service + wanting observability designed upfront
- Tech lead establishing observability standards across 5+ services
- Solo founder whose Datadog bill spiraled + needs to cut volume without losing signal
- On-call engineer who can't reconstruct incidents because logs are too thin or too noisy
- Team migrating from print-statements + ad-hoc to structured observability
- Engineer integrating OpenTelemetry into an existing codebase
Best AI model for this
Claude Opus 4. Observability design needs reasoning about debugging workflows, cost tradeoffs, and operational patterns — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Pro tips
- Structured > stringy. `logger.info({ event, userId, ... })` beats `logger.info('User 123 did thing')`. Queryable + machine-readable.
- RED method for services: Rate, Errors, Duration. USE method for resources: Utilization, Saturation, Errors. Both, not one.
- Trace IDs must propagate across services. Without them, distributed debugging is impossible.
- Log SAMPLED, not everything. 100% of events at 1K req/sec = $$$$ in observability spend. Sample non-error traffic; keep all errors.
- Dashboards exist for incidents. If you don't open it during an incident, delete it.
- Alerts wake humans. Tune to actionable. Page-on-everything = page-on-nothing within 2 weeks.
- OpenTelemetry is the 2026 standard. Don't lock into vendor-specific SDKs unless you have a reason.
Customization tips
- Be specific about service architecture. Single-service vs multi-service vs serverless need different trace propagation strategies.
- List your stack precisely. OTel auto-instrumentation differs per language/framework — Node has good auto coverage; Go is more manual.
- Specify your observability stack. Datadog vs New Relic vs Honeycomb pricing models differ; cost-reduction strategies are vendor-specific.
- Be honest about budget tolerance. Cost-aggressive design favors heavy sampling; signal-aggressive design favors more retention.
- Specify SLA targets. Alert thresholds calibrate to availability + latency goals.
- Use the Cost-Reduction Mode variant if your existing observability spend is the primary problem — it audits current logging + metrics and identifies cuts.
Variants
Greenfield Mode
For new services — designs observability from day one with the right OTel + structured-log baseline.
Cost-Reduction Mode
For existing systems with runaway observability spend — audits current logging + metrics, cuts noise, keeps signal.
Distributed Systems Mode
For multi-service architectures — emphasizes trace propagation, service mesh integration, cross-service correlation.
Solo SaaS Mode
For 1-3 person operations — picks minimum-viable observability that fits a small budget without losing critical signal.
Frequently asked questions
How do I use the Logging & Observability Instrumentor prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Logging & Observability Instrumentor?
Claude Opus 4. Observability design needs reasoning about debugging workflows, cost tradeoffs, and operational patterns — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Can I customize the Logging & Observability Instrumentor prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: structured over stringy logging (`logger.info({ event, userId, ... })` beats `logger.info('User 123 did thing')` because it's queryable and machine-readable), and using both the RED method for services (Rate, Errors, Duration) and the USE method for resources (Utilization, Saturation, Errors).
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals