
⚡ Promptolis Original · Coding & Development

⚙️ Background Job Queue Designer

Designs your queue + worker architecture: which jobs in which queue, retry policy, priority lanes, the dead-letter queue — picking the right pattern between BullMQ / Sidekiq / SQS / Inngest for YOUR scale.

⏱️ 5 min to set up 🤖 ~110 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most teams pick a queue library first, then design backwards. Wrong order. This Original picks the right queue infrastructure for YOUR scale, then designs the architecture: queue partitioning, worker pools, retry policies, DLQ, and the priority-lane pattern that keeps user-waiting jobs fast.

Outputs the complete design: queue inventory (which queues, why), worker pool sizing, per-job-type retry + idempotency, priority strategy, DLQ pattern, observability, and the 'when to migrate to a different queue infra' triggers.

Calibrated to 2026 queue reality: BullMQ for Node, Sidekiq for Ruby, Celery for Python, SQS for AWS-native, Inngest for serverless-friendly. Each has trade-offs; the Original picks based on your actual constraints.

Includes the 5 anti-patterns most teams fall into: putting too much in one queue, lacking priority lanes, skipping idempotency keys, ignoring the DLQ, and triggering retry storms during outages.

The prompt

Promptolis Original · Copy-ready
<role>
You are a queue + async processing architect with 7+ years designing job systems on BullMQ, Sidekiq, Celery, SQS, Inngest, Temporal. You have shipped 30+ queue architectures handling millions of jobs/day combined. You know which patterns scale + which break. You are direct. You will tell a builder their single-queue design will hit head-of-line blocking, that their retry policy is causing storms, or that they need workflow orchestration not raw queues for multi-step processes. You refuse to recommend more queues as a generic fix — fewer right-shaped queues beats many fragmented ones.
</role>

<principles>
1. Queues by job class (priority, latency), not domain.
2. Idempotency keys at job creation.
3. Priority lanes. Don't mix latencies.
4. Exponential backoff + jitter. Avoid storms.
5. DLQ inspection day one.
6. Concurrency vs worker count: distinct concepts.
7. Monitor queue depth + age, not just rate.
</principles>

<input>
<job-types>{the kinds of async work — emails, reports, image processing, etc.}</job-types>
<volume>{jobs/day, peak/day, expected growth}</volume>
<latency-tolerance>{per job type — instant / minutes / hours / batch overnight}</latency-tolerance>
<job-shape>{IO-bound vs CPU-bound, average duration, max duration}</job-shape>
<existing-state>{nothing / ad-hoc setTimeout / partial queue / mature but messy}</existing-state>
<infrastructure>{Redis available? AWS? GCP? deployment model: containers / serverless?}</infrastructure>
<scale-targets>{1y, 3y job volume projections}</scale-targets>
<reliability-needs>{must-not-lose vs OK-to-drop-occasionally}</reliability-needs>
<team>{who maintains; queue expertise level}</team>
</input>

<output-format>
# Queue Architecture: [system]
## Queue Infrastructure Choice
Which library/service for YOUR stack + scale. Why this not alternatives.
## Queue Inventory
List of queues. For each: name, job class, priority, expected throughput, worker count.
## Per-Job-Type Specification
For each job type: which queue, idempotency key strategy, retry policy, expected duration, failure handling.
## Priority Strategy
How latency-sensitive jobs stay fast. Lane separation.
## Worker Pool Sizing
Number of workers per queue. Concurrency per worker. CPU + memory per worker.
## Retry & DLQ
Backoff policy, max attempts, DLQ structure, alert rules.
## Idempotency Pattern
How to ensure jobs don't double-process. Key strategy.
## Observability
Metrics: queue depth, age of oldest, throughput, failure rate. Alerts.
## Implementation Skeleton
File structure + worker code patterns.
## Scaling Boundaries
When this architecture hits limits. Triggers for migration.
## What This Architecture Won't Solve
Honest limits.
## Migration from Existing
If existing system: parallel-run + cutover plan.
## Maintenance Cadence
When to revisit, audit, scale.
## Key Takeaways
4-6 bullets — for the team's playbook.
</output-format>

<auto-intake>
If input incomplete: ask for job types, volume, latency tolerance, job shape, existing state, infrastructure, scale targets, reliability, team.
</auto-intake>

Now, design the queue architecture:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<job-types>(1) Send transactional email (welcome, receipt, password reset). (2) Generate PDF reports (monthly customer reports — bulk on month-end). (3) Image thumbnail processing (after upload). (4) Webhook delivery to customer endpoints (we send events to OUR customers' webhooks). (5) Sync to analytics (batch every 15 min). (6) Stripe webhook processing (incoming events from Stripe, processed async).</job-types>
<volume>Emails: ~1000/day, peak 5K/day. Reports: ~500 jobs on month-end day, 1 job per customer. Images: ~2K/day. Webhook delivery: ~10K/day. Analytics sync: 96/day (every 15 min). Stripe webhooks: 5K/day, peak 20K on first-of-month.</volume>
<latency-tolerance>Emails: <30s for transactional (user expects). Reports: 24h OK. Images: <10s (user uploads, expects to see thumbnail soon). Webhook delivery: <1 min ideal. Analytics sync: scheduled, ±1 min OK. Stripe webhooks: <30s for customer-visible state changes.</latency-tolerance>
<job-shape>Emails: IO-bound, ~500ms each. Reports: CPU + IO, 30s-5min each. Images: CPU-bound (thumbnail generation), ~3s each. Webhook delivery: IO-bound, ~1s. Analytics: IO-bound, ~30s per batch. Stripe webhooks: IO-bound, ~500ms.</job-shape>
<existing-state>Mixed: emails go through Sendgrid directly synchronously (sometimes timeout). Reports use a setTimeout cron + sync code. Images use a hand-rolled BullMQ setup. Webhooks have no retry. Analytics sync is ad-hoc cron. Stripe webhooks are sync (sometimes timeout).</existing-state>
<infrastructure>Node.js 20 + Postgres + Redis (Upstash). Vercel for API. Considering moving workers off Vercel since they're serverless-unfriendly.</infrastructure>
<scale-targets>2x growth in 1y. 5x in 3y. So peak ~100K jobs/day at 3y.</scale-targets>
<reliability-needs>Emails: must not lose (user expects). Reports: must not lose (compliance). Images: OK to retry user-side if lost. Webhook delivery: must deliver (we promise). Analytics: OK to skip if 1 batch. Stripe webhooks: must process (financial data).</reliability-needs>
<team>4 backend engineers. 2 have BullMQ experience. None have Temporal/Inngest experience.</team>
🤖 Output

Queue Architecture: B2B SaaS Multi-Job-Type System

Queue Infrastructure Choice

BullMQ on dedicated Redis (Upstash + Fly.io workers). Reasons:

  • Team has BullMQ experience; switching costs are real
  • Volume (~20K/day, ~100K target) is well within BullMQ capability
  • Redis already in stack; no new infrastructure
  • Move workers off Vercel onto Fly.io / Railway / dedicated long-running infra

Alternatives considered:

  • AWS SQS: would work, but you don't have AWS-native infra; adds vendor lock-in
  • Inngest: great for workflow orchestration but overkill for your job types (mostly single-step, not multi-step workflows)
  • Temporal: great for complex workflows, but learning curve + complexity not justified for your job mix

Reconsider migration to Inngest or Temporal IF you start needing complex multi-step workflows (saga pattern, long-running approvals, complex state machines). Right now: BullMQ.

Queue Inventory

Queue                  Priority    Latency Target   Avg Throughput
─────────────────────────────────────────────────────────────────
user-fast              HIGH        <10s             ~3K/day
  (image thumbnails, password reset emails)

user-medium            MEDIUM      <60s             ~6K/day
  (transactional emails, Stripe webhook processing)

outbound-webhooks      MEDIUM      <60s             ~10K/day
  (delivery to customer webhook URLs — separate due to potentially-slow customer endpoints)

batch-fast             LOW         <15min           ~100/day
  (analytics sync, small reports)

batch-slow             LOW         <24h             ~500-2K/day spike
  (monthly customer reports, end-of-month bulk)

DLQ                    n/a         n/a              accumulator

Why these splits:

  • user-fast keeps user-waiting jobs out of the same lane as long-running reports
  • outbound-webhooks separated because customer endpoints can be slow (10s timeout) — don't block other jobs
  • batch-slow handles month-end spikes without affecting daily ops
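A minimal queues.ts sketch for this inventory, assuming an ioredis connection (BullMQ requires maxRetriesPerRequest: null on connections; queue names are taken from the table above):

// services/queue/src/queues.ts: one Queue instance per lane (sketch)
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

// Shared connection; maxRetriesPerRequest: null is mandatory for BullMQ
const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

export const userFast = new Queue('user-fast', { connection });
export const userMedium = new Queue('user-medium', { connection });
export const outboundWebhooks = new Queue('outbound-webhooks', { connection });
export const batchFast = new Queue('batch-fast', { connection });
export const batchSlow = new Queue('batch-slow', { connection });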

Per-Job-Type Specification

Transactional emails
  • Queue: user-fast for password resets (user is waiting); user-medium for receipts (less urgent)
  • Idempotency key: email:${userId}:${templateId}:${eventId} — same key = same email, deduped
  • Retry policy: 3 attempts, 1min/5min/15min backoff
  • Failure: DLQ; daily review by support team
Image thumbnails
  • Queue: user-fast
  • Idempotency key: thumb:${imageId}:${size} — keyed by image + size variant
  • Retry policy: 2 attempts, immediate then 30s backoff
  • Failure: DLQ; user can retrigger by re-uploading
  • Note: CPU-bound — limit concurrency per worker (1-2 per CPU)
PDF reports (monthly bulk)
  • Queue: batch-slow
  • Idempotency key: report:${customerId}:${period} — once per customer-period
  • Retry policy: 5 attempts, 5min/15min/30min/1h/2h backoff
  • Failure: DLQ; alert support
  • Note: memory-heavy; size workers accordingly (~512MB each)
Webhook delivery (to customer URLs)
  • Queue: outbound-webhooks
  • Idempotency key: webhook:${eventId} — enforced by us, not by customer
  • Retry policy: 6 retries over ~30h: 1m, 5m, 15m, 1h, 4h, 24h backoff
  • Failure: DLQ + customer notification email + UI shows 'webhook failed'
  • Timeout: 10s per delivery attempt (customer endpoints may be slow; see the handler sketch after this list)
  • Note: customer-facing — clear error reporting matters
Analytics sync
  • Queue: batch-fast
  • Idempotency key: analytics-sync:${batchTimestamp} — one per 15-min window
  • Retry policy: 3 attempts
  • Failure: DLQ + alert; reconcile with next batch
  • Schedule: cron at :00, :15, :30, :45 of each hour
Stripe webhook processing
  • Queue: user-medium
  • Idempotency key: stripe:${eventId} — Stripe's event.id
  • Retry policy: 6 attempts, exponential ×5 backoff: 5s/25s/125s/625s/52min
  • Failure: DLQ + alert (financial data — high priority)
  • Note: event handlers idempotent at DB level (UPSERT, unique constraint)
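A sketch of the outbound-webhook delivery handler enforcing the 10s per-attempt timeout with Node 20's built-in fetch; the URL and payload field names are illustrative:

// deliver-webhook.ts (sketch): fail fast on slow customer endpoints so the
// job falls through to BullMQ's retry schedule instead of hogging a worker slot
export async function deliverWebhook(data: { url: string; eventId: string; payload: unknown }) {
  const res = await fetch(data.url, {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'x-event-id': data.eventId },
    body: JSON.stringify(data.payload),
    signal: AbortSignal.timeout(10_000), // 10s cap per delivery attempt
  });
  // Treat non-2xx as failure so BullMQ retries on the backoff schedule
  if (!res.ok) throw new Error(`Webhook delivery failed: HTTP ${res.status}`);
  return { delivered: true, status: res.status };
}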

Priority Strategy

BullMQ supports priority within a queue + multiple queues. Use both:

Across queues: each BullMQ Worker binds to a single queue, so give every queue its own dedicated worker pool. user-fast then always has reserved capacity and never waits behind batch work.

Within queue: for user-fast and user-medium, use BullMQ's priority option — password reset (priority 1) before receipt (priority 5).

await emailQueue.add('send-email', { ... }, { priority: 1 }); // password reset
await emailQueue.add('send-email', { ... }, { priority: 5 }); // receipt

Lower number = higher priority.

Worker Pool Sizing

Per queue:

  • user-fast: 2 workers, concurrency 10 each (IO + light CPU). Total: 20 jobs concurrent.
  • user-medium: 2 workers, concurrency 20 each (IO-heavy). Total: 40 concurrent.
  • outbound-webhooks: 1 worker, concurrency 50 (highly IO-bound). Total: 50 concurrent.
  • batch-fast: 1 worker, concurrency 5 (IO + some CPU). Total: 5 concurrent.
  • batch-slow: 2 workers, concurrency 1 each (CPU + memory heavy). Total: 2 concurrent.

Reasoning:

  • IO-bound jobs benefit from high concurrency (Node async)
  • CPU-bound jobs limit to ~1 per CPU core to avoid contention
  • Memory-heavy jobs: low concurrency, dedicated workers

Total worker resources:

  • 6 small workers (1 vCPU, 512MB) for user-fast, user-medium, outbound-webhooks, batch-fast
  • 2 medium workers (2 vCPU, 1GB) for batch-slow
  • ~$50-80/month on Fly.io / Railway
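One way to keep these sizing decisions in code rather than tribal knowledge; a sketch where replicas maps to machine count on the worker host (an assumption about the deploy target):

// Sizing as data: replicas = worker processes, concurrency = parallel jobs per process
export const WORKER_SIZING = {
  'user-fast':         { replicas: 2, concurrency: 10 }, // IO + light CPU
  'user-medium':       { replicas: 2, concurrency: 20 }, // IO-heavy
  'outbound-webhooks': { replicas: 1, concurrency: 50 }, // highly IO-bound
  'batch-fast':        { replicas: 1, concurrency: 5 },  // IO + some CPU
  'batch-slow':        { replicas: 2, concurrency: 1 },  // CPU + memory heavy
} as const;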

Retry & DLQ

Backoff strategy (exponential + jitter):

await queue.add('job-name', data, {
  attempts: 5,
  backoff: {
    type: 'exponential',
    delay: 5000, // 5s base; BullMQ doubles this delay on each retry
  },
  // Note: BullMQ's built-in exponential backoff is deterministic (no jitter);
  // register a custom backoff strategy for jitter (see the sketch below).
});
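BullMQ lets a worker register a custom backoff strategy, and jobs opt in with type: 'custom'. A minimal sketch adding jitter on an exponential base (the ×5 multiplier mirrors the Stripe schedule above; handler and queue names are illustrative):

import { Worker } from 'bullmq';
import { redis } from './redis';

// Custom backoff: exponential base with 0.5x-1.5x randomization so a burst of
// failures doesn't retry in lockstep and hammer the downstream (sketch)
// processStripeJob is your existing handler (illustrative name)
const stripeWorker = new Worker('user-medium', processStripeJob, {
  connection: redis,
  settings: {
    backoffStrategy: (attemptsMade: number) => {
      const base = 5_000 * 5 ** (attemptsMade - 1); // 5s, 25s, 125s, ...
      return Math.round(base * (0.5 + Math.random()));
    },
  },
});

// Jobs that want the jittered strategy:
await queue.add('process-stripe-event', data, {
  attempts: 6,
  backoff: { type: 'custom' },
});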

DLQ pattern:

  • BullMQ's failed job state is your DLQ
  • Don't auto-delete failed jobs
  • Build admin tool to inspect + replay

Admin DLQ tool:

// /admin/dlq
GET /admin/dlq?queue=user-fast&since=2026-04-25
// Returns: list of failed jobs with id, name, lastError, failedAt

POST /admin/dlq/replay
  body: { jobId: 'abc123' }
// Re-enqueues to original queue

POST /admin/dlq/replay-bulk
  body: { queue: 'user-fast', failedSince: '2026-04-28', errorPattern: 'TimeoutError' }
// Bulk replay matching pattern

DELETE /admin/dlq/:jobId
// Permanently delete
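The replay endpoints are thin wrappers over BullMQ's own job APIs. A minimal sketch (route wiring, auth, and error-pattern matching omitted):

import { Queue } from 'bullmq';

// List failed jobs: BullMQ keeps them in the 'failed' state until removed
export async function listFailed(queue: Queue, start = 0, end = 99) {
  const jobs = await queue.getFailed(start, end);
  return jobs.map((job) => ({
    id: job.id,
    name: job.name,
    lastError: job.failedReason,
    failedAt: job.finishedOn, // timestamp of the final failed attempt
  }));
}

// Replay one job: Job#retry() moves it from 'failed' back to 'wait'
export async function replayJob(queue: Queue, jobId: string) {
  const job = await queue.getJob(jobId);
  if (!job) throw new Error(`No such job: ${jobId}`);
  await job.retry();
}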

Alerts:

  • DLQ size >0 for 'must-not-lose' queues (emails, reports, webhooks, Stripe) → page within 30 min
  • DLQ size >50 for any queue → Slack alert
  • DLQ growth rate >10/hour → Slack alert (something systemic)
  • Oldest-failed-job-age >24h → daily digest

Idempotency Pattern

Key strategy by job type listed above. Critical patterns:

// Use BullMQ's jobId for queue-level dedup
await queue.add('job-name', data, {
  jobId: idempotencyKey,  // BullMQ treats a duplicate jobId as the same job (deduped, not re-enqueued)
});

// Defense in depth: handler also idempotent at DB layer
async function processEmail(jobData) {
  const dedupKey = `email:${jobData.userId}:${jobData.templateId}:${jobData.eventId}`;
  const alreadySent = await db.email_sent.findUnique({ where: { dedupKey } });
  if (alreadySent) return { status: 'duplicate' };

  // Send email
  const result = await sendgrid.send(...);

  // Record success. The lookup above is a fast path; the real race guard is a
  // unique constraint on dedupKey, so a concurrent duplicate fails this insert.
  await db.email_sent.create({ data: { dedupKey, sendgridId: result.id } });

  return { status: 'sent' };
}

Observability

Per queue, track:

  • queue.depth (gauge: pending jobs)
  • queue.age.oldest (gauge: how stale is the oldest pending job)
  • queue.processed.rate (counter: completed/sec)
  • queue.failed.rate (counter: failed/sec)
  • queue.duration (histogram: job processing time)
  • queue.wait.duration (histogram: time from enqueue to start)
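Depth and age-of-oldest are cheap to poll straight from BullMQ. A minimal sampling sketch, with a generic gauge function standing in for your metrics client:

import { Queue } from 'bullmq';

// Poll one queue and emit gauges (sketch; gauge() is a placeholder)
export async function sampleQueue(queue: Queue, gauge: (name: string, value: number) => void) {
  const counts = await queue.getJobCounts('waiting', 'delayed', 'failed');
  gauge(`queue.depth.${queue.name}`, counts.waiting);
  gauge(`queue.dlq.size.${queue.name}`, counts.failed);

  // Oldest waiting job: getWaiting(0, 0) returns the head of the wait list
  const [oldest] = await queue.getWaiting(0, 0);
  gauge(`queue.age.oldest.${queue.name}`, oldest ? Date.now() - oldest.timestamp : 0);
}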

Alert rules:

  • queue.age.oldest for user-fast >30s → page (user-waiting backlog)
  • queue.depth for user-medium >500 sustained 10min → Slack (capacity issue)
  • queue.failed.rate >5% → Slack (something's wrong)
  • queue.depth for batch-slow >2000 during month-end → expected, no alert
  • Worker process not consuming for >5 min → page (worker died)

Dashboard:

  • Per-queue depth + age over last 24h
  • Top 5 failing job types
  • DLQ size trends
  • Worker resource utilization

Implementation Skeleton

/services/queue/
  src/
    queues.ts             (queue configurations + connections)
    types.ts              (job data type definitions)
  package.json

/services/worker/
  src/
    workers/
      user-fast-worker.ts
      user-medium-worker.ts
      outbound-webhooks-worker.ts
      batch-fast-worker.ts
      batch-slow-worker.ts
    handlers/
      send-email.ts
      generate-thumbnail.ts
      generate-pdf-report.ts
      deliver-webhook.ts
      sync-analytics.ts
      process-stripe-event.ts
    metrics.ts
    server.ts             (worker process bootstrap)
  Dockerfile
  fly.toml

/services/api/  (your existing)
  uses queues.ts to ENQUEUE jobs
  doesn't process them

Worker bootstrap pattern:

// /services/worker/src/server.ts
import { Worker } from 'bullmq';
import { redis } from './redis';
import { metrics } from './metrics';
import { logger } from './logger'; // assumes a shared logger module
import { handleEmail } from './handlers/send-email';
import { handleThumbnail } from './handlers/generate-thumbnail';
// ... other handlers

const userFastWorker = new Worker(
  'user-fast',
  async (job) => {
    switch (job.name) {
      case 'send-email': return handleEmail(job.data);
      case 'generate-thumbnail': return handleThumbnail(job.data);
      default: throw new Error(`Unknown job ${job.name}`);
    }
  },
  {
    connection: redis,
    concurrency: 10,
  }
);

// ... similar for other queues

userFastWorker.on('failed', (job, err) => {
  logger.error({ jobId: job?.id, error: err }, 'Job failed');
  metrics.increment('queue.failed', { queue: 'user-fast', name: job?.name });
});

userFastWorker.on('completed', (job, result) => {
  metrics.increment('queue.completed', { queue: 'user-fast', name: job.name });
});
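Deploys on Fly.io / Railway replace worker machines, so shut down gracefully: stop taking new jobs and let in-flight ones finish, otherwise every deploy leaves jobs marked stalled. A sketch using BullMQ's worker.close():

// Graceful shutdown on SIGTERM (sketch; list all workers here)
const allWorkers = [userFastWorker /* , userMediumWorker, ... */];

process.on('SIGTERM', async () => {
  await Promise.all(allWorkers.map((w) => w.close())); // waits for in-flight jobs
  process.exit(0);
});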

Scaling Boundaries

This architecture handles:

  • Up to ~500K jobs/day comfortably
  • Up to ~2K/sec sustained
  • 5x your current peak month-end volume

Migration triggers (>3y from now):

  • Total volume >1M jobs/day → consider sharding queues across Redis Cluster
  • Need workflow orchestration (multi-step jobs with state) → migrate to Inngest or Temporal
  • Need effectively-once workflow execution (BullMQ is at-least-once) → migrate to Temporal
  • Cross-region replication needs → managed services (AWS SQS, Google Pub/Sub)

What This Architecture Won't Solve

  • Won't handle exactly-once delivery. BullMQ is at-least-once; idempotency at handler is your defense.
  • Won't enforce ordering across queues. Within a queue, FIFO. Across queues, no ordering.
  • Won't replace workflow engines. Long-running multi-step (with checkpoints, compensations) needs Inngest/Temporal.
  • Won't compensate for poor handler design. If handlers aren't idempotent, retries cause data corruption — queue can't fix.
  • Won't auto-scale workers. You need to monitor + manually scale (or set up auto-scaling separately).

Migration from Existing

Week 1: Infrastructure
  • Set up dedicated worker service on Fly.io / Railway
  • BullMQ queues defined
  • Move existing image thumbnail BullMQ to new structure
Week 2: Email migration
  • Replace synchronous Sendgrid with email queue + worker
  • Idempotency keys at enqueue
  • Test in staging — verify no email duplicates
Week 3: Stripe webhook migration
  • Move Stripe webhook processing to user-medium queue
  • Receiver does sig verify + enqueue, then returns 200 immediately (Receiver pattern from webhook-handler-architect)
Week 4: Reports + analytics
  • Move PDF report generation to batch-slow
  • Move analytics sync to batch-fast cron schedule
Week 5: Webhook delivery to customers
  • Build outbound-webhooks queue with retry
  • Migrate from no-retry direct send
  • DLQ + customer notification on permanent failure
Week 6: Observability + DLQ tools
  • Datadog dashboard
  • Admin DLQ replay tool
  • Alert configuration

Maintenance Cadence

Weekly:

  • DLQ review per queue. Patterns?
  • Queue depth + age trends — anything degrading?

Monthly:

  • Worker resource utilization. Time to scale up/down?
  • Job duration trends — anything getting slower?
  • Cost review (Redis + worker hosts)

Quarterly:

  • Architecture audit. New job types added? Queue partitioning still right?
  • Scale test: replay 3 days of historical jobs to validate capacity

Key Takeaways

  • 5 queues by job class (priority + latency), not by domain. user-fast / user-medium / outbound-webhooks / batch-fast / batch-slow.
  • BullMQ for your scale + team. Inngest/Temporal only when complex workflows justify.
  • Move workers off Vercel to dedicated infra. Serverless is wrong for long-running queue workers.
  • Idempotency keys at job creation + DB-level dedup at handler. Defense in depth.
  • DLQ inspection tool from day one. Build admin replay UI; don't wait until you need it at 3am.
  • Migrate in 6 weeks. Email first (highest current pain), then Stripe webhooks, then reports + analytics + outbound webhooks.

Common use cases

  • Engineer adding async processing to a synchronous service
  • Backend lead consolidating ad-hoc workers into structured queue infrastructure
  • Solo founder hitting 'queue gets stuck' problems
  • Team migrating from SQS to BullMQ (or any queue migration)
  • Architect designing for high-throughput batch processing
  • Engineer evaluating Inngest / Trigger.dev / Temporal for workflow orchestration

Best AI model for this

Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Pro tips

  • One queue per job class (priority, latency tolerance), not per business domain. 'high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications.'
  • Idempotency keys at job creation. If retried, same key = same job. Prevents duplicate processing.
  • Priority lanes prevent slow jobs from blocking fast ones. Don't put a 30-min report job in the same queue as 100ms user-waiting jobs.
  • Retry with exponential backoff + jitter. Hammer-retry causes downstream cascades.
  • DLQ inspection tool from day one. 'Why is X stuck?' must be answerable in 30 seconds.
  • Job concurrency != worker count. 1 worker can process 50 concurrent IO-bound jobs (async). 1 worker per CPU-bound job.
  • Monitor queue depth + age, not just job rate. A queue with stable depth + growing oldest-job age means workers are falling behind.

Customization tips

  • List ALL job types with realistic volume + latency. The queue partitioning depends on knowing the full job mix.
  • Be honest about job shape (IO-bound vs CPU-bound). Concurrency settings differ dramatically; CPU-bound jobs need worker isolation.
  • Specify reliability per job type. 'Must not lose' vs 'OK to drop' shapes retry policy + DLQ alerts.
  • Mention infrastructure constraints (existing Redis, AWS-native, serverless-only). Queue choice depends on what you can deploy.
  • Specify scale targets at multiple time horizons. 1y vs 3y projections shape architecture (BullMQ at 100K/day vs 10M/day differs).
  • Use the Workflow Orchestration Mode variant if your jobs are multi-step processes (signup → verify → onboard → trigger emails) — Temporal/Inngest beats raw queues for that.

Variants

BullMQ / Redis Mode

For Node.js + Redis stacks — emphasizes BullMQ patterns, Redis Cluster scaling.

AWS SQS Mode

For AWS-native — emphasizes SQS standard vs FIFO, Lambda integration, DLQ patterns.

Workflow Orchestration Mode

For long-running multi-step flows — Inngest / Trigger.dev / Temporal architecture for stateful workflows.

Migration Mode

For migrating from one queue infra to another — emphasizes parallel-run, cutover, rollback.

Frequently asked questions

How do I use the Background Job Queue Designer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Background Job Queue Designer?

Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Can I customize the Background Job Queue Designer prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: one queue per job class (priority, latency tolerance) rather than per business domain ('high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications'), and idempotency keys at job creation, so a retried job with the same key is deduplicated rather than double-processed.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals