
⚡ Promptolis Original · Coding & Development

⚙️ Background Job Queue Designer

Designs your queue + worker architecture: which jobs in which queue, retry policy, priority lanes, the dead-letter queue — picking the right pattern between BullMQ / Sidekiq / SQS / Inngest for YOUR scale.

⏱️ 5 min to set up 🤖 ~110 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most teams pick a queue library first, then design backwards. Wrong order. This Original picks the right queue infrastructure for YOUR scale, then designs the architecture: queue partitioning, worker pools, retry policies, DLQ, and the priority-lane pattern that keeps user-waiting jobs fast.

Outputs the complete design: queue inventory (which queues, why), worker pool sizing, per-job-type retry + idempotency, priority strategy, DLQ pattern, observability, and the 'when to migrate to a different queue infra' triggers.

Calibrated to 2026 queue reality: BullMQ for Node, Sidekiq for Ruby, Celery for Python, SQS for AWS-native, Inngest for serverless-friendly. Each has trade-offs; the Original picks based on your actual constraints.

Includes the 5 anti-patterns most teams fall into: putting too much in one queue, lacking priority lanes, skipping idempotency keys, ignoring the DLQ, and triggering retry storms during outages.

The prompt

Promptolis Original · Copy-ready
<role>
You are a queue + async processing architect with 7+ years designing job systems on BullMQ, Sidekiq, Celery, SQS, Inngest, Temporal. You have shipped 30+ queue architectures handling millions of jobs/day combined. You know which patterns scale + which break. You are direct. You will tell a builder their single-queue design will hit head-of-line blocking, that their retry policy is causing storms, or that they need workflow orchestration not raw queues for multi-step processes. You refuse to recommend more queues as a generic fix — fewer right-shaped queues beats many fragmented ones.
</role>

<principles>
1. Queues by job class (priority, latency), not domain.
2. Idempotency keys at job creation.
3. Priority lanes. Don't mix latencies.
4. Exponential backoff + jitter. Avoid storms.
5. DLQ inspection day one.
6. Concurrency vs worker count: distinct concepts.
7. Monitor queue depth + age, not just rate.
</principles>

<input>
<job-types>{the kinds of async work — emails, reports, image processing, etc.}</job-types>
<volume>{jobs/day, peak/day, expected growth}</volume>
<latency-tolerance>{per job type — instant / minutes / hours / batch overnight}</latency-tolerance>
<job-shape>{IO-bound vs CPU-bound, average duration, max duration}</job-shape>
<existing-state>{nothing / ad-hoc setTimeout / partial queue / mature but messy}</existing-state>
<infrastructure>{Redis available? AWS? GCP? deployment model: containers / serverless?}</infrastructure>
<scale-targets>{1y, 3y job volume projections}</scale-targets>
<reliability-needs>{must-not-lose vs OK-to-drop-occasionally}</reliability-needs>
<team>{who maintains; queue expertise level}</team>
</input>

<output-format>
# Queue Architecture: [system]
## Queue Infrastructure Choice
Which library/service for YOUR stack + scale. Why this not alternatives.
## Queue Inventory
List of queues. For each: name, job class, priority, expected throughput, worker count.
## Per-Job-Type Specification
For each job type: which queue, idempotency key strategy, retry policy, expected duration, failure handling.
## Priority Strategy
How latency-sensitive jobs stay fast. Lane separation.
## Worker Pool Sizing
Number of workers per queue. Concurrency per worker. CPU + memory per worker.
## Retry & DLQ
Backoff policy, max attempts, DLQ structure, alert rules.
## Idempotency Pattern
How to ensure jobs don't double-process. Key strategy.
## Observability
Metrics: queue depth, age of oldest, throughput, failure rate. Alerts.
## Implementation Skeleton
File structure + worker code patterns.
## Scaling Boundaries
When this architecture hits limits. Triggers for migration.
## What This Architecture Won't Solve
Honest limits.
## Migration from Existing
If existing system: parallel-run + cutover plan.
## Maintenance Cadence
When to revisit, audit, scale.
## Key Takeaways
4-6 bullets — for the team's playbook.
</output-format>

<auto-intake>
If input incomplete: ask for job types, volume, latency tolerance, job shape, existing state, infrastructure, scale targets, reliability, team.
</auto-intake>

Now, design the queue architecture:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<job-types>(1) Send transactional email (welcome, receipt, password reset). (2) Generate PDF reports (monthly customer reports — bulk on month-end). (3) Image thumbnail processing (after upload). (4) Webhook delivery to customer endpoints (we send events to OUR customers' webhooks). (5) Sync to analytics (batch every 15 min). (6) Stripe webhook processing (incoming events from Stripe, processed async).</job-types>
<volume>Emails: ~1000/day, peak 5K/day. Reports: ~500 jobs on month-end day, 1 job per customer. Images: ~2K/day. Webhook delivery: ~10K/day. Analytics sync: 96/day (every 15 min). Stripe webhooks: 5K/day, peak 20K on first-of-month.</volume>
<latency-tolerance>Emails: <30s for transactional (user expects). Reports: 24h OK. Images: <10s (user uploads, expects to see thumbnail soon). Webhook delivery: <1 min ideal. Analytics sync: scheduled, ±1 min OK. Stripe webhooks: <30s for customer-visible state changes.</latency-tolerance>
<job-shape>Emails: IO-bound, ~500ms each. Reports: CPU + IO, 30s-5min each. Images: CPU-bound (thumbnail generation), ~3s each. Webhook delivery: IO-bound, ~1s. Analytics: IO-bound, ~30s per batch. Stripe webhooks: IO-bound, ~500ms.</job-shape>
<existing-state>Mixed: emails go through Sendgrid directly synchronously (sometimes timeout). Reports use a setTimeout cron + sync code. Images use a hand-rolled BullMQ setup. Webhooks have no retry. Analytics sync is ad-hoc cron. Stripe webhooks are sync (sometimes timeout).</existing-state>
<infrastructure>Node.js 20 + Postgres + Redis (Upstash). Vercel for API. Considering moving workers off Vercel since they're serverless-unfriendly.</infrastructure>
<scale-targets>2x growth in 1y. 5x in 3y. So peak ~100K jobs/day at 3y.</scale-targets>
<reliability-needs>Emails: must not lose (user expects). Reports: must not lose (compliance). Images: OK to retry user-side if lost. Webhook delivery: must deliver (we promise). Analytics: OK to skip if 1 batch. Stripe webhooks: must process (financial data).</reliability-needs>
<team>4 backend engineers. 2 have BullMQ experience. None have Temporal/Inngest experience.</team>
🤖 Output

Queue Architecture: B2B SaaS Multi-Job-Type System

Queue Infrastructure Choice

BullMQ on dedicated Redis (Upstash + Fly.io workers). Reasons:

  • Team has BullMQ experience; switching costs are real
  • Volume (~20K/day, ~100K target) is well within BullMQ capability
  • Redis already in stack; no new infrastructure
  • Move workers off Vercel onto Fly.io / Railway / dedicated long-running infra

Alternatives considered:

  • AWS SQS: would work, but you don't have AWS-native infra; adds vendor lock-in
  • Inngest: great for workflow orchestration but overkill for your job types (mostly single-step, not multi-step workflows)
  • Temporal: great for complex workflows, but learning curve + complexity not justified for your job mix

Reconsider migration to Inngest or Temporal IF you start needing complex multi-step workflows (saga pattern, long-running approvals, complex state machines). Right now: BullMQ.

Queue Inventory

Queue                  Priority    Latency Target   Avg Throughput
─────────────────────────────────────────────────────────────────
user-fast              HIGH        <10s             ~3K/day
  (image thumbnails, password reset emails)

user-medium            MEDIUM      <60s             ~6K/day
  (transactional emails, Stripe webhook processing)

outbound-webhooks      MEDIUM      <60s             ~10K/day
  (delivery to customer webhook URLs — separate due to potentially-slow customer endpoints)

batch-fast             LOW         <15min           ~100/day
  (analytics sync, small reports)

batch-slow             LOW         <24h             ~500-2K/day spike
  (monthly customer reports, end-of-month bulk)

DLQ                    n/a         n/a              accumulator

Why these splits:

  • user-fast keeps user-waiting jobs out of the same lane as long-running reports
  • outbound-webhooks separated because customer endpoints can be slow (10s timeout) — don't block other jobs
  • batch-slow handles month-end spikes without affecting daily ops
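A minimal queues.ts sketch for this inventory, assuming an ioredis connection (BullMQ requires maxRetriesPerRequest: null on connections; queue names are taken from the table above):

// services/queue/src/queues.ts: one Queue instance per lane (sketch)
import { Queue } from 'bullmq';
import IORedis from 'ioredis';

// Shared connection; maxRetriesPerRequest: null is mandatory for BullMQ
const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

export const userFast = new Queue('user-fast', { connection });
export const userMedium = new Queue('user-medium', { connection });
export const outboundWebhooks = new Queue('outbound-webhooks', { connection });
export const batchFast = new Queue('batch-fast', { connection });
export const batchSlow = new Queue('batch-slow', { connection });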

Per-Job-Type Specification

Transactional emails
  • Queue: user-fast for password resets (user is waiting); user-medium for receipts (less urgent)
  • Idempotency key: email:${userId}:${templateId}:${eventId} — same key = same email, deduped
  • Retry policy: 3 attempts, 1min/5min/15min backoff
  • Failure: DLQ; daily review by support team
Image thumbnails
  • Queue: user-fast
  • Idempotency key: thumb:${imageId}:${size} — keyed by image + size variant
  • Retry policy: 2 attempts, immediate then 30s backoff
  • Failure: DLQ; user can retrigger by re-uploading
  • Note: CPU-bound — limit concurrency per worker (1-2 per CPU)
PDF reports (monthly bulk)
  • Queue: batch-slow
  • Idempotency key: report:${customerId}:${period} — once per customer-period
  • Retry policy: 5 attempts, 5min/15min/30min/1h/2h backoff
  • Failure: DLQ; alert support
  • Note: memory-heavy; size workers accordingly (~512MB each)
Webhook delivery (to customer URLs)
  • Queue: outbound-webhooks
  • Idempotency key: webhook:${eventId} — enforced by us, not by customer
  • Retry policy: 6 retries over ~30h: 1m, 5m, 15m, 1h, 4h, 24h backoff
  • Failure: DLQ + customer notification email + UI shows 'webhook failed'
  • Timeout: 10s per delivery attempt (customer endpoints may be slow; see the handler sketch after this list)
  • Note: customer-facing — clear error reporting matters
Analytics sync
  • Queue: batch-fast
  • Idempotency key: analytics-sync:${batchTimestamp} — one per 15-min window
  • Retry policy: 3 attempts
  • Failure: DLQ + alert; reconcile with next batch
  • Schedule: cron at :00, :15, :30, :45 of each hour
Stripe webhook processing
  • Queue: user-medium
  • Idempotency key: stripe:${eventId} — Stripe's event.id
  • Retry policy: 6 attempts, exponential ×5 backoff: 5s/25s/125s/625s/52min
  • Failure: DLQ + alert (financial data — high priority)
  • Note: event handlers idempotent at DB level (UPSERT, unique constraint)
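A sketch of the outbound-webhook delivery handler enforcing the 10s per-attempt timeout with Node 20's built-in fetch; the URL and payload field names are illustrative:

// deliver-webhook.ts (sketch): fail fast on slow customer endpoints so the
// job falls through to BullMQ's retry schedule instead of hogging a worker slot
export async function deliverWebhook(data: { url: string; eventId: string; payload: unknown }) {
  const res = await fetch(data.url, {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'x-event-id': data.eventId },
    body: JSON.stringify(data.payload),
    signal: AbortSignal.timeout(10_000), // 10s cap per delivery attempt
  });
  // Treat non-2xx as failure so BullMQ retries on the backoff schedule
  if (!res.ok) throw new Error(`Webhook delivery failed: HTTP ${res.status}`);
  return { delivered: true, status: res.status };
}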

Priority Strategy

BullMQ supports priority within a queue + multiple queues. Use both:

Across queues: each BullMQ Worker binds to a single queue, so give every queue its own dedicated worker pool. user-fast then always has reserved capacity and never waits behind batch work.

Within queue: for user-fast and user-medium, use BullMQ's priority option — password reset (priority 1) before receipt (priority 5).

await emailQueue.add('send-email', { ... }, { priority: 1 }); // password reset
await emailQueue.add('send-email', { ... }, { priority: 5 }); // receipt

Lower number = higher priority.

Worker Pool Sizing

Per queue:

  • user-fast: 2 workers, concurrency 10 each (IO + light CPU). Total: 20 jobs concurrent.
  • user-medium: 2 workers, concurrency 20 each (IO-heavy). Total: 40 concurrent.
  • outbound-webhooks: 1 worker, concurrency 50 (highly IO-bound). Total: 50 concurrent.
  • batch-fast: 1 worker, concurrency 5 (IO + some CPU). Total: 5 concurrent.
  • batch-slow: 2 workers, concurrency 1 each (CPU + memory heavy). Total: 2 concurrent.

Reasoning:

  • IO-bound jobs benefit from high concurrency (Node async)
  • CPU-bound jobs limit to ~1 per CPU core to avoid contention
  • Memory-heavy jobs: low concurrency, dedicated workers

Total worker resources:

  • 6 small workers (1 vCPU, 512MB) for user-fast, user-medium, outbound-webhooks, batch-fast
  • 2 medium workers (2 vCPU, 1GB) for batch-slow
  • ~$50-80/month on Fly.io / Railway
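One way to keep these sizing decisions in code rather than tribal knowledge; a sketch where replicas maps to machine count on the worker host (an assumption about the deploy target):

// Sizing as data: replicas = worker processes, concurrency = parallel jobs per process
export const WORKER_SIZING = {
  'user-fast':         { replicas: 2, concurrency: 10 }, // IO + light CPU
  'user-medium':       { replicas: 2, concurrency: 20 }, // IO-heavy
  'outbound-webhooks': { replicas: 1, concurrency: 50 }, // highly IO-bound
  'batch-fast':        { replicas: 1, concurrency: 5 },  // IO + some CPU
  'batch-slow':        { replicas: 2, concurrency: 1 },  // CPU + memory heavy
} as const;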

Retry & DLQ

Backoff strategy (exponential + jitter):

await queue.add('job-name', data, {
  attempts: 5,
  backoff: {
    type: 'exponential',
    delay: 5000, // 5s base; BullMQ doubles this delay on each retry
  },
  // Note: BullMQ's built-in exponential backoff is deterministic (no jitter);
  // register a custom backoff strategy for jitter (see the sketch below).
});
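BullMQ lets a worker register a custom backoff strategy, and jobs opt in with type: 'custom'. A minimal sketch adding jitter on an exponential base (the ×5 multiplier mirrors the Stripe schedule above; handler and queue names are illustrative):

import { Worker } from 'bullmq';
import { redis } from './redis';

// Custom backoff: exponential base with 0.5x-1.5x randomization so a burst of
// failures doesn't retry in lockstep and hammer the downstream (sketch)
// processStripeJob is your existing handler (illustrative name)
const stripeWorker = new Worker('user-medium', processStripeJob, {
  connection: redis,
  settings: {
    backoffStrategy: (attemptsMade: number) => {
      const base = 5_000 * 5 ** (attemptsMade - 1); // 5s, 25s, 125s, ...
      return Math.round(base * (0.5 + Math.random()));
    },
  },
});

// Jobs that want the jittered strategy:
await queue.add('process-stripe-event', data, {
  attempts: 6,
  backoff: { type: 'custom' },
});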

DLQ pattern:

  • BullMQ's failed job state is your DLQ
  • Don't auto-delete failed jobs
  • Build admin tool to inspect + replay

Admin DLQ tool:

// /admin/dlq
GET /admin/dlq?queue=user-fast&since=2026-04-25
// Returns: list of failed jobs with id, name, lastError, failedAt

POST /admin/dlq/replay
  body: { jobId: 'abc123' }
// Re-enqueues to original queue

POST /admin/dlq/replay-bulk
  body: { queue: 'user-fast', failedSince: '2026-04-28', errorPattern: 'TimeoutError' }
// Bulk replay matching pattern

DELETE /admin/dlq/:jobId
// Permanently delete
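The replay endpoints are thin wrappers over BullMQ's own job APIs. A minimal sketch (route wiring, auth, and error-pattern matching omitted):

import { Queue } from 'bullmq';

// List failed jobs: BullMQ keeps them in the 'failed' state until removed
export async function listFailed(queue: Queue, start = 0, end = 99) {
  const jobs = await queue.getFailed(start, end);
  return jobs.map((job) => ({
    id: job.id,
    name: job.name,
    lastError: job.failedReason,
    failedAt: job.finishedOn, // timestamp of the final failed attempt
  }));
}

// Replay one job: Job#retry() moves it from 'failed' back to 'wait'
export async function replayJob(queue: Queue, jobId: string) {
  const job = await queue.getJob(jobId);
  if (!job) throw new Error(`No such job: ${jobId}`);
  await job.retry();
}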

Alerts:

  • DLQ size >0 for 'must-not-lose' queues (emails, reports, webhooks, Stripe) → page within 30 min
  • DLQ size >50 for any queue → Slack alert
  • DLQ growth rate >10/hour → Slack alert (something systemic)
  • Oldest-failed-job-age >24h → daily digest

Idempotency Pattern

Key strategy by job type listed above. Critical patterns:

// Use BullMQ's jobId for queue-level dedup
await queue.add('job-name', data, {
  jobId: idempotencyKey,  // BullMQ treats a duplicate jobId as the same job (deduped, not re-enqueued)
});

// Defense in depth: handler also idempotent at DB layer
async function processEmail(jobData) {
  const dedupKey = `email:${jobData.userId}:${jobData.templateId}:${jobData.eventId}`;
  const alreadySent = await db.email_sent.findUnique({ where: { dedupKey } });
  if (alreadySent) return { status: 'duplicate' };

  // Send email
  const result = await sendgrid.send(...);

  // Record success. The lookup above is a fast path; the real race guard is a
  // unique constraint on dedupKey, so a concurrent duplicate fails this insert.
  await db.email_sent.create({ data: { dedupKey, sendgridId: result.id } });

  return { status: 'sent' };
}

Observability

Per queue, track:

  • queue.depth (gauge: pending jobs)
  • queue.age.oldest (gauge: how stale is the oldest pending job)
  • queue.processed.rate (counter: completed/sec)
  • queue.failed.rate (counter: failed/sec)
  • queue.duration (histogram: job processing time)
  • queue.wait.duration (histogram: time from enqueue to start)
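Depth and age-of-oldest are cheap to poll straight from BullMQ. A minimal sampling sketch, with a generic gauge function standing in for your metrics client:

import { Queue } from 'bullmq';

// Poll one queue and emit gauges (sketch; gauge() is a placeholder)
export async function sampleQueue(queue: Queue, gauge: (name: string, value: number) => void) {
  const counts = await queue.getJobCounts('waiting', 'delayed', 'failed');
  gauge(`queue.depth.${queue.name}`, counts.waiting);
  gauge(`queue.dlq.size.${queue.name}`, counts.failed);

  // Oldest waiting job: getWaiting(0, 0) returns the head of the wait list
  const [oldest] = await queue.getWaiting(0, 0);
  gauge(`queue.age.oldest.${queue.name}`, oldest ? Date.now() - oldest.timestamp : 0);
}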

Alert rules:

  • queue.age.oldest for user-fast >30s → page (user-waiting backlog)
  • queue.depth for user-medium >500 sustained 10min → Slack (capacity issue)
  • queue.failed.rate >5% → Slack (something's wrong)
  • queue.depth for batch-slow >2000 during month-end → expected, no alert
  • Worker process not consuming for >5 min → page (worker died)

Dashboard:

  • Per-queue depth + age over last 24h
  • Top 5 failing job types
  • DLQ size trends
  • Worker resource utilization

Implementation Skeleton

/services/queue/
  src/
    queues.ts             (queue configurations + connections)
    types.ts              (job data type definitions)
  package.json

/services/worker/
  src/
    workers/
      user-fast-worker.ts
      user-medium-worker.ts
      outbound-webhooks-worker.ts
      batch-fast-worker.ts
      batch-slow-worker.ts
    handlers/
      send-email.ts
      generate-thumbnail.ts
      generate-pdf-report.ts
      deliver-webhook.ts
      sync-analytics.ts
      process-stripe-event.ts
    metrics.ts
    server.ts             (worker process bootstrap)
  Dockerfile
  fly.toml

/services/api/  (your existing)
  uses queues.ts to ENQUEUE jobs
  doesn't process them

Worker bootstrap pattern:

// /services/worker/src/server.ts
import { Worker } from 'bullmq';
import { redis } from './redis';
import { metrics } from './metrics';
import { logger } from './logger'; // assumes a shared logger module
import { handleEmail } from './handlers/send-email';
import { handleThumbnail } from './handlers/generate-thumbnail';
// ... other handlers

const userFastWorker = new Worker(
  'user-fast',
  async (job) => {
    switch (job.name) {
      case 'send-email': return handleEmail(job.data);
      case 'generate-thumbnail': return handleThumbnail(job.data);
      default: throw new Error(`Unknown job ${job.name}`);
    }
  },
  {
    connection: redis,
    concurrency: 10,
  }
);

// ... similar for other queues

userFastWorker.on('failed', (job, err) => {
  logger.error({ jobId: job?.id, error: err }, 'Job failed');
  metrics.increment('queue.failed', { queue: 'user-fast', name: job?.name });
});

userFastWorker.on('completed', (job, result) => {
  metrics.increment('queue.completed', { queue: 'user-fast', name: job.name });
});
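Deploys on Fly.io / Railway replace worker machines, so shut down gracefully: stop taking new jobs and let in-flight ones finish, otherwise every deploy leaves jobs marked stalled. A sketch using BullMQ's worker.close():

// Graceful shutdown on SIGTERM (sketch; list all workers here)
const allWorkers = [userFastWorker /* , userMediumWorker, ... */];

process.on('SIGTERM', async () => {
  await Promise.all(allWorkers.map((w) => w.close())); // waits for in-flight jobs
  process.exit(0);
});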

Scaling Boundaries

This architecture handles:

  • Up to ~500K jobs/day comfortably
  • Up to ~2K/sec sustained
  • 5x your current peak month-end volume

Migration triggers (>3y from now):

  • Total volume >1M jobs/day → consider sharding queues across Redis Cluster
  • Need workflow orchestration (multi-step jobs with state) → migrate to Inngest or Temporal
  • Need effectively-once workflow execution (BullMQ is at-least-once) → migrate to Temporal
  • Cross-region replication needs → managed services (AWS SQS, Google Pub/Sub)

What This Architecture Won't Solve

  • Won't handle exactly-once delivery. BullMQ is at-least-once; idempotency at handler is your defense.
  • Won't enforce ordering across queues. Within a queue, FIFO. Across queues, no ordering.
  • Won't replace workflow engines. Long-running multi-step (with checkpoints, compensations) needs Inngest/Temporal.
  • Won't compensate for poor handler design. If handlers aren't idempotent, retries cause data corruption — queue can't fix.
  • Won't auto-scale workers. You need to monitor + manually scale (or set up auto-scaling separately).

Migration from Existing

Week 1: Infrastructure
  • Set up dedicated worker service on Fly.io / Railway
  • BullMQ queues defined
  • Move existing image thumbnail BullMQ to new structure
Week 2: Email migration
  • Replace synchronous Sendgrid with email queue + worker
  • Idempotency keys at enqueue
  • Test in staging — verify no email duplicates
Week 3: Stripe webhook migration
  • Move Stripe webhook processing to user-medium queue
  • Receiver does sig verify + enqueue, then returns 200 immediately (Receiver pattern from webhook-handler-architect)
Week 4: Reports + analytics
  • Move PDF report generation to batch-slow
  • Move analytics sync to batch-fast cron schedule
Week 5: Webhook delivery to customers
  • Build outbound-webhooks queue with retry
  • Migrate from no-retry direct send
  • DLQ + customer notification on permanent failure
Week 6: Observability + DLQ tools
  • Datadog dashboard
  • Admin DLQ replay tool
  • Alert configuration

Maintenance Cadence

Weekly:

  • DLQ review per queue. Patterns?
  • Queue depth + age trends — anything degrading?

Monthly:

  • Worker resource utilization. Time to scale up/down?
  • Job duration trends — anything getting slower?
  • Cost review (Redis + worker hosts)

Quarterly:

  • Architecture audit. New job types added? Queue partitioning still right?
  • Scale test: replay 3 days of historical jobs to validate capacity

Key Takeaways

  • 5 queues by job class (priority + latency), not by domain. user-fast / user-medium / outbound-webhooks / batch-fast / batch-slow.
  • BullMQ for your scale + team. Inngest/Temporal only when complex workflows justify.
  • Move workers off Vercel to dedicated infra. Serverless is wrong for long-running queue workers.
  • Idempotency keys at job creation + DB-level dedup at handler. Defense in depth.
  • DLQ inspection tool from day one. Build admin replay UI; don't wait until you need it at 3am.
  • Migrate in 6 weeks. Email first (highest current pain), then Stripe webhooks, then reports + analytics + outbound webhooks.

Common use cases

  • Engineer adding async processing to a synchronous service
  • Backend lead consolidating ad-hoc workers into structured queue infrastructure
  • Solo founder hitting 'queue gets stuck' problems
  • Team migrating from SQS to BullMQ (or any queue migration)
  • Architect designing for high-throughput batch processing
  • Engineer evaluating Inngest / Trigger.dev / Temporal for workflow orchestration

Best AI model for this

Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Pro tips

  • One queue per job class (priority, latency tolerance), not per business domain. 'high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications.'
  • Idempotency keys at job creation. If retried, same key = same job. Prevents duplicate processing.
  • Priority lanes prevent slow jobs from blocking fast ones. Don't put a 30-min report job in the same queue as 100ms user-waiting jobs.
  • Retry with exponential backoff + jitter. Hammer-retry causes downstream cascades.
  • DLQ inspection tool from day one. 'Why is X stuck?' must be answerable in 30 seconds.
  • Job concurrency != worker count. 1 worker can process 50 concurrent IO-bound jobs (async). 1 worker per CPU-bound job.
  • Monitor queue depth + age, not just job rate. A queue with stable depth + growing oldest-job age means workers are falling behind.

Customization tips

  • List ALL job types with realistic volume + latency. The queue partitioning depends on knowing the full job mix.
  • Be honest about job shape (IO-bound vs CPU-bound). Concurrency settings differ dramatically; CPU-bound jobs need worker isolation.
  • Specify reliability per job type. 'Must not lose' vs 'OK to drop' shapes retry policy + DLQ alerts.
  • Mention infrastructure constraints (existing Redis, AWS-native, serverless-only). Queue choice depends on what you can deploy.
  • Specify scale targets at multiple time horizons. 1y vs 3y projections shape architecture (BullMQ at 100K/day vs 10M/day differs).
  • Use the Workflow Orchestration Mode variant if your jobs are multi-step processes (signup → verify → onboard → trigger emails) — Temporal/Inngest beats raw queues for that.

Variants

BullMQ / Redis Mode

For Node.js + Redis stacks — emphasizes BullMQ patterns, Redis Cluster scaling.

AWS SQS Mode

For AWS-native — emphasizes SQS standard vs FIFO, Lambda integration, DLQ patterns.

Workflow Orchestration Mode

For long-running multi-step flows — Inngest / Trigger.dev / Temporal architecture for stateful workflows.

Migration Mode

For migrating from one queue infra to another — emphasizes parallel-run, cutover, rollback.

Frequently asked questions

How do I use the Background Job Queue Designer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Background Job Queue Designer?

Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Can I customize the Background Job Queue Designer prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: one queue per job class (priority, latency tolerance) rather than per business domain ('high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications'), and idempotency keys at job creation, so a retried job with the same key is deduplicated rather than double-processed.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals