⚡ Promptolis Original · Coding & Development
⚙️ Background Job Queue Designer
Designs your queue + worker architecture: which jobs in which queue, retry policy, priority lanes, the dead-letter queue — picking the right pattern between BullMQ / Sidekiq / SQS / Inngest for YOUR scale.
Why this is epic
Most teams pick a queue library first, then design backwards. Wrong order. This Original picks the right queue infrastructure for YOUR scale, then designs the architecture: queue partitioning, worker pools, retry policies, DLQ, and the priority-lane pattern that keeps user-waiting jobs fast.
Outputs the complete design: queue inventory (which queues, why), worker pool sizing, per-job-type retry + idempotency, priority strategy, DLQ pattern, observability, and the 'when to migrate to a different queue infra' triggers.
Calibrated to 2026 queue reality: BullMQ for Node, Sidekiq for Ruby, Celery for Python, SQS for AWS-native, Inngest for serverless-friendly. Each has trade-offs; the Original picks based on your actual constraints.
Includes the 5 anti-patterns most teams fall into: putting too much in one queue, lacking priority lanes, not using idempotency keys, ignoring the DLQ, retry storms during outages.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<job-types>(1) Send transactional email (welcome, receipt, password reset). (2) Generate PDF reports (monthly customer reports — bulk on month-end). (3) Image thumbnail processing (after upload). (4) Webhook delivery to customer endpoints (we send events to OUR customers' webhooks). (5) Sync to analytics (batch every 15 min). (6) Stripe webhook processing (incoming events from Stripe, processed async).</job-types> <volume>Emails: ~1000/day, peak 5K/day. Reports: ~500 jobs on month-end day, 1 job per customer. Images: ~2K/day. Webhook delivery: ~10K/day. Analytics sync: 96/day (every 15 min). Stripe webhooks: 5K/day, peak 20K on first-of-month.</volume> <latency-tolerance>Emails: <30s for transactional (user expects). Reports: 24h OK. Images: <10s (user uploads, expects to see thumbnail soon). Webhook delivery: <1 min ideal. Analytics sync: scheduled, ±1 min OK. Stripe webhooks: <30s for customer-visible state changes.</latency-tolerance> <job-shape>Emails: IO-bound, ~500ms each. Reports: CPU + IO, 30s-5min each. Images: CPU-bound (thumbnail generation), ~3s each. Webhook delivery: IO-bound, ~1s. Analytics: IO-bound, ~30s per batch. Stripe webhooks: IO-bound, ~500ms.</job-shape> <existing-state>Mixed: emails go through Sendgrid directly synchronously (sometimes timeout). Reports use a setTimeout cron + sync code. Images use a hand-rolled BullMQ setup. Webhooks have no retry. Analytics sync is ad-hoc cron. Stripe webhooks are sync (sometimes timeout).</existing-state> <infrastructure>Node.js 20 + Postgres + Redis (Upstash). Vercel for API. Considering moving workers off Vercel since they're serverless-unfriendly.</infrastructure> <scale-targets>2x growth in 1y. 5x in 3y. So peak ~100K jobs/day at 3y.</scale-targets> <reliability-needs>Emails: must not lose (user expects). Reports: must not lose (compliance). Images: OK to retry user-side if lost. Webhook delivery: must deliver (we promise). Analytics: OK to skip if 1 batch. 
Stripe webhooks: must process (financial data).</reliability-needs> <team>4 backend engineers. 2 have BullMQ experience. None have Temporal/Inngest experience.</team>
Queue Architecture: B2B SaaS Multi-Job-Type System
Queue Infrastructure Choice
BullMQ on dedicated Redis (Upstash + Fly.io workers). Reasons:
- Team has BullMQ experience; switching costs are real
- Volume (~20K/day, ~100K target) is well within BullMQ capability
- Redis already in stack; no new infrastructure
- Move workers off Vercel onto Fly.io / Railway / dedicated long-running infra
Alternatives considered:
- AWS SQS: would work, but you don't have AWS-native infra; adds vendor lock-in
- Inngest: great for workflow orchestration but overkill for your job types (mostly single-step, not multi-step workflows)
- Temporal: great for complex workflows, but learning curve + complexity not justified for your job mix
Reconsider migration to Inngest or Temporal IF you start needing complex multi-step workflows (saga pattern, long-running approvals, complex state machines). Right now: BullMQ.
Queue Inventory
Queue               Priority   Latency Target   Avg Throughput
──────────────────────────────────────────────────────────────
user-fast           HIGH       <10s             ~3K/day
  (image thumbnails, password reset emails)
user-medium         MEDIUM     <60s             ~6K/day
  (transactional emails, Stripe webhook processing)
outbound-webhooks   MEDIUM     <60s             ~10K/day
  (delivery to customer webhook URLs — separate due to potentially-slow customer endpoints)
batch-fast          LOW        <15min           ~100/day
  (analytics sync, small reports)
batch-slow          LOW        <24h             ~500-2K/day spike
  (monthly customer reports, end-of-month bulk)
DLQ                 n/a        n/a              accumulator
Why these splits:
- user-fast keeps user-waiting jobs out of the same lane as long-running reports
- outbound-webhooks is separated because customer endpoints can be slow (10s timeout) — don't block other jobs
- batch-slow handles month-end spikes without affecting daily ops
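The queue inventory can be captured as data. A minimal sketch (the default job options below are illustrative assumptions, not prescribed values; the objects are shaped like BullMQ's `defaultJobOptions`, to be passed when constructing each `Queue`):

```typescript
// Queue inventory as data: names plus illustrative per-queue defaults.
// Attempts/backoff here are placeholders; tune per the per-job-type specs.
const QUEUE_DEFS = {
  'user-fast':         { attempts: 3, backoff: { type: 'exponential', delay: 1_000 } },
  'user-medium':       { attempts: 3, backoff: { type: 'exponential', delay: 60_000 } },
  'outbound-webhooks': { attempts: 6, backoff: { type: 'exponential', delay: 60_000 } },
  'batch-fast':        { attempts: 3, backoff: { type: 'exponential', delay: 300_000 } },
  'batch-slow':        { attempts: 5, backoff: { type: 'exponential', delay: 300_000 } },
} as const;

type QueueName = keyof typeof QUEUE_DEFS;

// Example construction (requires bullmq + a live Redis connection):
//   import { Queue } from 'bullmq';
//   const queues = Object.fromEntries(
//     Object.entries(QUEUE_DEFS).map(([name, defaultJobOptions]) =>
//       [name, new Queue(name, { connection: redis, defaultJobOptions })]),
//   );
```

Keeping the inventory in one typed object means the API service and the worker service share a single source of truth for queue names.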
Per-Job-Type Specification
Transactional emails
- Queue: user-fast for password resets (user is waiting); user-medium for receipts (less urgent)
- Idempotency key: email:${userId}:${templateId}:${eventId} — same key = same email, deduped
- Retry policy: 3 attempts, 1min/5min/15min backoff
- Failure: DLQ; daily review by support team
Image thumbnails
- Queue: user-fast
- Idempotency key: thumb:${imageId}:${size} — keyed by image + size variant
- Retry policy: 2 attempts, immediate then 30s backoff
- Failure: DLQ; user can retrigger by re-uploading
- Note: CPU-bound — limit concurrency per worker (1-2 per CPU)
PDF reports (monthly bulk)
- Queue: batch-slow
- Idempotency key: report:${customerId}:${period} — once per customer-period
- Retry policy: 5 attempts, 5min/15min/30min/1h/2h backoff
- Failure: DLQ; alert support
- Note: memory-heavy; size workers accordingly (~512MB each)
Webhook delivery (to customer URLs)
- Queue: outbound-webhooks
- Idempotency key: webhook:${eventId} — enforced by us, not by customer
- Retry policy: 6 attempts with 1m/5m/15m/1h/4h/24h backoff (~30h total retry window)
- Failure: DLQ + customer notification email + UI shows 'webhook failed'
- Timeout: 10s per delivery attempt (customer endpoints may be slow)
- Note: customer-facing — clear error reporting matters
Analytics sync
- Queue: batch-fast
- Idempotency key: analytics-sync:${batchTimestamp} — one per 15-min window
- Retry policy: 3 attempts
- Failure: DLQ + alert; reconcile with next batch
- Schedule: cron at :00, :15, :30, :45 of each hour
Stripe webhook processing
- Queue: user-medium
- Idempotency key: stripe:${eventId} — Stripe's event.id
- Retry policy: 5 attempts, exponential 5s/25s/125s/625s/~52min backoff
- Failure: DLQ + alert (financial data — high priority)
- Note: event handlers idempotent at DB level (UPSERT, unique constraint)
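The DB-level idempotency that note relies on can be sketched as a handler that claims the event id behind a unique constraint, then runs the side effect. (The `EventStore` interface and `UniqueViolation` error here are hypothetical stand-ins for your DB client, not a real API.)

```typescript
// Hypothetical minimal DB surface: insert() throws UniqueViolation when the
// id already exists (i.e. a unique constraint on the event-id column).
class UniqueViolation extends Error {}

interface EventStore {
  insert(eventId: string): Promise<void>;
}

// Process a Stripe event at most once per event.id, even under concurrent
// retries: claim the id first via the unique constraint, then apply effects.
async function processStripeEvent(
  store: EventStore,
  eventId: string,
  apply: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  try {
    await store.insert(eventId); // the unique constraint is the dedup gate
  } catch (err) {
    if (err instanceof UniqueViolation) return 'duplicate';
    throw err;
  }
  await apply();
  return 'processed';
}
```

In production you would also record completion status (or delete the claim on failure) so a crash between claim and apply doesn't silently drop the event.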
Priority Strategy
BullMQ supports priority within a queue + multiple queues. Use both:
Across queues: each queue gets its own dedicated worker pool (a BullMQ Worker binds to a single queue), so user-fast capacity is never consumed by lower-priority work.
Within queue: for user-fast and user-medium, use BullMQ's priority option — password reset (priority 1) before receipt (priority 5).
await emailQueue.add('send-email', { ... }, { priority: 1 }); // password reset
await emailQueue.add('send-email', { ... }, { priority: 5 }); // receipt
Lower number = higher priority.
Worker Pool Sizing
Per queue:
- user-fast: 2 workers, concurrency 10 each (IO + light CPU). Total: 20 jobs concurrent.
- user-medium: 2 workers, concurrency 20 each (IO-heavy). Total: 40 concurrent.
- outbound-webhooks: 1 worker, concurrency 50 (highly IO-bound). Total: 50 concurrent.
- batch-fast: 1 worker, concurrency 5 (IO + some CPU). Total: 5 concurrent.
- batch-slow: 2 workers, concurrency 1 each (CPU + memory heavy). Total: 2 concurrent.
Reasoning:
- IO-bound jobs benefit from high concurrency (Node async)
- CPU-bound jobs limit to ~1 per CPU core to avoid contention
- Memory-heavy jobs: low concurrency, dedicated workers
Total worker resources:
- 5 small workers (1 vCPU, 512MB) for fast queues
- 2 medium workers (2 vCPU, 1GB) for batch-slow
- ~$50-80/month on Fly.io / Railway
Retry & DLQ
Backoff strategy (exponential + jitter):
await queue.add('job-name', data, {
attempts: 5,
backoff: {
type: 'exponential',
delay: 5000, // 5s base
},
// Note: BullMQ's built-in exponential backoff does not add jitter;
// add it via a custom backoff strategy in the worker's settings if needed
});
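If you do want jitter, BullMQ supports a custom backoff strategy on the worker (the `settings.backoffStrategy` wiring below follows BullMQ's documented shape, but verify it against your version; the full-jitter formula itself is a common pattern, not BullMQ-specific):

```typescript
// Full-jitter exponential backoff: delay drawn uniformly from
// [0, min(cap, base * 2^(attempt-1))]. Pure function, so it's unit-testable.
function fullJitterDelay(
  attemptsMade: number,
  baseMs = 5_000,
  capMs = 30 * 60_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attemptsMade - 1));
  return Math.floor(rand() * ceiling);
}

// Worker wiring (jobs must opt in with { backoff: { type: 'custom' } }):
//   new Worker('user-medium', handler, {
//     connection: redis,
//     settings: { backoffStrategy: (attemptsMade) => fullJitterDelay(attemptsMade) },
//   });
```

Full jitter spreads retries across the whole window, which is what breaks up the retry storms described in the anti-patterns list.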
DLQ pattern:
- BullMQ's failed job state is your DLQ
- Don't auto-delete failed jobs
- Build admin tool to inspect + replay
Admin DLQ tool:
// /admin/dlq
GET /admin/dlq?queue=user-fast&since=2026-04-25
// Returns: list of failed jobs with id, name, lastError, failedAt
POST /admin/dlq/replay
body: { jobId: 'abc123' }
// Re-enqueues to original queue
POST /admin/dlq/replay-bulk
body: { queue: 'user-fast', failedSince: '2026-04-28', errorPattern: 'TimeoutError' }
// Bulk replay matching pattern
DELETE /admin/dlq/:jobId
// Permanently delete
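Behind those endpoints the replay logic can stay thin. A sketch against a minimal queue interface so it's testable without Redis (BullMQ's `Queue#getFailed(start, end)` and `Job#retry()` fit roughly this shape, but treat the exact signatures as assumptions to verify):

```typescript
// Minimal shapes for what the DLQ tool needs; BullMQ's Queue/Job classes
// satisfy something close to this structurally.
interface FailedJob {
  id: string;
  failedReason: string;
  retry(): Promise<void>;
}
interface DlqQueue {
  getFailed(start: number, end: number): Promise<FailedJob[]>;
}

// Bulk replay: re-enqueue every failed job whose error matches a pattern
// (backs the POST /admin/dlq/replay-bulk endpoint sketched above).
async function replayMatching(
  queue: DlqQueue,
  errorPattern: RegExp,
  limit = 1_000,
): Promise<string[]> {
  const failed = await queue.getFailed(0, limit - 1);
  const replayed: string[] = [];
  for (const job of failed) {
    if (errorPattern.test(job.failedReason)) {
      await job.retry(); // moves the job back to waiting in its original queue
      replayed.push(job.id);
    }
  }
  return replayed;
}
```

Coding against the narrow interface keeps the admin tool decoupled from BullMQ internals, which helps if you later hit one of the migration triggers.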
Alerts:
- DLQ size >0 for 'must-not-lose' queues (emails, reports, webhooks, Stripe) → page within 30 min
- DLQ size >50 for any queue → Slack alert
- DLQ growth rate >10/hour → Slack alert (something systemic)
- Oldest-failed-job-age >24h → daily digest
Idempotency Pattern
Key strategy by job type listed above. Critical patterns:
// Use BullMQ's jobId for queue-level dedup
await queue.add('job-name', data, {
jobId: idempotencyKey, // BullMQ rejects duplicate jobIds
});
// Defense in depth: handler also idempotent at DB layer
async function processEmail(jobData) {
const dedupKey = `email:${jobData.userId}:${jobData.templateId}:${jobData.eventId}`;
const alreadySent = await db.email_sent.findUnique({ where: { dedupKey } });
if (alreadySent) return { status: 'duplicate' };
// Send email
const result = await sendgrid.send(...);
// Record success (unique constraint on dedupKey guards concurrent retries)
await db.email_sent.create({ data: { dedupKey, sendgridId: result.id } });
return { status: 'sent' };
}
Observability
Per queue, track:
- queue.depth (gauge: pending jobs)
- queue.age.oldest (gauge: how stale is the oldest pending job)
- queue.processed.rate (counter: completed/sec)
- queue.failed.rate (counter: failed/sec)
- queue.duration (histogram: job processing time)
- queue.wait.duration (histogram: time from enqueue to start)
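The two gauges can be sampled from the queue itself. A sketch against a small interface (BullMQ exposes `getWaitingCount()` and `getWaiting(start, end)` with a `timestamp` on each job, though the head-of-list ordering assumed here should be verified against your BullMQ version):

```typescript
interface PendingJob { timestamp: number } // enqueue time, ms epoch
interface MetricsQueue {
  getWaitingCount(): Promise<number>;
  getWaiting(start: number, end: number): Promise<PendingJob[]>;
}

// Sample queue.depth and queue.age.oldest for one queue; run on an interval
// and push the results to your metrics backend as gauges.
async function sampleQueueGauges(
  queue: MetricsQueue,
  now: number = Date.now(),
): Promise<{ depth: number; oldestAgeMs: number }> {
  const depth = await queue.getWaitingCount();
  // Assumes the oldest pending job sits at the head of the waiting list.
  const [oldest] = await queue.getWaiting(0, 0);
  return { depth, oldestAgeMs: oldest ? now - oldest.timestamp : 0 };
}
```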
Alert rules:
- queue.age.oldest for user-fast >30s → page (user-waiting backlog)
- queue.depth for user-medium >500 sustained 10min → Slack (capacity issue)
- queue.failed.rate >5% → Slack (something's wrong)
- queue.depth for batch-slow >2000 during month-end → expected, no alert
- Worker process not consuming for >5 min → page (worker died)
Dashboard:
- Per-queue depth + age over last 24h
- Top 5 failing job types
- DLQ size trends
- Worker resource utilization
Implementation Skeleton
/services/queue/
  src/
    queues.ts   (queue configurations + connections)
    types.ts    (job data type definitions)
  package.json
/services/worker/
  src/
    workers/
      user-fast-worker.ts
      user-medium-worker.ts
      outbound-webhooks-worker.ts
      batch-fast-worker.ts
      batch-slow-worker.ts
    handlers/
      send-email.ts
      generate-thumbnail.ts
      generate-pdf-report.ts
      deliver-webhook.ts
      sync-analytics.ts
      process-stripe-event.ts
    metrics.ts
    server.ts   (worker process bootstrap)
  Dockerfile
  fly.toml
/services/api/   (your existing)
  uses queues.ts to ENQUEUE jobs; doesn't process them
Worker bootstrap pattern:
// /services/worker/src/server.ts
import { Worker } from 'bullmq';
import { redis } from './redis';
import { logger } from './logger'; // assumed logging module
import { metrics } from './metrics';
import { handleEmail } from './handlers/send-email';
import { handleThumbnail } from './handlers/generate-thumbnail';
// ... other handlers
const userFastWorker = new Worker(
'user-fast',
async (job) => {
switch (job.name) {
case 'send-email': return handleEmail(job.data);
case 'generate-thumbnail': return handleThumbnail(job.data);
default: throw new Error(`Unknown job ${job.name}`);
}
},
{
connection: redis,
concurrency: 10,
}
);
// ... similar for other queues
userFastWorker.on('failed', (job, err) => {
logger.error({ jobId: job?.id, error: err }, 'Job failed');
metrics.increment('queue.failed', { queue: 'user-fast', name: job?.name });
});
userFastWorker.on('completed', (job, result) => {
metrics.increment('queue.completed', { queue: 'user-fast', name: job.name });
});
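The bootstrap should also drain on SIGTERM so in-flight jobs finish before Fly.io/Railway recycles the container. A sketch (BullMQ's `Worker#close()` waits for active jobs by default; the signal/exit injection is just plumbing so the function is testable without Redis):

```typescript
// Close all workers on SIGTERM, letting in-flight jobs complete, then exit.
function registerGracefulShutdown(
  workers: Array<{ close(): Promise<void> }>,
  onSignal: (handler: () => void) => void = (h) => process.once('SIGTERM', h),
  exit: (code: number) => void = (code) => process.exit(code),
): void {
  onSignal(() => {
    Promise.all(workers.map((w) => w.close()))
      .then(() => exit(0))
      .catch(() => exit(1));
  });
}

// Usage at the bottom of server.ts:
//   registerGracefulShutdown([userFastWorker /* , ...other workers */]);
```

Without this, a deploy mid-job relies entirely on retries + idempotency; with it, most deploys produce zero retried jobs.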
Scaling Boundaries
This architecture handles:
- Up to ~500K jobs/day comfortably
- Up to ~2K/sec sustained
- 5x your current peak month-end volume
Migration triggers (>3y from now):
- Total volume >1M jobs/day → consider sharding queues across Redis Cluster
- Need workflow orchestration (multi-step jobs with state) → migrate to Inngest or Temporal
- Need exactly-once semantics (BullMQ is at-least-once) → migrate to Temporal
- Cross-region replication needs → managed services (AWS SQS, Google Pub/Sub)
What This Architecture Won't Solve
- Won't handle exactly-once delivery. BullMQ is at-least-once; idempotency at handler is your defense.
- Won't enforce ordering across queues. Within a queue, FIFO. Across queues, no ordering.
- Won't replace workflow engines. Long-running multi-step (with checkpoints, compensations) needs Inngest/Temporal.
- Won't compensate for poor handler design. If handlers aren't idempotent, retries cause data corruption — queue can't fix.
- Won't auto-scale workers. You need to monitor + manually scale (or set up auto-scaling separately).
Migration from Existing
Week 1: Infrastructure
- Set up dedicated worker service on Fly.io / Railway
- BullMQ queues defined
- Move existing image thumbnail BullMQ to new structure
Week 2: Email migration
- Replace synchronous Sendgrid with email queue + worker
- Idempotency keys at enqueue
- Test in staging — verify no email duplicates
Week 3: Stripe webhook migration
- Build outbound-webhooks queue
- Move Stripe webhook processing to user-medium queue
- Receiver does sig verify + enqueue (Receiver pattern from webhook-handler-architect)
Week 4: Reports + analytics
- Move PDF report generation to batch-slow
- Move analytics sync to batch-fast cron schedule
Week 5: Webhook delivery to customers
- Build outbound-webhooks queue with retry
- Migrate from no-retry direct send
- DLQ + customer notification on permanent failure
Week 6: Observability + DLQ tools
- Datadog dashboard
- Admin DLQ replay tool
- Alert configuration
Maintenance Cadence
Weekly:
- DLQ review per queue. Patterns?
- Queue depth + age trends — anything degrading?
Monthly:
- Worker resource utilization. Time to scale up/down?
- Job duration trends — anything getting slower?
- Cost review (Redis + worker hosts)
Quarterly:
- Architecture audit. New job types added? Queue partitioning still right?
- Scale test: replay 3 days of historical jobs to validate capacity
Key Takeaways
- 5 queues by job class (priority + latency), not by domain. user-fast / user-medium / outbound-webhooks / batch-fast / batch-slow.
- BullMQ for your scale + team. Inngest/Temporal only when complex workflows justify.
- Move workers off Vercel to dedicated infra. Serverless is wrong for long-running queue workers.
- Idempotency keys at job creation + DB-level dedup at handler. Defense in depth.
- DLQ inspection tool from day one. Build admin replay UI; don't wait until you need it at 3am.
- Migrate in 6 weeks. Email first (highest current pain), then Stripe webhooks, then reports + analytics + outbound webhooks.
Common use cases
- Engineer adding async processing to a synchronous service
- Backend lead consolidating ad-hoc workers into structured queue infrastructure
- Solo founder hitting 'queue gets stuck' problems
- Team migrating from SQS to BullMQ (or any queue migration)
- Architect designing for high-throughput batch processing
- Engineer evaluating Inngest / Trigger.dev / Temporal for workflow orchestration
Best AI model for this
Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Pro tips
- One queue per job class (priority, latency tolerance), not per business domain. 'high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications.'
- Idempotency keys at job creation. If retried, same key = same job. Prevents duplicate processing.
- Priority lanes prevent slow jobs from blocking fast ones. Don't put a 30-min report job in the same queue as 100ms user-waiting jobs.
- Retry with exponential backoff + jitter. Hammer-retry causes downstream cascades.
- DLQ inspection tool from day one. 'Why is X stuck?' must be answerable in 30 seconds.
- Job concurrency != worker count. 1 worker can process 50 concurrent IO-bound jobs (async). 1 worker per CPU-bound job.
- Monitor queue depth + age, not just job rate. A queue with stable depth + growing oldest-job age means workers are falling behind.
Customization tips
- List ALL job types with realistic volume + latency. The queue partitioning depends on knowing the full job mix.
- Be honest about job shape (IO-bound vs CPU-bound). Concurrency settings differ dramatically; CPU-bound jobs need worker isolation.
- Specify reliability per job type. 'Must not lose' vs 'OK to drop' shapes retry policy + DLQ alerts.
- Mention infrastructure constraints (existing Redis, AWS-native, serverless-only). Queue choice depends on what you can deploy.
- Specify scale targets at multiple time horizons. 1y vs 3y projections shape architecture (BullMQ at 100K/day vs 10M/day differs).
- Use the Workflow Orchestration Mode variant if your jobs are multi-step processes (signup → verify → onboard → trigger emails) — Temporal/Inngest beats raw queues for that.
Variants
BullMQ / Redis Mode
For Node.js + Redis stacks — emphasizes BullMQ patterns, Redis Cluster scaling.
AWS SQS Mode
For AWS-native — emphasizes SQS standard vs FIFO, Lambda integration, DLQ patterns.
Workflow Orchestration Mode
For long-running multi-step flows — Inngest / Trigger.dev / Temporal architecture for stateful workflows.
Migration Mode
For migrating from one queue infra to another — emphasizes parallel-run, cutover, rollback.
Frequently asked questions
How do I use the Background Job Queue Designer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Background Job Queue Designer?
Claude Opus 4. Queue architecture needs reasoning about throughput, ordering, failure modes, and infrastructure constraints — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Can I customize the Background Job Queue Designer prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: One queue per job class (priority, latency tolerance), not per business domain. 'high-priority' + 'low-priority' beats 'emails' + 'reports' + 'notifications.'; Idempotency keys at job creation. If retried, same key = same job. Prevents duplicate processing.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals