Every Node.js backend of any size eventually needs a queue. Emails, webhooks, AI calls, image processing, nightly reports — the list of work that should not block an HTTP request only grows. The question is which queue, and the answer matters because the decision ties you to an operational model for years. Self-hosted Redis, managed event platform, long-running durable tasks, a humble Postgres table, or SQS — each wins in different places and fails in others. This guide walks through the five options that matter in 2026, the patterns that make them work, and the framework we actually use to pick on client projects.
The five contenders
These aren't the only queues in existence — there's Kafka, RabbitMQ, Temporal, Hatchet, and a long tail of regional cloud services — but five options cover the realistic shortlist for a Node.js SaaS team in 2026.
| Option | Model | Hosting | Best at | Watch out for |
|---|---|---|---|---|
| BullMQ | Redis-backed job queue | Self-hosted | High throughput, full control, mature tooling | You own the Redis |
| Inngest | Event-driven step functions | Managed (SaaS) | Fan-out, scheduled events, serverless-native | Vendor lock-in, step-function learning curve |
| Trigger.dev | Durable long-running tasks | Managed or self-host | Tasks that take minutes to hours, plain async code | Younger ecosystem than BullMQ |
| Postgres (pg-boss / graphile-worker) | SKIP LOCKED queue in your DB | Wherever your Postgres is | Low ops surface, transactional enqueue | Throughput ceiling around 100-200 jobs/sec |
| AWS SQS | Managed distributed queue | Managed (AWS) | Cheap, durable, massive scale, decoupled services | No scheduling, minimal tooling |
BullMQ has roughly 14 million monthly npm downloads and is the most battle-tested option in the Node ecosystem. That doesn't mean it's the right pick for you — it means the rough edges are well-documented and the answers to your questions are already on GitHub.
Start with the simplest thing that could work
If the product is pre-revenue or sub-10K jobs per day, a Postgres table with pg-boss or graphile-worker is almost always the right answer. You already have Postgres. Jobs live in the same transaction as the user action that created them (enqueue and insert, commit together or not at all — a property that Redis-based queues cannot give you). Observability is a SQL query. There's no extra service to monitor, no extra volume to back up, no extra thing to break at 3am.
The right first queue is almost always the one that adds no new operational surface. A Postgres-based queue turns "we need background jobs" from a week-long infrastructure project into a Monday-afternoon task. Upgrade when you actually need to, not before.
The ceiling on Postgres queues is around 100-200 jobs per second before lock contention starts hurting your main database. Below that, the tradeoffs favor Postgres. Above it, BullMQ or a managed service starts making sense.
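To make the mechanism concrete, here is a minimal sketch of the SKIP LOCKED claim pattern that pg-boss and graphile-worker build on. Everything here is illustrative — the jobs table, its columns, and the DbClient interface are made up for the example, and the real libraries ship a hardened version of this loop.

```typescript
// Minimal SKIP LOCKED worker sketch — illustrative only; pg-boss and
// graphile-worker implement a production-grade version of this pattern.
// Structural stand-in for a node-postgres client/pool.
interface DbClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// Claim up to `batch` pending jobs, skipping rows another worker has
// already locked, and mark them active in a single statement.
export function claimJobsSql(batch: number): string {
  return `
    UPDATE jobs SET state = 'active', started_at = now()
    WHERE id IN (
      SELECT id FROM jobs
      WHERE state = 'pending' AND run_at <= now()
      ORDER BY run_at
      LIMIT ${batch}
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload`;
}

// One polling pass: claim a batch, process each job, record the outcome.
export async function workOnce(
  db: DbClient,
  handler: (payload: unknown) => Promise<void>,
  batch = 10,
): Promise<number> {
  const { rows } = await db.query(claimJobsSql(batch));
  for (const job of rows) {
    try {
      await handler(job.payload);
      await db.query(`UPDATE jobs SET state = 'done' WHERE id = $1`, [job.id]);
    } catch {
      await db.query(`UPDATE jobs SET state = 'failed' WHERE id = $1`, [job.id]);
    }
  }
  return rows.length;
}
```

FOR UPDATE SKIP LOCKED is what lets multiple workers poll the same table without double-claiming a job — locked rows are simply skipped rather than waited on.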
BullMQ — the workhorse
BullMQ is the default pick once throughput matters, you already run Redis, and you want full control. It handles thousands of jobs per second per Redis instance, has first-class support for priorities, delays, repeatable (cron) jobs, parent-child workflows, and rate limits. The ecosystem includes BullBoard (an admin UI that ships in under 20 lines) and integrations for every major framework.
```ts
// BullMQ worker — the pattern we ship on most projects
import { Queue, Worker, QueueEvents } from "bullmq";

// Job payload type; sendEmail, logger, and metrics are app-level helpers.
type EmailJob = { userId: string; template: string };

const connection = { host: process.env.REDIS_HOST!, port: 6379 };

export const emailQueue = new Queue<EmailJob>("email", { connection });

// Producer — enqueue from anywhere in the app
await emailQueue.add(
  "welcome",
  { userId, template: "welcome" },
  {
    attempts: 5,
    backoff: { type: "exponential", delay: 2_000 },
    removeOnComplete: { count: 1_000, age: 24 * 3600 },
    removeOnFail: { count: 5_000 },
  },
);

// Worker — runs in its own process
const worker = new Worker<EmailJob>(
  "email",
  async (job) => {
    await sendEmail(job.data);
  },
  { connection, concurrency: 20, limiter: { max: 50, duration: 1_000 } },
);

worker.on("failed", (job, err) => {
  logger.error({ jobId: job?.id, err }, "email job failed");
});

// Event stream for dashboards and alerts
new QueueEvents("email", { connection }).on("failed", ({ jobId, failedReason }) => {
  metrics.increment("email.queue.failed", { reason: failedReason });
});
```

The pattern above is most of what you need in production. Five attempts with exponential backoff is a sensible default for anything touching the network. removeOnComplete and removeOnFail keep Redis memory bounded — without them, a month later you'll discover that a finished-jobs list has eaten 12 GB of RAM. Concurrency and limiter together give you a worker that can absorb bursts without DoS-ing downstream services.
Redis is the single point of failure for BullMQ. If your Redis loses data (default persistence is weaker than Postgres), you lose jobs. Turn on both RDB and AOF persistence for any queue you care about, and test your failover before you need it.
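As a starting point, these are the relevant redis.conf directives — an illustrative baseline, not a universal recommendation; tune the snapshot and fsync cadence to how much job loss you can tolerate:

```conf
# redis.conf — persistence baseline for a queue you care about
appendonly yes           # AOF: log every write operation
appendfsync everysec     # fsync the AOF once per second (at most ~1s of loss)
save 900 1               # RDB snapshot if >=1 change in 15 minutes
save 300 10              # ...or >=10 changes in 5 minutes
save 60 10000            # ...or >=10000 changes in 1 minute
```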
Inngest — when the programming model matters more than the infra
Inngest flips the model. Instead of enqueueing jobs, you send events. Functions subscribe to events and run as HTTP endpoints on your existing platform (Vercel, Netlify, Fly, whatever). Inngest's hosted orchestrator handles retries, sleeps, fan-out, and durability. There's no worker process to run and no Redis to babysit.
The killer feature is step functions. A function can call step.run for each stage of work, and Inngest persists the result of each step. If the function fails partway through, the retry resumes from the last completed step — no duplicated side effects, no half-done workflows. step.sleep lets a function pause for days and then continue. For flows that span multiple services or wait on human input (onboarding drips, approval workflows, trial-to-paid transitions), this model is a much better fit than raw job queues.
The tradeoffs: you're on a managed service, your workflows are locked to Inngest's SDK, and debugging step functions takes a weekend to get used to. For products that fit the event-driven shape, it pays for itself quickly. For a product that just needs to send emails in the background, it's overkill.
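The resume-from-last-step behavior is easy to see in miniature. The sketch below is a toy memoizing step runner — emphatically not the Inngest SDK — that shows why a retried function skips side effects for steps that already completed: each step's result is persisted under its name, and a replay returns the stored result instead of re-running the work.

```typescript
// Toy step runner — NOT the Inngest SDK, just the core idea: persist each
// step's result so a retry resumes after the last completed step.
type StepStore = Map<string, unknown>;

function makeStep(store: StepStore) {
  return {
    async run<T>(name: string, fn: () => Promise<T>): Promise<T> {
      if (store.has(name)) return store.get(name) as T; // replay: skip the side effect
      const result = await fn();
      store.set(name, result); // "durably" persisted (in-memory for the demo)
      return result;
    },
  };
}

// A two-step workflow where step 2 throws on the first attempt.
export async function demo(): Promise<string[]> {
  const store: StepStore = new Map();
  const sideEffects: string[] = [];
  let failOnce = true;

  const workflow = async () => {
    const step = makeStep(store);
    await step.run("charge-card", async () => {
      sideEffects.push("charged");
      return "ch_123";
    });
    await step.run("send-receipt", async () => {
      if (failOnce) { failOnce = false; throw new Error("smtp blip"); }
      sideEffects.push("receipted");
      return "ok";
    });
  };

  await workflow().catch(() => {}); // attempt 1 fails at step 2
  await workflow();                 // retry: step 1 replays from the store
  return sideEffects;               // the card was charged exactly once
}
```

The card-charge side effect runs once even though the workflow executed twice — that is the property Inngest gives you across process restarts and days-long sleeps.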
Trigger.dev — the long-running task specialist
Trigger.dev v3 solves a specific problem that BullMQ and Inngest both struggle with: tasks that legitimately take minutes to hours to complete. Video processing, long AI pipelines, multi-step data imports, scheduled reports that touch the whole database. Trigger runs these on dedicated workers that don't cold-start and don't time out at 60 seconds the way serverless platforms do.
The programming model is plain async functions — no step wrappers, no event schemas to design up front. You write code that looks like code. Trigger is Apache-2.0 licensed and can be self-hosted for free with unlimited runs, which is unusual in the managed-queue space. If your workload has long tasks and you'd rather not run Redis yourself, it's the first thing we'd try.
SQS — when you're already on AWS and scale is the problem
SQS is the boring, correct answer when you need cross-service decoupling at AWS scale. It's cheap (fractions of a cent per thousand messages), durable (multi-AZ by default), and scales to millions of messages without a second thought. The tradeoffs: no native scheduling (you'll pair it with EventBridge), no rich admin UI (you'll build one or live in CloudWatch), and polling-based consumption that adds latency most Node.js SaaS teams don't need to accept.
For an event backbone between microservices, SQS plus SNS plus Lambda is still an excellent stack. For in-process background jobs inside a monolith, it's more plumbing than the problem deserves.
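The control flow of an SQS consumer is worth seeing once. The sketch below is dependency-free for illustration: in real code the receive/remove callbacks would wrap ReceiveMessageCommand and DeleteMessageCommand from @aws-sdk/client-sqs, and the interface names here are invented stand-ins.

```typescript
// SQS-style consumer loop, with the queue API injected so the control flow
// is clear without the AWS SDK. Message and QueueApi are illustrative types.
type Message = { id: string; body: string };

interface QueueApi {
  receive(max: number): Promise<Message[]>; // long poll; [] when the wait times out
  remove(id: string): Promise<void>;        // delete = ack; undeleted messages reappear
}

export async function consume(
  q: QueueApi,
  handle: (body: string) => Promise<void>,
  opts: { batches: number; batchSize?: number } = { batches: Infinity },
): Promise<number> {
  let processed = 0;
  for (let i = 0; i < opts.batches; i++) {
    const msgs = await q.receive(opts.batchSize ?? 10);
    for (const m of msgs) {
      try {
        await handle(m.body);
        await q.remove(m.id); // ack only on success
        processed++;
      } catch {
        // Leave the message alone: SQS redelivers it after the visibility
        // timeout, and a redrive policy moves it to a DLQ after maxReceiveCount.
      }
    }
  }
  return processed;
}
```

Note that "retry" in SQS is just not deleting the message — the visibility timeout and redrive policy do the rest, which is why the retry and DLQ patterns below still apply.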
Retries, backoff, and idempotency — the patterns that matter
The queue you pick matters less than whether your jobs are retry-safe. Every queue retries on failure. If your job isn't idempotent, retries will double-charge cards, send duplicate emails, or corrupt state. Three patterns solve most of it.
- Exponential backoff with jitter: attempts spaced 2s, 4s, 8s, 16s (plus random jitter). Without jitter, retries synchronize into a thundering herd that hits your failing service every 2 seconds in lockstep.
- Idempotency keys: every job carries a unique key (the job ID works). The operation checks a deduplication table before committing — already-processed keys are a no-op success. Required for any job that calls an external API with side effects.
- Dead-letter queues: after N attempts, move the job to a DLQ instead of retrying forever. Alert on DLQ size, not on individual failures. A DLQ with 10,000 jobs in it is the signal; a single failed job is noise.
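The first two patterns fit in a few lines. In this sketch the dedup "table" is an in-memory Set standing in for a real database table with a unique index, and the jitter fraction is a tunable choice, not a standard:

```typescript
// Exponential backoff with jitter: 2s, 4s, 8s, 16s... plus up to 50% random
// extra, so synchronized retries spread out instead of herding.
export function backoffMs(attempt: number, baseMs = 2_000, capMs = 60_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // attempt is 0-indexed
  const jitter = Math.random() * exp * 0.5;           // desynchronizes the herd
  return Math.floor(Math.min(capMs, exp + jitter));
}

// Idempotency guard: the Set stands in for a dedup table. In a real system
// the key check and the side effect commit in the same transaction.
const processed = new Set<string>();

export async function runOnce(
  idempotencyKey: string,
  effect: () => Promise<void>,
): Promise<"done" | "skipped"> {
  if (processed.has(idempotencyKey)) return "skipped"; // retried job: no-op success
  await effect();
  processed.add(idempotencyKey);
  return "done";
}
```

With this shape, a retry of a job that already succeeded reports success without re-running the side effect — which is exactly what makes the queue's retries safe to enable.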
Observability — what to watch
A queue without metrics is a black box that eats your jobs. The minimum set of signals worth tracking, whatever queue you pick: depth (jobs waiting), age of oldest waiting job, in-flight count, failure rate, and DLQ size. Alert on age and DLQ size, not on depth — depth spikes during bursts are normal; jobs that have been waiting 10 minutes are not.
The single most useful dashboard panel for a queue is a line chart of oldest-waiting-job-age. It tells you more than queue depth, failure rate, or throughput combined. If that number is climbing, your workers can't keep up — everything else is a detail.
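Computing that signal is trivial anywhere you can list waiting jobs; the enqueuedAt field name below is illustrative — map it to whatever timestamp your queue records:

```typescript
// Age (ms) of the oldest waiting job — the queue-health signal to alert on.
// `enqueuedAt` is an illustrative field name (epoch milliseconds).
type WaitingJob = { enqueuedAt: number };

export function oldestWaitingAgeMs(waiting: WaitingJob[], now = Date.now()): number {
  if (waiting.length === 0) return 0; // empty queue: healthy
  const oldest = Math.min(...waiting.map((j) => j.enqueuedAt));
  return now - oldest;
}
```

Emit this on a timer from any worker and alert when it crosses a threshold you choose from your SLOs (say, a few minutes for user-facing jobs).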
The decision framework
- Under 10K jobs/day and you have Postgres? Use pg-boss or graphile-worker. Stop looking.
- High throughput, self-hosted, already running Redis? BullMQ. It's the default for a reason.
- Event-driven product, serverless platform, want step functions and retries handled for you? Inngest.
- Long-running tasks (minutes to hours), prefer plain async code? Trigger.dev.
- Cross-service decoupling at AWS scale? SQS plus SNS.
- None of the above — you need custom priority logic, multi-region replication, exactly-once delivery? Consider Temporal or Hatchet and budget engineering time accordingly.
Key takeaways
- Start with the smallest ops footprint that works. Postgres-backed queues solve 80% of use cases with zero new infrastructure.
- BullMQ is the mature, high-throughput default once Postgres runs out of room. Configure persistence, retention, and concurrency deliberately.
- Inngest and Trigger.dev are different answers to different questions. Inngest is about event-driven orchestration; Trigger is about long-running durable code.
- Retry strategy, idempotency keys, and dead-letter queues matter more than queue choice. Every queue retries; only you can make retries safe.
- Watch oldest-waiting-job-age before anything else. It's the earliest signal that your queue is breaking.
- Upgrade when you hit a real limit, not when you read a blog post. Premature infrastructure is a tax on delivery.