Zero-downtime deployment stopped being a premium concern around 2018 — the tooling is commodity now, and the patterns are well understood. What trips teams up in 2026 is not the deployment mechanism but the application-layer details underneath it: graceful SIGTERM handling, honest health checks, draining in-flight requests, and the WebSocket connections that do not magically reconnect when a process restarts. This piece walks through three patterns that cover most Node.js production workloads — PM2 cluster reload for single-box deploys, blue-green on Fly or Nginx for multi-machine setups, and canary traffic splits when risk demands a slower rollout. Each has a sweet spot and a set of failure modes worth knowing before the first incident.
## The three patterns and when to use them
Every zero-downtime deploy is a variation on the same trick: start the new version, confirm it works, shift traffic to it, stop the old version. The patterns differ in how gradual that shift is and how much infrastructure they need.
| Pattern | How it works | Best for | Tradeoffs |
|---|---|---|---|
| PM2 cluster reload | Rolling restart across worker processes on a single machine | Monoliths on a VPS, early-stage SaaS | Single point of failure, no cross-machine safety |
| Blue-green | Full new environment boots, traffic switches when healthy | Multi-machine deploys, platforms like Fly or K8s | Briefly doubles infrastructure cost during deploy |
| Canary | Small traffic slice to new version, widened if healthy | High-risk changes, large user bases, regulated workloads | Requires traffic splitting, version-aware routing, more orchestration |
| Rolling update | Replace instances N at a time, one machine down briefly | Kubernetes default, cluster-of-identical-nodes setups | Mid-deploy mixed versions, short windows of reduced capacity |
Most SaaS products never need more than PM2 reload plus a standby machine. Blue-green and canary matter once the business lives on the deploy pipeline — at that point the extra complexity buys real insurance.
## PM2 cluster reload — the 80% pattern
PM2's cluster mode runs multiple worker processes of the same Node.js app behind the same port. On reload, PM2 spawns a new worker with the fresh code, waits for it to start listening, then sends SIGINT to one of the old workers — repeating the dance until every worker has been replaced. The old workers get a grace window to drain in-flight requests before PM2 terminates them. On a four-core box with four workers, the app serves traffic continuously throughout the reload.
```bash
# One-time setup — run the app in cluster mode across all cores
pm2 start server.js -i max --name api

# Deploy a new version with zero downtime
git pull && npm ci --production
pm2 reload api --update-env

# Wait, and verify
pm2 status
pm2 logs api --lines 50
```

The reload command is the whole deploy. What makes it actually zero-downtime is the graceful shutdown handler in the app — without it, PM2 kills workers mid-request and users see 502s during deploys.
```ts
// server.ts — the shutdown handler that makes PM2 reload work
import http from "node:http";
import { app } from "./app";

const server = http.createServer(app);
server.listen(Number(process.env.PORT) || 3000);

let shuttingDown = false;

// App-specific teardown — close DB pool, flush logs, drain queues
async function cleanup(): Promise<void> {
  // e.g. await dbPool.end(); await queue.close();
}

// PM2 sends SIGINT on reload; SIGTERM comes from orchestrators
for (const signal of ["SIGINT", "SIGTERM"] as const) {
  process.on(signal, () => {
    if (shuttingDown) return;
    shuttingDown = true;
    // Stop accepting new connections, finish in-flight ones
    server.close(() => {
      void cleanup().then(() => process.exit(0));
    });
    // Hard exit if cleanup hangs — PM2 kills after 1600ms by default
    setTimeout(() => process.exit(1), 5_000).unref();
  });
}

// Health check should fail as soon as shutdown begins
app.get("/health", (_, res) => {
  res.status(shuttingDown ? 503 : 200).send(shuttingDown ? "draining" : "ok");
});
```

PM2's default kill timeout is 1600ms. If the app has long-running requests — report generation, LLM streaming, file uploads — increase PM2_KILL_TIMEOUT or the reload turns into a 1.6-second outage per worker. Test with realistic traffic before trusting it in production.
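The timeout knobs belong in PM2's ecosystem file rather than ad-hoc flags. A sketch using PM2's documented `kill_timeout`, `wait_ready`, and `listen_timeout` options — the values are illustrative, not recommendations:

```javascript
// ecosystem.config.js — deploy with `pm2 start ecosystem.config.js`
module.exports = {
  apps: [
    {
      name: "api",
      script: "server.js",
      instances: "max",      // one worker per CPU core
      exec_mode: "cluster",
      kill_timeout: 10_000,  // ms to drain before SIGKILL (default 1600)
      wait_ready: true,      // wait for process.send('ready') from the worker
      listen_timeout: 8_000, // give up if the worker never signals ready
    },
  ],
};
```

With `wait_ready: true`, PM2 only moves on to killing the next old worker after the new one explicitly calls `process.send('ready')` — a stronger signal than merely binding the port.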
## Blue-green — the safe upgrade
Blue-green stands up a complete second copy of the environment — 'green' — alongside the live one — 'blue'. Green boots with the new code, runs migrations, passes health checks against production data (or a read replica), then traffic flips. If anything breaks, the flip reverses in seconds. The cost is running two environments briefly; the benefit is atomic, reversible deploys.
On Fly, blue-green is a one-line configuration change — Fly boots the new Machines alongside the old ones, waits for health checks, then cuts traffic over. On bare Nginx, it's a two-upstream config with a single-line swap.
```toml
# fly.toml — one-line blue-green
[deploy]
strategy = "bluegreen"

[http_service]
internal_port = 3000
force_https = true
auto_stop_machines = "stop"
auto_start_machines = true

# Health check must reflect real readiness
[[http_service.checks]]
interval = "10s"
timeout = "2s"
grace_period = "5s"
method = "GET"
path = "/health"
```

```nginx
# Nginx blue-green on a VPS — two upstreams, one active
upstream blue  { server 127.0.0.1:3001; }
upstream green { server 127.0.0.1:3002; }

# The single line that flips production
upstream active { server 127.0.0.1:3001; }  # or :3002

server {
    listen 443 ssl http2;
    location / {
        proxy_pass http://active;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

The deploy script boots green on port 3002, runs smoke tests, rewrites the upstream block to point at green, and reloads Nginx with SIGHUP — which drains existing connections gracefully. On failure, revert the one line and reload again. The whole operation is atomic from the user's perspective and recoverable within a minute.
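That flip can be scripted in a few lines. This is a sketch, not a hardened script — the paths, port, and upstream file location are assumptions to adapt to your layout:

```shell
#!/usr/bin/env bash
# deploy-green.sh — boot green, smoke-test it, flip Nginx to it
set -euo pipefail

GREEN_PORT=3002
APP_DIR=/srv/app-green                              # assumed checkout location
UPSTREAM_CONF=/etc/nginx/conf.d/active-upstream.conf # holds the `upstream active` block

# 1. Boot the new version on the idle port
cd "$APP_DIR"
git pull && npm ci --production
PORT=$GREEN_PORT pm2 start server.js --name api-green

# 2. Smoke-test green before it sees real traffic
curl --fail --max-time 5 "http://127.0.0.1:${GREEN_PORT}/ready"

# 3. Rewrite the single upstream line and reload (SIGHUP drains old workers)
echo "upstream active { server 127.0.0.1:${GREEN_PORT}; }" > "$UPSTREAM_CONF"
nginx -t && nginx -s reload

# Rollback: write the old port back into $UPSTREAM_CONF and reload again
```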
## Canary — when risk demands it
Canary deploys send a small slice of traffic — typically 1% to 5% — to the new version, monitor error rates and latency, and widen the slice only if metrics stay clean. The pattern shines for high-risk changes on large user bases: database migrations, payment flow edits, AI model swaps. The operational cost is real — traffic splitting, version-aware observability, and a rollback path — but the blast radius of a bad deploy drops from 'everyone' to 'the first 2% to click refresh'.
Three common canary shapes in 2026: Fly's native canary strategy (one new Machine first, rolling deploy after), a load balancer with weighted upstreams, or feature flags wrapping the new code path. Feature flags are often the simplest path — the old code and new code ship in the same binary, and a runtime switch controls the split.
```ts
// Canary via feature flag — simplest approach for most teams
import { app } from "./app";
import { isFlagEnabled } from "./flags";
import { handleCheckoutV1, handleCheckoutV2 } from "./checkout";

app.post("/checkout", async (req, res) => {
  // New payment flow rolled out to 5% of users — the percentage lives in the flag service
  const userId = req.user.id; // assumes auth middleware has populated req.user
  const useNewFlow = await isFlagEnabled("checkout_v2", { userId });
  if (useNewFlow) {
    return handleCheckoutV2(req, res);
  }
  return handleCheckoutV1(req, res);
});
```

Canary only works if observability can distinguish new-version traffic from old. Tag requests with a version label in logs, metrics, and traces from day one — retrofitting this later is painful.
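A minimal version-tagging middleware is a one-time cost. This sketch assumes the running version is injected at deploy time via an `APP_VERSION` environment variable (a hypothetical name — a git SHA or release tag works well); the `(req, res, next)` shape is Express-compatible:

```typescript
// version-tag.ts — stamp every response with the running build version
const APP_VERSION = process.env.APP_VERSION ?? "dev";

// Structural types so the middleware stays framework-agnostic
type Req = { headers: Record<string, unknown> };
type Res = { setHeader(name: string, value: string): void };

function versionTag(req: Req, res: Res, next: () => void): void {
  // Lets load balancers, logs, and browser devtools see which version answered
  res.setHeader("x-app-version", APP_VERSION);
  next();
}
```

Wire it in early (`app.use(versionTag)`) and add the same `APP_VERSION` field to your structured logger so dashboards can split error rates by version.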
## Draining in-flight requests
The single most common cause of 'zero-downtime deploys' actually dropping requests is a shutdown handler that ignores in-flight work. The pattern is always the same: stop accepting new connections, finish what is running, then exit. Node's http.Server.close() does the first two; the third needs a timeout escape hatch in case a request hangs.
- HTTP requests drain naturally via server.close() — new connections are rejected, active ones complete.
- WebSocket connections do not drain automatically. Send a close frame and give clients a few seconds to reconnect against the new version before force-closing.
- Background jobs need a separate drain. BullMQ workers should await worker.close() before exit — otherwise jobs get stranded mid-execution.
- Database pools should close after HTTP traffic stops, not before, or in-flight queries fail mid-shutdown.
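The WebSocket drain from the list above can be sketched as a small helper. The `Drainable` shape below mirrors the client objects in the `ws` library (`readyState`, `close`, `terminate`), but that shape is an assumption — adapt it to your WebSocket library:

```typescript
// drain-ws.ts — close WebSocket clients politely before process exit
interface Drainable {
  readyState: number;                        // 1 = OPEN in the ws library
  close(code: number, reason: string): void; // sends a close frame
  terminate(): void;                         // hard TCP teardown
}

async function drainWebSockets(
  clients: Iterable<Drainable>,
  graceMs = 3000,
): Promise<void> {
  // 1001 "going away" tells well-behaved clients to reconnect (to the new version)
  for (const c of clients) {
    if (c.readyState === 1) c.close(1001, "server restarting");
  }
  // Grace window for the close handshake to complete
  await new Promise((resolve) => setTimeout(resolve, graceMs));
  // Force-close anything still hanging on
  for (const c of clients) {
    if (c.readyState === 1) c.terminate();
  }
}
```

In the shutdown handler, call `await drainWebSockets(wss.clients)` after `server.close()` begins and before the database pool is torn down.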
A health check that returns 200 'OK' the instant the process starts is lying. Health checks should only pass after the database pool is connected, migrations have run, and the process can actually serve a real request. Orchestrators that trust optimistic health checks flip traffic into a broken version and cause exactly the outage they were meant to prevent.
## Health checks that do not lie
A production-grade health check has three layers. A liveness probe that only returns 500 if the process is wedged and needs a kill. A readiness probe that returns 503 during startup, during shutdown, or when downstream dependencies are unreachable — this is the one load balancers watch. And optionally a deep health endpoint that exercises actual database queries, for monitoring dashboards rather than deploy gates.
```ts
// Three-layer health check — the pattern that holds up in production
import { app } from "./app";
import { db, dbPool, redis } from "./clients";

// Liveness: answers as long as the event loop is alive
app.get("/live", (_, res) => res.status(200).send("ok"));

// Readiness: the endpoint the load balancer watches.
// shuttingDown is the flag flipped by the SIGTERM handler;
// .connected is whatever readiness flag your DB client exposes.
app.get("/ready", (_, res) => {
  if (shuttingDown) return res.status(503).send("draining");
  if (!dbPool.connected) return res.status(503).send("db not ready");
  return res.status(200).send("ok");
});

// Deep health: exercises real dependencies — for dashboards, not deploy gates
app.get("/health/deep", async (_, res) => {
  try {
    await db.query("SELECT 1");
    await redis.ping();
    return res.status(200).json({ db: "ok", redis: "ok" });
  } catch (err) {
    return res.status(503).json({ error: String(err) });
  }
});
```

## What we recommend by stage
Stage matters more than technology here. For a pre-product-market-fit SaaS, PM2 reload on a single VPS plus honest shutdown handling is genuinely enough — spending a sprint on canary infrastructure before there are customers to canary against is the wrong bet. Once the product has paying users and is shipping weekly, move to blue-green on Fly or a managed platform — the strategy flag is one line, and atomic rollback pays for itself on the first bad deploy. Canary becomes worth the complexity once the user base crosses a size where even a 1% regression means hundreds of affected users, or when changes touch payments, auth, or billing.
## Key takeaways
- Zero-downtime deploys are an application-layer problem more than an infrastructure one. Graceful SIGTERM handling, honest readiness probes, and explicit request draining do 80% of the work.
- PM2 cluster reload is enough for most single-box Node.js apps, provided the shutdown handler actually drains in-flight requests.
- Blue-green on Fly or Nginx is the next step up — one line of config for atomic, reversible deploys.
- Canary deploys earn their complexity on high-risk changes and large user bases. Feature flags are often the simplest path.
- Health checks that return 200 before the app is actually ready are the most common cause of deploy-triggered outages. Make readiness probes honest.