Backend · 11 min read

Building rate limiters that scale: Redis, sliding windows, and token buckets

A practical guide to rate limiting at scale — fixed windows, sliding windows, token buckets, leaky buckets, and the Redis patterns that keep them honest across nodes.

Rate limiting is one of those backend primitives that looks simple on a whiteboard and turns out to be a minefield in production. A single-node counter in memory works until the second server boots. A naive INCR in Redis works until a burst crashes into a window boundary. And the moment your traffic starts mattering to the business, every bug in your limiter becomes a user-facing outage or a budget-eating runaway. This guide walks through the four algorithms worth knowing, the Redis patterns that make them correct across nodes, and the traps that catch teams the first time they ship rate limiting at real volume.

The four algorithms, compared

There are really only four rate-limiting algorithms in common production use. Everything else is a variant or a composition of these. Picking the right one is the first decision — and it's almost always a tradeoff between burst tolerance, implementation complexity, and memory cost.

| Algorithm | Burst behavior | Memory per key | Fairness | Best for |
| --- | --- | --- | --- | --- |
| Fixed window | Allows 2x burst at window edges | 1 counter | Low | Coarse throttling, low-stakes endpoints |
| Sliding window log | Strictly enforced | One entry per request | High | Low-volume APIs where accuracy matters |
| Sliding window counter | Strictly enforced (approximated) | 2 counters | High | General-purpose public APIs |
| Token bucket | Configurable burst up to bucket size | Tokens + timestamp | Medium | Bursty workloads, user-facing quotas |
| Leaky bucket | No burst; steady drain | Queue length + timestamp | High | Traffic shaping, downstream protection |

Fixed windows are the simplest and the worst behaved — a user can fire the full quota in the last second of one window and the full quota again in the first second of the next. The sliding window log is the most accurate but the most expensive, because every request writes an entry. The sliding window counter approximates it with two counters and is what most production APIs actually ship. Token bucket is the right pick when you want to reward good citizens with burst capacity. Leaky bucket is the right pick when what you're protecting downstream cannot absorb a spike at all.
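The window-edge burst is easy to see in a simulation. This is a minimal single-node sketch (names and shape are ours, not from any library) of a fixed-window counter, driven with a caller-supplied clock so the edge case is reproducible:

```typescript
// Illustrative in-memory fixed-window limiter. Single node only — this is
// exactly the thing that stops working when the second server boots.
function makeFixedWindow(limit: number, windowMs: number) {
  const counters = new Map<number, number>();
  return (nowMs: number): boolean => {
    const window = Math.floor(nowMs / windowMs); // which window are we in?
    const count = counters.get(window) ?? 0;
    if (count >= limit) return false;
    counters.set(window, count + 1);
    return true;
  };
}

// 60/minute limit; fire 60 requests at t=59.9s and 60 more at t=60.1s.
const hit = makeFixedWindow(60, 60_000);
let admitted = 0;
for (let i = 0; i < 60; i++) if (hit(59_900)) admitted++;
for (let i = 0; i < 60; i++) if (hit(60_100)) admitted++;
// All 120 requests are admitted: 2x the nominal rate inside a 200 ms span,
// because the window boundary at t=60s resets the counter.
```

A sliding window sees the same 120 requests as one 200 ms burst and rejects the second half; that difference is the whole argument for the extra complexity.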

Token bucket and leaky bucket are mathematically equivalent under the same parameters — the difference is whether excess requests get rejected immediately (token) or queued and drained at a constant rate (leaky). Pick based on whether your product should say "try again" or "please hold."

Why Redis, and why Lua

Once you have more than one application server, rate limiter state has to live somewhere shared. Redis is the default answer for three reasons: it's fast enough that a limiter check adds roughly a millisecond to request latency, it has the primitives (INCR, EXPIRE, sorted sets, hashes) that map cleanly to every algorithm above, and it supports atomic Lua scripts that let you do the whole check-and-update sequence in a single round trip.

The Lua piece matters more than people realize. A sliding window implemented as four separate Redis calls — prune, count, compare, add — is a race condition waiting to happen. Under load, two requests can both pass the count check before either of them writes. Put the same four steps in a Lua script and Redis runs them as one atomic operation. No race, no double-spend, no off-by-one.

A sliding window in 20 lines of Lua

Here's the pattern we reach for first when a client needs a correct, distributed rate limiter. A Redis sorted set holds one entry per request, scored by timestamp. The Lua script prunes anything outside the window, counts what remains, and either rejects or admits the new request. All five operations happen atomically inside Redis.

// Sliding window rate limiter — Node.js + ioredis + Lua
import type { Redis } from "ioredis";

const SLIDING_WINDOW = `
local key       = KEYS[1]
local now       = tonumber(ARGV[1])
local windowMs  = tonumber(ARGV[2])
local limit     = tonumber(ARGV[3])
local reqId     = ARGV[4]

-- drop entries that fall outside the window
redis.call("ZREMRANGEBYSCORE", key, 0, now - windowMs)

-- count what's left
local count = redis.call("ZCARD", key)
if count >= limit then
  return { 0, count }
end

-- admit the request; timestamp is both the score and (uniquified) member
redis.call("ZADD", key, now, reqId)
redis.call("PEXPIRE", key, windowMs)
return { 1, count + 1 }
`;

export async function allow(
  redis: Redis,
  userId: string,
  limit = 60,
  windowMs = 60_000,
) {
  const key = `rl:${userId}`;
  const reqId = `${Date.now()}-${crypto.randomUUID()}`;
  const [allowed, count] = (await redis.eval(
    SLIDING_WINDOW,
    1,
    key,
    Date.now().toString(),
    windowMs.toString(),
    limit.toString(),
    reqId,
  )) as [number, number];
  return { allowed: allowed === 1, remaining: Math.max(0, limit - count) };
}

Two things to notice. First, the request ID is timestamp plus UUID — not just timestamp. Two requests landing in the same millisecond would collide on a timestamp-only score and one would silently disappear from the set. Second, the PEXPIRE fires on every admit so the key cleans itself up when traffic stops; there's no cron job watching for stale keys.

The clock-skew trap: Date.now() on your application server is not the same as the time inside Redis. If your app nodes drift by seconds (and they will, especially under load or after a cold start), window boundaries become inconsistent. Use redis.call('TIME') inside the Lua script as the source of truth, or accept that your windows are approximate and set limits with a 5-10% safety margin.
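Sourcing time from Redis is a small change to the script above. A sketch of the adjusted opening lines, assuming Redis 5+ (where scripts replicate by effects, so calling TIME inside a script is permitted):

```lua
-- Redis TIME returns { seconds, microseconds }; derive 'now' inside the
-- script so every app node shares the same clock.
local t   = redis.call("TIME")
local now = t[1] * 1000 + math.floor(t[2] / 1000)  -- milliseconds
-- ARGV[1] no longer carries a timestamp; the rest of the script
-- (prune, count, ZADD, PEXPIRE) is unchanged.
```

The tradeoff is that the member ID still comes from the app side, so keep the UUID suffix; only the score moves to Redis time.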

Picking the right key: per-user, per-IP, per-route

The algorithm is only half the design. The other half is what you hash into the Redis key. Get this wrong and you either punish innocent users or leave attack vectors wide open.

  • Per-authenticated-user: the default for anything behind login. Attach the limit to the user ID, not the session, so a malicious client rotating sessions still hits the same ceiling.
  • Per-IP: mandatory for unauthenticated endpoints (signup, login, password reset). Be aware that corporate NATs and mobile carriers will route thousands of legitimate users through a single IP — set the limit accordingly and plan for allowlists.
  • Per-route: layer a cheap bucket on hot endpoints (password reset, search, AI endpoints) on top of the user bucket. Someone hammering /api/search shouldn't be able to take down /api/checkout.
  • Composite: the strongest protection combines all three — per-user for fairness, per-IP for unauthenticated defense, per-route for blast-radius containment. Run them in parallel and reject if any one limit is breached.
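The composite check is a one-liner once each layer exposes the same interface. A sketch under the assumption that every limiter returns `{ allowed: boolean }` like the `allow()` function above (the `Check` type and `allowAll` name are ours):

```typescript
// Run per-user, per-IP, and per-route checks in parallel; reject if any fails.
type Check = () => Promise<{ allowed: boolean }>;

async function allowAll(checks: Check[]): Promise<boolean> {
  const results = await Promise.all(checks.map((c) => c()));
  return results.every((r) => r.allowed);
}
```

One design note: running the checks in parallel means each bucket spends a token even when a sibling bucket rejects. That slight over-counting is usually acceptable; if it isn't, run the cheapest check first and short-circuit, at the cost of an extra round trip in the worst case.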

Token buckets when you want to reward good citizens

Sliding windows are strict. A user at 60/minute who sends 61 in one minute gets rejected, even if they've been idle for the previous hour. Token buckets let you bank unused capacity and spend it as a burst — which is often what humans actually expect from a rate limit.

The state is just two numbers per key: current token count and last-refill timestamp. On each request, calculate how many tokens have been added since the last refill (rate × elapsed), clamp to the bucket size, subtract one if admitting. All in a Lua script, as ever, to keep it atomic. Anthropic's tier system, AWS API Gateway, and most public APIs with quotas use a token-bucket or token-bucket-plus-sustained-rate hybrid for exactly this reason.
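The refill arithmetic is the whole algorithm. Here is a single-node sketch of that math (our names; the Redis version stores the same two numbers in a hash and does the identical calculation in Lua), with the clock supplied by the caller so the behavior is deterministic:

```typescript
// Token bucket: holds up to `capacity` tokens, refilled at `ratePerSec`.
function makeTokenBucket(capacity: number, ratePerSec: number) {
  let tokens = capacity; // start full: an idle client has banked its burst
  let lastRefill = 0;    // ms; caller supplies time

  return (nowMs: number): boolean => {
    // Refill: rate × elapsed, clamped to the bucket size.
    const elapsedSec = (nowMs - lastRefill) / 1000;
    tokens = Math.min(capacity, tokens + elapsedSec * ratePerSec);
    lastRefill = nowMs;
    if (tokens < 1) return false;
    tokens -= 1;
    return true;
  };
}

// Capacity 10, sustained 1 req/s: an idle client can burst 10 requests
// instantly, then is throttled to the refill rate.
const take = makeTokenBucket(10, 1);
let burst = 0;
for (let i = 0; i < 12; i++) if (take(0)) burst++;
// burst === 10: the bucket absorbs the burst, then rejects until refill.
```

Note the bucket starts full; starting it empty punishes a brand-new client for having no history, which is rarely what you want.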

Stampedes, hot keys, and the fairness problem

Three failure modes show up once you're past toy-traffic volumes, and none of them are obvious until they bite.

  • Stampedes happen when cached responses expire for many users at once and every one of them bursts through the limiter toward the origin. Stagger cache TTLs with jitter, and if the limiter is your last line of defense, add request coalescing (single-flight) in front of expensive endpoints.
  • Hot keys happen when one tenant or one route accounts for more limiter traffic than a single Redis shard can handle. A single limiter key will pin to one CPU core in Redis; past a few tens of thousands of ops per second, you need to shard by a hash of (key, bucket_id).
  • Fairness breaks when you rate-limit globally but your traffic distribution is heavily skewed. A 100 req/s global limit shared by a thousand tenants will be consumed almost entirely by the top five. Either set per-tenant limits, or bucket tenants into tiers with independent quotas.
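The TTL-jitter fix for stampedes is small enough to show inline. A sketch (function name and the ±10% default are our choices, not a library API):

```typescript
// Spread cache expirations across ±jitterFraction of the base TTL so a
// cohort of keys written together doesn't all expire in the same instant.
function jitteredTtlMs(baseMs: number, jitterFraction = 0.1): number {
  const delta = baseMs * jitterFraction;
  return Math.round(baseMs + (Math.random() * 2 - 1) * delta);
}
```

Ten thousand keys cached with `jitteredTtlMs(60_000)` expire across a 12-second band instead of a single tick, which turns the stampede into a gentle slope.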

When a limiter rejects, return HTTP 429 with a Retry-After header and an X-RateLimit-Remaining / X-RateLimit-Reset pair. Well-behaved clients will back off; badly-behaved ones reveal themselves in your logs. Both outcomes are useful.
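What that rejection looks like from the middleware's side can be sketched as a plain function (the `X-RateLimit-*` names are a widely used convention rather than a standard, and `Retry-After` in seconds is the form most clients expect):

```typescript
// Build the response headers for a 429 from a limiter decision.
function rateLimitHeaders(
  limit: number,
  remaining: number,
  resetEpochSec: number, // when the current window/bucket resets
): Record<string, string> {
  const nowSec = Math.floor(Date.now() / 1000);
  return {
    "Retry-After": String(Math.max(0, resetEpochSec - nowSec)),
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, remaining)),
    "X-RateLimit-Reset": String(resetEpochSec),
  };
}
```

In an Express or Hono handler this is just `res.set(rateLimitHeaders(...)).status(429)` versus the framework's equivalent; the point is that the numbers come straight from the limiter's return value, not from a second Redis read.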

What we actually ship

For most production SaaS apps we build, the stack is: sliding window counter in Redis as the primary limiter, per-user keys for authenticated traffic, per-IP keys for anonymous endpoints, and a separate token-bucket quota for anything that costs real money (AI endpoints, outbound emails, payment attempts). The whole thing sits behind an Express or Hono middleware that rejects with 429 and emits OpenTelemetry metrics so we can see limiter hit rates per route in Grafana. Total code surface area is under 200 lines. It does not break.

When we need more than Redis can give us — sustained >100K ops/s, regional failover, or spend-based quotas that cross service boundaries — we reach for a managed service like Upstash Ratelimit or an edge-side limiter at the CDN layer. Neither is necessary for 95% of SaaS products, and both add complexity that's easy to regret.

Key takeaways

  • Pick the algorithm for the behavior you want users to experience: fixed window is coarse, sliding window is strict, token bucket rewards good citizens, leaky bucket protects fragile downstreams.
  • Redis plus Lua is the right primitive for distributed limiters at the scale most SaaS products operate. Atomicity is not optional — four separate calls is a race condition.
  • Clock skew between app nodes is real and will misalign your windows. Source time from Redis or build in a safety margin.
  • Per-user, per-IP, and per-route limits are layers, not alternatives. Run all three and reject on the first breach.
  • Stampedes, hot keys, and fairness skew are the failure modes that catch you past toy volume. Plan for them before you need to.
  • Return 429 with Retry-After and quota headers. Clients that honor them help you; clients that don't identify themselves.
#rate-limiting #redis #backend #api #scalability #distributed-systems
Working on something similar?

Let's build it together.

We ship production SaaS, marketplaces, and web apps. If you want an engineering partner — not a consultancy — let's talk.