Feature flags look like one thing and behave like three. A kill switch protects production from a deploy gone wrong. A progressive rollout limits blast radius while a feature reaches more users. An experiment splits traffic to measure whether a change actually moves the needle. These three use cases have different lifespans, different governance needs, and different stakeholders — and conflating them is how SaaS codebases end up with thousands of undocumented booleans that nobody dares delete. This piece covers the tradeoff between building it yourself and the four major vendors, the patterns for each flag type, and the hygiene that keeps flag debt from eating engineering time.
The three flag archetypes
Before picking a tool, separate the use cases. Every flag in the system belongs in exactly one of three buckets, and each bucket has a different exit plan.
- Kill switches live for the lifetime of the feature they protect. Their job is to turn a risky path off fast, ideally from a dashboard that a non-engineer can reach during an incident. Lifespan: effectively permanent, but reviewed annually.
- Progressive rollouts live for weeks. They start a feature at 1% of traffic and ramp to 100% as metrics stay green. Once at 100%, the flag and the old branch get removed in the next sprint. Lifespan: weeks.
- Experiments live for the length of the test — typically 2 to 6 weeks. They split users into variants, track conversion or engagement metrics, and either ship the winner or keep the control. Lifespan: bounded by statistical power.
If a flag can't say which bucket it belongs to, it's debt in waiting. Require a bucket, an owner, and an expected removal date as fields on every new flag — and block merges if those fields are missing.
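As a sketch, the required metadata can live in a small typed registry validated in CI. The field names below are illustrative, not any vendor's schema:

```typescript
// Illustrative flag-registry entry. Requiring these fields at creation is
// what makes the later cleanup reports and merge-blocking checks possible.
type FlagBucket = "kill-switch" | "rollout" | "experiment";

interface FlagDefinition {
  key: string;
  bucket: FlagBucket;
  owner: string;           // team or person who gets the cleanup ticket
  removeBy: string | null; // ISO date; null is allowed only for kill switches
}

// Run this against the registry in CI and fail the merge on any errors.
export function validateFlag(flag: FlagDefinition): string[] {
  const errors: string[] = [];
  if (!flag.owner) errors.push(`${flag.key}: owner is required`);
  if (flag.bucket !== "kill-switch" && !flag.removeBy) {
    errors.push(`${flag.key}: rollouts and experiments need a removal date`);
  }
  return errors;
}
```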
The vendor landscape in 2026
The space consolidated around five serious options. Each has a real fit — none of them is the right answer for everyone. The comparison below assumes a SaaS product with 100k to 1M active users and an engineering team of 5 to 30.
| Tool | Strength | Shape | Good when | Cost profile |
|---|---|---|---|---|
| LaunchDarkly | Enterprise governance, approval workflows, mature SDKs | Closed-source SaaS | You need SOC 2 boilerplate, role-based flag changes, and scheduled releases out of the box | High — priced per MAU, adds up at scale |
| Unleash | Open source, self-hostable, infrastructure-style controls | Open source + hosted | You want flags-as-infra, self-hosting, or strict data-residency | Free self-hosted; reasonable hosted tiers |
| GrowthBook | Warehouse-native experimentation, CDN-served configs | Open source + hosted | Experimentation is the primary use case and metrics live in a warehouse | Free self-hosted; hosted scales with requests |
| PostHog | Flags wired to analytics, session replay, error tracking | Open source + hosted | You want one tool for flags plus product analytics and don't need enterprise approvals | Pay-per-request; generous free tier |
| DIY (postgres + Redis) | Zero vendor lock-in, full control | Internal service | You have fewer than 50 flags, simple targeting, and want no external dependency | Engineering time — cheaper than it looks until it isn't |
The decision usually comes down to two questions. First: is experimentation a core product function or a side activity? If it's core, GrowthBook or PostHog win because the flags and the metrics share a data model. Second: does the organisation need approval workflows, audit trails, and role-based flag changes for compliance? If yes, LaunchDarkly or Unleash Enterprise earn their premium. Everyone else should consider PostHog or a DIY path for the first year.
A feature gate, written the boring way
Regardless of vendor, the right shape for a feature gate is a single function call with a default value and a context object. The default value is what the code does if the flag service is unreachable — which means flags can never take down your product.
```ts
// A feature gate that degrades safely
import { flags } from "@/lib/flags";

export async function renderInvoice(user: User, invoice: Invoice) {
  const useNewPdfRenderer = await flags.isEnabled("pdf-renderer-v2", {
    user: {
      id: user.id,
      orgId: user.orgId,
      plan: user.plan,
      country: user.country,
    },
    defaultValue: false, // if flag service is down, old path runs
  });

  if (useNewPdfRenderer) {
    return renderPdfV2(invoice);
  }
  return renderPdfV1(invoice);
}

// A kill switch reads the same way — default true, flipped to false during incident
export async function chargeCustomer(customer: Customer, amount: number) {
  const chargesEnabled = await flags.isEnabled("stripe-charges-enabled", {
    defaultValue: true,
  });

  if (!chargesEnabled) {
    throw new ServiceUnavailableError("charges temporarily disabled");
  }
  return stripe.charges.create({ customer: customer.id, amount });
}
```

Two properties make this shape scale. First, the call is a pure function of the flag name and a context — no hidden global state, no framework coupling. Second, the default value is explicit at the call site, so any code reviewer can tell what happens when the flag service fails. Both properties survive the transition from one vendor to another.
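One way to make that vendor transition concrete is a thin client interface with a wrapper that enforces the safe-default rule. The names below are hypothetical, not any specific SDK's API — each real vendor SDK gets a small adapter that implements this interface:

```typescript
// Hypothetical vendor-agnostic flag interface. Swapping providers then
// means writing one new adapter, not touching every call site.
export interface FlagContext {
  user?: { id: string; orgId?: string; plan?: string; country?: string };
  defaultValue: boolean;
}

export interface FlagClient {
  isEnabled(key: string, ctx: FlagContext): Promise<boolean>;
}

// Wraps any adapter so an unreachable flag service always yields the
// explicit default from the call site, never an exception.
export function withSafeDefault(inner: FlagClient): FlagClient {
  return {
    async isEnabled(key, ctx) {
      try {
        return await inner.isEnabled(key, ctx);
      } catch {
        return ctx.defaultValue; // flag service down: run the default path
      }
    },
  };
}
```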
Progressive rollouts: the ramp that keeps incidents small
A 1-5-25-50-100 ramp over 7 to 14 days catches the majority of production regressions at a fraction of the full blast radius. The discipline is not in the ramp itself but in the gate between steps. Each step is a commitment: error rate, latency p95, and the metric the feature is supposed to move must all stay within bounds for a defined window before the next step happens.
- Day 0: enable for the internal staff org plus a small handful of trusted design-partner accounts. Run for 24 hours.
- Day 1: ramp to 1% of traffic, consistently hashed by user ID so individual users are sticky.
- Day 2: ramp to 5% if error rate and p95 latency on the gated code path stay within 5% of pre-rollout baselines.
- Day 4: ramp to 25%. Verify product metric direction — if the feature is meant to increase activation, is activation trending up in the treatment cohort?
- Day 7: ramp to 50%, then 100% over the next two days. Remove the flag from the code in the following sprint.
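The sticky hashing the ramp relies on fits in a few lines. FNV-1a is used here purely for illustration — any stable, well-distributed hash works, and salting with the flag key keeps different flags from slicing users identically:

```typescript
// 32-bit FNV-1a hash: deterministic, cheap, and good enough distribution
// for bucketing (illustrative choice, not a vendor's algorithm).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// A user lands in bucket 0..99 once and stays there, so raising `percent`
// only ever adds users — nobody flaps in and out as the ramp proceeds.
export function inRollout(flagKey: string, userId: string, percent: number): boolean {
  const bucket = fnv1a(`${flagKey}:${userId}`) % 100;
  return bucket < percent;
}
```

Because the bucket is fixed per user, the 1% cohort is a strict subset of the 5% cohort, which is a subset of the 25% cohort, and so on — the property that makes step-over-step metric comparisons meaningful.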
Auto-rollback is the highest-leverage piece of the ramp. Wire the flag service's API into the alert pipeline so a latency-p95 or error-rate breach automatically reverts the flag and pages the on-call engineer. This turns a 2 a.m. page into an after-breakfast post-mortem.
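The rollback hook itself is small. This sketch assumes a hypothetical flag-admin client and pager interface, since the real wiring differs per vendor and alerting stack:

```typescript
// Hypothetical shapes for the alert payload, flag-service admin API,
// and paging client — the real ones depend on your vendor and alerting stack.
interface AlertPayload {
  flagKey: string;
  metric: "error_rate" | "latency_p95";
  breached: boolean;
}
interface FlagAdmin { disable(key: string): Promise<void>; }
interface Pager { page(msg: string): Promise<void>; }

// Revert first, page second: the incident stops before anyone wakes up.
export async function handleAlert(
  alert: AlertPayload,
  flags: FlagAdmin,
  pager: Pager,
): Promise<boolean> {
  if (!alert.breached) return false;
  await flags.disable(alert.flagKey);
  await pager.page(`auto-rolled-back ${alert.flagKey}: ${alert.metric} breach`);
  return true;
}
```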
Experiments: the rules that keep results honest
A feature-flag platform makes it trivial to split traffic. It does not automatically make the result valid. Three failure modes are common enough to name: peeking (reading the result early and stopping when it looks good, which inflates false positives), running an experiment without a pre-registered primary metric (which lets any post-hoc narrative fit), and assignment that isn't stable across sessions (which leaks treatment and pollutes the comparison).
- Declare the primary metric, minimum detectable effect, and required sample size before traffic starts. Most flag platforms can compute the third from the first two.
- Let the experiment run the full duration. Early reads are fine for safety checks (error rate, crash rate) but not for the primary outcome.
- Use a stable hash (user ID or a persistent anonymous ID) as the assignment key. Session-based assignment biases results for any metric that spans sessions.
- Log the assignment at the edge of every treatment-sensitive code path. Downstream analytics join on the assignment event, not on flag-evaluation heuristics.
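For teams whose platform doesn't compute the sample size, the standard two-proportion formula gets close enough for planning. This sketch assumes a two-sided alpha of 0.05 and 80% power, the conventional defaults:

```typescript
// Per-variant sample size for detecting an absolute lift `mde` over a
// baseline conversion rate, using the standard two-proportion formula.
// Assumes alpha = 0.05 (two-sided) and power = 0.80.
export function sampleSizePerVariant(baselineRate: number, mde: number): number {
  const zAlpha = 1.96;  // z for two-sided 95% confidence
  const zBeta = 0.8416; // z for 80% power
  const p1 = baselineRate;
  const p2 = baselineRate + mde;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde * mde));
}
```

A 10% baseline with a 2-percentage-point minimum detectable effect needs roughly 3,800 users per variant — a quick sanity check on whether the traffic exists to run the test at all.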
Flag debt: the tax nobody budgets for
Every feature flag is code. Every flag at 100% that still has a branch in the codebase is code plus confusion. Teams that add flags faster than they remove them end up with hundreds of lines of permanent-looking conditionals, test matrices that don't fit on a screen, and onboarding docs that lie about which path is live. The only defence is mechanical hygiene.
Flag debt grows quietly until it doesn't. A codebase with 300 active flags defines up to 2^300 possible configurations — vastly more than any test suite can cover — and the first serious bug in an unexpected combination is the day the debt comes due. Set a cap (50 active flags is a sensible upper bound for most SaaS teams) and enforce it.
- Every flag has an owner and a removal date at creation. The flag platform's UI displays both.
- A weekly automated report lists flags past their removal date and assigns cleanup to the owner. It goes into a real ticket, not a Slack message.
- CI fails when a flag has been at 100% for more than 30 days without a cleanup commit. Yes, really — that's the discipline that works.
- Retire the flag and remove the dead branch in the same PR. Half-removals (flag deleted, branch still present) become booby traps.
- Annual review of every kill switch. Some are still protecting risky paths; others can be retired because the underlying risk is gone.
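The weekly staleness report reduces to a filter over flag metadata. The registry shape here is hypothetical — real flag platforms expose similar fields through their APIs:

```typescript
// Hypothetical per-flag status as exported from the flag platform's API.
interface FlagStatus {
  key: string;
  owner: string;
  percentRollout: number;     // 0..100
  at100Since: string | null;  // ISO date the flag reached 100%, if it has
}

const STALE_DAYS = 30;

// Flags fully rolled out for more than 30 days with no cleanup commit are
// the ones that get a real ticket assigned to their owner.
export function staleFlags(flags: FlagStatus[], now: Date): FlagStatus[] {
  return flags.filter((f) => {
    if (f.percentRollout !== 100 || !f.at100Since) return false;
    const ageDays = (now.getTime() - new Date(f.at100Since).getTime()) / 86_400_000;
    return ageDays > STALE_DAYS;
  });
}
```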
What we'd pick for a new SaaS today
For a pre-enterprise SaaS with a small team, PostHog covers flags, experiments, and product analytics in one bill and keeps the primitives clean. For a team that already runs its own analytics warehouse and cares deeply about experimentation, GrowthBook self-hosted is the right shape. For a product hitting enterprise deals that require SOC 2-flavoured governance around flag changes, LaunchDarkly is still the least friction. Avoid DIY unless flag count will stay under 50 and targeting will stay simple — the moment either grows, a vendor is cheaper.
Key takeaways
- Separate kill switches, progressive rollouts, and experiments from day one. They have different lifespans, owners, and exit criteria.
- The right gate shape is a pure function of flag name and context, with an explicit default at the call site so the flag service is never on the critical path.
- Ramp progressively with a 1-5-25-50-100 schedule, gated on error rate and latency. Auto-rollback on breach is the highest-leverage piece.
- Run experiments with a pre-registered primary metric, stable hash assignment, and full duration. Peeking is how false positives get shipped.
- PostHog for all-in-one, GrowthBook for warehouse-native experimentation, LaunchDarkly for enterprise governance, Unleash for self-hosted infrastructure-style flags.
- Flag debt is a real tax. Owner + removal date at creation, CI enforcement at 100%, and an annual kill-switch review keep the codebase honest.