Feature flags for agents: prompts, models, tools, and policies are all flippable surfaces.
Classic feature flags gate code paths. An agent's surface area is bigger than that — prompts, tool registrations, model versions, policy thresholds, retrieval configs are all things you want to roll, scope, and rip back without a deploy. This essay is about what to put behind a flag, why "off by default" is non-negotiable for any flag with non-trivial blast radius, and what the audit trail must look like so a flip is reviewable after the fact.
The flippable surface is bigger than code.
A classic feature flag wraps an if branch around a code path. For an agent that is the small case — the interesting flags wrap the parts of the behavioral contract from rollout-and-versioning that you want to move independently. The list, in roughly decreasing blast radius:
- Model version — which dated snapshot a tenant or cohort resolves to. The single highest-impact flag you own.
- Prompt revision — which content-addressed prompt the run loads. Behavior change without a redeploy.
- Tool registration — whether a given tool is even exposed to the model for this run; the registration list is data, not code.
- Policy thresholds — the confidence cutoff a guardrail uses, the cost ceiling per task, the auto-approval limit. Tuned per cohort, not hard-coded.
- Retrieval config — which index, which top-k, which reranker; classically a deploy, productively a flag.
- Sampling parameters — temperature, top-p, max steps. Small numbers that change behavior a lot; flag-gate them once you have more than one production cohort.
The reframe: a feature flag for an agent is not "show this button" — it is a per-request override on a piece of the behavioral contract. Treat it with the seriousness that contract deserves.
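One way to picture that override, as a minimal sketch: the contract is an immutable record, and a flag evaluation yields a per-request copy with some fields replaced. The `BehavioralContract` class, its field names, the placeholder values, and `apply_overrides` are all hypothetical, chosen only to mirror the list above.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class BehavioralContract:
    # Hypothetical field names mirroring the flippable surfaces listed above.
    model_version: str
    prompt_revision: str
    tools: Tuple[str, ...]
    confidence_cutoff: float
    retrieval_top_k: int
    temperature: float


def apply_overrides(incumbent: BehavioralContract,
                    overrides: Mapping[str, Any]) -> BehavioralContract:
    """Return a per-request copy of the contract with flag overrides applied."""
    known = {f.name for f in fields(incumbent)}
    unknown = set(overrides) - known
    if unknown:
        # Refuse to flip a surface the contract does not declare.
        raise KeyError(f"unknown contract fields: {sorted(unknown)}")
    return replace(incumbent, **overrides)


incumbent = BehavioralContract(
    model_version="model-2024-01-01",  # placeholder snapshot name
    prompt_revision="sha256:aaa",      # placeholder content address
    tools=("search",),
    confidence_cutoff=0.8,
    retrieval_top_k=5,
    temperature=0.7,
)

# A flag that only lowers temperature for this request; the incumbent is untouched.
per_request = apply_overrides(incumbent, {"temperature": 0.2})
```

Keeping the incumbent frozen means a flag can never mutate the shared default; every override is visibly a copy scoped to one request.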
Off by default, for anything with non-trivial blast radius.
Standard feature-flag hygiene says "default to off, expand when proven." Agent flags need this rule even more. The reason is asymmetric cost: a flag that defaults on for a new prompt or a new tool exposes every tenant the moment the flag exists, and a regression discovered an hour later has already touched real customers, real tokens, and real side effects. A flag that defaults off requires an explicit per-cohort opt-in, which forces the discipline of a canary plus a comparison against the incumbent.
Two rules catch the common slip: (1) the default branch of an agent flag must be the incumbent behavior, never the new one, even when the new one looks safe in dev. (2) "Default on, exclude broken tenants" is the same bug as "default on": by the time you know who is broken, the blast has already landed.
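A minimal sketch of rule (1), assuming a flag is modeled as an incumbent value, a candidate value, and an explicit opt-in set. The `AgentFlag` class and the cohort names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AgentFlag:
    name: str
    incumbent: str                      # what everyone gets by default
    candidate: str                      # the new behavior under evaluation
    opted_in: Set[str] = field(default_factory=set)

    def value_for(self, cohort: str) -> str:
        # Only an explicit opt-in reaches the candidate; there is no
        # "default on, minus an exclusion list" path at all.
        return self.candidate if cohort in self.opted_in else self.incumbent


flag = AgentFlag("prompt_revision", incumbent="sha256:aaa", candidate="sha256:bbb")
before = flag.value_for("acme")         # incumbent: nobody has opted in yet
flag.opted_in.add("internal-canary")
canary = flag.value_for("internal-canary")
after = flag.value_for("acme")          # still incumbent
```

Because the structure has no exclusion list, the "default on, exclude broken tenants" pattern from rule (2) cannot be expressed by accident.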
Scope by request context, not just user-id.
The unit of evaluation that matters for an agent flag is the request, with the full agent context attached: tenant, user, run_id, the resolved release triple, the tool about to be called, and the cost so far this run. A flag platform built on user-id alone cannot express rules such as "expose the new search tool only on runs that have burned less than $0.50 so far" or "ramp the new prompt only on internal tenants whose users opted into the beta cohort."
```python
# flags/eval.py: per-request evaluation with full agent context
# active_rule() and cohort_of() are assumed to be defined elsewhere.
def resolve_flag(name, ctx):
    # ctx carries tenant, user, run_id, release, spend_so_far, tool
    rule = active_rule(name)
    if rule.scope == "tenant" and ctx.tenant in rule.allow:
        return rule.on
    if rule.scope == "cohort" and cohort_of(ctx.user) in rule.allow:
        return rule.on
    if rule.scope == "cost_cap" and ctx.spend_so_far < rule.under:
        return rule.on
    return False  # off-by-default fallback
```
Evaluate per-request, log the resolved value into the run's journal alongside the release triple, and the flag becomes part of the behavioral contract a trace is attributable to. Skip the log and you have a flag whose effect cannot be reproduced after the fact.
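The logging step could look roughly like this. It is a sketch under assumptions: the journal is modeled as a plain list, the entry field names are invented for illustration, and the resolver is passed in so the example runs on its own.

```python
import time
from types import SimpleNamespace


def evaluate_and_journal(name, ctx, resolve, journal):
    """Resolve a flag for this request and record the result in the run journal."""
    value = resolve(name, ctx)
    journal.append({
        "run_id": ctx.run_id,
        "release": ctx.release,  # the resolved release triple
        "flag": name,
        "value": value,
        "ts": time.time(),
    })
    return value


# Usage with a stand-in resolver that leaves every flag off.
ctx = SimpleNamespace(
    run_id="run-123",
    release=("model-a", "prompt-b", "tools-c"),  # placeholder triple
)
journal = []
value = evaluate_and_journal("new_search_tool", ctx, lambda n, c: False, journal)
```

Note that the "off" result is journaled too; a trace needs to show which flags were consulted, not only which ones fired.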
Flag lifetime: retire them, or pay the long-tail bill.
Stale code-path flags are a familiar nuisance; stale agent flags are a hazard. A long-lived flag gating a policy threshold gradually becomes the policy — except nobody re-reviewed it, the cohort it serves has drifted, and the tenant who turned it on three months ago has long forgotten it. The two practices that keep this in check:
- Every flag has an owner and a retirement plan from the day it lands — either it becomes the default (incumbent flips), or it is removed. "We will leave this on for one tenant forever" means turning the flag into permanent configuration; do that explicitly with a config entry, not a flag with no owner.
- Audit flag age and usage quarterly — a flag that has been at the same value for 90 days for every cohort is a flag that should be code; a flag that has not been evaluated in 30 days is a flag that should be deleted.
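The quarterly audit can be reduced to a small classifier. The sketch below assumes the flag platform can report, per flag, when its value last changed, when it was last evaluated, and whether every cohort currently resolves to the same value. `FlagStats` and `audit` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FlagStats:
    name: str
    last_changed: datetime
    last_evaluated: Optional[datetime]  # None if never evaluated
    same_value_everywhere: bool         # every cohort resolves to one value


def audit(stats: FlagStats, now: datetime) -> str:
    """Classify a flag as 'delete', 'promote_to_code', or 'keep'."""
    # Not evaluated in 30 days: nothing depends on it, so remove it.
    if stats.last_evaluated is None or now - stats.last_evaluated >= timedelta(days=30):
        return "delete"
    # Same value for every cohort for 90 days: it has become the default.
    if stats.same_value_everywhere and now - stats.last_changed >= timedelta(days=90):
        return "promote_to_code"
    return "keep"


now = datetime(2024, 6, 1)
stale = FlagStats("old_reranker", now - timedelta(days=200), now - timedelta(days=45), True)
settled = FlagStats("cost_ceiling", now - timedelta(days=120), now - timedelta(days=1), True)
live = FlagStats("new_prompt", now - timedelta(days=10), now - timedelta(days=1), False)
```

Here `audit(stale, now)` yields `"delete"`, `audit(settled, now)` yields `"promote_to_code"`, and `audit(live, now)` yields `"keep"`.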
Every flip and every evaluation is observable, or the flag does not exist.
The whole point of moving behavior into a flag is that you can flip it without a redeploy. That power earns its keep only if the flip itself is observable: who flipped what, to what, for which scope, at what time, and — separately — every per-request evaluation that ran under the new value, joinable by run_id to the journal entries in tracing-and-observability. Without the flip log, you cannot answer "did this regression start when we changed the prompt or when we flipped the threshold?"; without the per-request log you cannot answer "which runs ran under the new value?".
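To make the two logs concrete, here is a sketch with one record type per log and the query that connects them. The record shapes, the field names, and `runs_under` are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass(frozen=True)
class Flip:
    actor: str   # who flipped
    flag: str    # what
    old: Any
    new: Any     # to what
    scope: str   # for which scope
    at: float    # at what time (epoch seconds)


@dataclass(frozen=True)
class Evaluation:
    run_id: str  # join key into the run journal
    flag: str
    value: Any
    at: float


def runs_under(flip: Flip, evaluations: Iterable[Evaluation]) -> List[str]:
    """Run ids that evaluated the flipped flag to its new value after the flip."""
    return sorted({
        e.run_id
        for e in evaluations
        if e.flag == flip.flag and e.value == flip.new and e.at >= flip.at
    })


flip = Flip("alice", "guardrail_cutoff", old=0.8, new=0.7,
            scope="tenant:internal", at=100.0)
evaluations = [
    Evaluation("run-1", "guardrail_cutoff", 0.8, at=90.0),   # before the flip
    Evaluation("run-2", "guardrail_cutoff", 0.7, at=110.0),  # under the new value
    Evaluation("run-2", "guardrail_cutoff", 0.7, at=115.0),  # same run, later step
    Evaluation("run-3", "new_prompt", True, at=120.0),       # different flag
]
affected = runs_under(flip, evaluations)
```

With this data `affected` is `["run-2"]`, which is the list of runs to pull from the journal when investigating a regression after that flip.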
One more rule: the flag platform itself must fail closed. If the flag service is unreachable mid-run, the evaluation returns the incumbent default, not the new behavior. Otherwise an outage at the flag tier becomes an unannounced rollout of every flag's "on" path — a class of incident you do not want to discover the hard way.
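A fail-closed wrapper can be very small. This sketch assumes the flag client raises `ConnectionError` or `TimeoutError` when the service is unreachable; a real client may raise its own exception types, and `resolve_fail_closed` is a hypothetical name.

```python
def resolve_fail_closed(name, ctx, resolve, incumbent_default=False):
    """Evaluate a flag, falling back to the incumbent value if the service is down."""
    try:
        return resolve(name, ctx)
    except (ConnectionError, TimeoutError):
        # An outage must never look like a rollout: return the incumbent.
        return incumbent_default


def unreachable(name, ctx):
    # Stand-in for a flag service that cannot be reached mid-run.
    raise ConnectionError("flag service unreachable")


during_outage = resolve_fail_closed("new_prompt", None, unreachable)
threshold = resolve_fail_closed("guardrail_cutoff", None, unreachable,
                                incumbent_default=0.8)
```

Only transport failures are caught; any other exception still propagates, so a bug in rule evaluation is not silently masked as "off".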