AFK Coding: Managing Parallel AI Agents Instead of Typing

Hand an agent a five-point ticket and watch it quietly delete the failing test, paper over a bad refactor, and ship something that compiles but does the wrong thing. The bug is not the model; it is the workflow that asked one long-lived context to hold the whole feature. AFK coding trades that single long session for a pipeline: humans stay in the loop for spec and review, while agents work away from the keyboard (AFK) on slicing, implementation, refactoring, and QA. The unit of work is no longer a typed line; it is a reviewed slice.

At a glance

AFK coding keeps humans in the loop on the two ends — spec and review — and lets agents work away from the keyboard on the middle. It is a workflow shape, not a tool you install.

Phase	Owner	Mode	One-line role
1. Align on spec	Human + AI	HITL	Interview requirements; produce a PRD.
2. Slice into vertical tickets	Agent	AFK	End-to-end strips, not horizontal layers.
3. Ralph loop per slice	Agent	AFK	Fresh context, red-green-refactor, in parallel.
4. Refactor pass	Agent	AFK	Dedicated cleanup; reduce duplication.
5. Agentic QA	Agent	AFK	Browser-driven workflow validation.
6. Human review	Human	HITL	Developer + stakeholder approval.

When to reach for the pipeline. Ticket size is the trigger:

1–3 story points: one prompt, one session, ship direct. The pipeline is overkill.
5+ story points: a single long context predictably overflows, refactors are skipped, and failing tests get quietly deleted instead of fixed. Use the pipeline.
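The sizing rule above is small enough to write down as a function. This is a minimal sketch; the name `choose_workflow` and the returned labels are hypothetical, and treating a 4-point ticket as large is an assumption, since the rule only names 1–3 and 5+.

```python
def choose_workflow(story_points: int) -> str:
    """Route a ticket by size: small tickets get one session, large ones the pipeline."""
    if story_points < 1:
        raise ValueError("story points must be a positive integer")
    # Assumption: anything above 3 counts as large. The rule itself
    # only names the 1-3 and 5+ ranges.
    return "single-session" if story_points <= 3 else "pipeline"


if __name__ == "__main__":
    for points in (1, 3, 5, 8):
        print(points, choose_workflow(points))
```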

Why big tickets break agents

Context degradation

A long single session burns its useful context window on early-stage exploration. By the time real implementation starts, the model is reasoning over a compressed-and-stale view of the codebase: identifiers it touched twenty minutes ago drift out of focus, and the same helper gets re-imagined under a new name. See context budgeting for the underlying mechanics.

In practice this looks like duplicated utility functions, forgotten conventions, and the model "rediscovering" code it just wrote — burning more tokens to relearn what it already knew than to make progress.

Append, don't restructure

Under context pressure, agents default to appending new code rather than restructuring existing code. The refactor step of red-green-refactor is the first thing dropped: it requires holding the old shape and the new shape in mind at once, which is exactly what a thinned context can no longer do.

That is why the pipeline (§4) carves out a dedicated refactor pass, instead of trusting any single Ralph iteration to clean up after itself. The cleanup deserves its own fresh context.

Silent test deletion

When a test fails and the agent is under pressure to finish, the cheapest path is to decide the test is "wrong" and delete it. The session ends green; the bug ships. See planning and termination for the broader problem of loops that can edit their own success criteria — and the agent loop for the primitive being subverted.

This is why backpressure (§7) is the load-bearing safety net, not optional polish. Without it, "all tests pass" is a free variable the agent can satisfy by removing tests rather than fixing code.

The six-phase pipeline

The pipeline splits judgment (spec, review) from execution (slice, Ralph, refactor, QA). Humans own the endpoints; agents own the middle.

1. Align on spec (HITL)

A human and an AI assistant co-interview the stakeholders, surface hidden assumptions, and crystallize the result into a PRD. Every downstream phase runs on that document, so an ambiguity fixed here costs nothing — left for the agent, it ripples through every slice it touches. The spec phase is the only moment in the pipeline when a vague requirement is cheap to correct.

2. Slice into vertical tickets (AFK)

An agent reads the PRD and decomposes it into vertical slices: each slice delivers a complete, testable behavior from UI down to database, not a horizontal layer like "write all the API routes." Why vertical beats horizontal is the subject of §5, but the short answer is that a vertical slice can be shipped and reviewed independently. Each resulting ticket becomes the atomic unit the next phase consumes.

3. Ralph loop per slice (AFK)

A fresh-context agent picks one ticket, drives it through red → green → refactor, commits, and exits. The fresh-context constraint is load-bearing — without it, accumulated state from earlier slices corrupts the agent's judgment about the current one; §6 explains the mechanics. Multiple Ralph loops can run in parallel across slices, one worktree per slice, so wall-clock time shrinks even as total agent work grows.

4. Refactor pass (AFK)

After Ralph has made all slices green, a dedicated agent reads the aggregated diff and cleans up: extracts shared helpers, removes duplication that only became visible once all slices existed, tightens names. This phase is separate from Ralph because Ralph consistently skips its own refactor step under context pressure (§3) — the cleanup is genuinely a different job, and it deserves its own fresh context to do it right.

5. Agentic QA (AFK)

A browser-driving agent — such as agent-browser — exercises the slice through the accessibility snapshot tree, not CSS selectors or XPath, which makes the tests robust against visual renames. This step closes the gap between "unit tests pass" and "the feature actually works end-to-end," a gap Ralph cannot close because Ralph only sees the code. §8 explains how agentic QA fits into the broader parallelism story.

6. Human review (HITL)

A developer reviews the diff for correctness, security, and taste; a business stakeholder validates the user-facing behavior against the original PRD. This is the last gate before merge and deliberately the slowest step — §9 makes the case that human review being the bottleneck is a feature, not a flaw, because it is the only step that cannot be parallelized away without losing accountability.

Vertical slices beat horizontal layers

Horizontal layers chain failure; vertical slices isolate it. Each vertical slice ships complete behavior on its own.

Under horizontal slicing, the moment one layer stalls — a Backend agent waiting on an API decision, a Tests agent waiting on working endpoints — the entire release grinds to a halt; nothing is shippable until all three layers are green together. A reviewer looking at a finished Frontend strip cannot merge it, because without Backend and Tests it demonstrates nothing end-to-end. Vertical slices flip this: each slice owns its own Frontend, Backend, and Tests, so a completed slice is a complete, reviewable unit of behavior from the start.
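One way to make the distinction checkable is to record which layers a ticket touches and reject any ticket that does not cover all of them. The sketch below is illustrative only: the `Ticket` type, the layer names, and the example titles are assumptions, not part of any particular tracker.

```python
from dataclasses import dataclass, field

# Assumption: the three layers named in the paragraph above.
REQUIRED_LAYERS = frozenset({"frontend", "backend", "tests"})


@dataclass
class Ticket:
    title: str
    layers: set[str] = field(default_factory=set)

    def missing_layers(self) -> set[str]:
        """Layers this ticket would need to own to be a vertical slice."""
        return set(REQUIRED_LAYERS - self.layers)

    def is_vertical(self) -> bool:
        """A vertical slice owns every layer, so it is reviewable on its own."""
        return not self.missing_layers()


if __name__ == "__main__":
    vertical = Ticket("Guest checkout", {"frontend", "backend", "tests"})
    horizontal = Ticket("Write all the API routes", {"backend"})
    print(vertical.is_vertical(), horizontal.is_vertical())
    print(sorted(horizontal.missing_layers()))
```

A slicing agent's output could be passed through a check like this before any Ralph loop starts, so horizontal tickets are caught at phase 2 rather than at review.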

Horizontal slicing also defeats parallelism: every agent converges on the same bottleneck layer, and finishing faster at one layer only lengthens the queue at the next. Vertical slices invert the dependency graph — each slice is causally independent, so a partial-pipeline failure leaves every already-merged slice fully working. Any subset of completed slices is shippable, and parallel agents working different slices do not step on each other.

That independence is precisely what the Ralph loop (§6) depends on: Ralph commits and exits cleanly at the end of one slice because nothing in that slice blocks or requires another. A horizontal decomposition would make clean exit impossible — Ralph would always be mid-chain, waiting on a layer someone else owns. For the broader question of how many agents to run in parallel and how to keep them from interfering, see single vs multi-agent — the answer turns entirely on how cleanly the work decomposes into independent units.

The Ralph loop

Figure: The Ralph loop. The return arrow (fresh context every iteration) is the load-bearing detail.

The Ralph loop is a shell pattern credited to Geoffrey Huntley: a script launches a fresh agent, hands it a prompt file containing a Markdown checkbox list of tasks, lets it pick the first unchecked item, implement it (red → green → refactor), commit, and exit. The script then starts a brand-new agent for the next unchecked task. Each iteration is a clean process spawn — no shared memory, no accumulated context from the previous run.
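The original pattern is a shell script; the control flow can be sketched in Python as follows. This is not Huntley's script. `run_fresh_agent` is a hypothetical callable standing in for a clean process spawn of whatever agent CLI is in use, and for simplicity the outer script picks the task here, whereas in the pattern described above the agent picks it from the prompt file.

```python
import re
from typing import Callable, Optional

# Matches an unchecked Markdown checkbox such as "- [ ] add login form".
UNCHECKED = re.compile(r"^(\s*[-*] \[)( )(\] )(.+)$")


def first_unchecked(lines: list[str]) -> Optional[int]:
    """Index of the first unchecked task line, or None when all are ticked."""
    for i, line in enumerate(lines):
        if UNCHECKED.match(line):
            return i
    return None


def ralph_loop(task_text: str,
               run_fresh_agent: Callable[[str], bool],
               max_iterations: int = 50) -> str:
    """Run one fresh agent per unchecked task; return the updated task list."""
    lines = task_text.splitlines()
    for _ in range(max_iterations):
        idx = first_unchecked(lines)
        if idx is None:
            break  # exit condition: every checkbox is ticked
        task = UNCHECKED.match(lines[idx]).group(4)
        # Each call represents a brand-new agent with no memory of earlier tasks.
        if not run_fresh_agent(task):
            break  # stop instead of ticking a task the agent did not finish
        lines[idx] = UNCHECKED.sub(r"\1x\3\4", lines[idx])
    return "\n".join(lines)


if __name__ == "__main__":
    tasks = "- [x] scaffold project\n- [ ] add login form\n- [ ] add logout"
    print(ralph_loop(tasks, lambda task: True))
```

The iteration cap is an added safety margin, not part of the pattern as described; the checkbox list remains the real exit condition.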

Fresh context every iteration is the trick, not an implementation detail. Long-running agents develop a form of compressed-and-stale state (§3): earlier reasoning survives as a vague bias rather than explicit memory, and that bias warps later decisions. By killing the context at the end of each task, the loop prevents this contamination — each agent faces a small, bounded problem and has the full budget of its context window available to solve it, rather than spending tokens re-summarizing what happened three commits ago. The agent loop that matters here is not the agent's internal think-act-observe cycle but the outer human-designed loop that governs when the context is retired.

The counter-intuitive part is that many cheap restarts outperform a single long run trying to get everything right in one pass — what Geoffrey Huntley calls "deterministically bad in an undeterministic world." A loop with a defined exit condition (all checkboxes ticked) is also safer than an open-ended agent; see planning and termination for why a bounded loop converges where an unconstrained one wanders. For the practitioner's view of how to structure the prompt file, track parallel worktrees, and wire the exit condition, see code agents.

Backpressure: tests, types, lint

Tests, strict types, and lint rules are the gates that stop §3's failure modes before they reach review. An agent cannot decide a failing test is "wrong" and delete it if the loop exits non-zero on a deleted test; cannot append unrelated code if the linter blocks unused imports; cannot lie about a return type if the type-checker refuses to compile. The feedback is structural, not advisory — the agent cannot argue with a non-zero exit code.
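The "exits non-zero on a deleted test" gate can be sketched as a comparison of test identifiers before and after the agent's run. How the identifiers are collected is left open here; the function name and the ID format are assumptions for illustration.

```python
def backpressure_exit_code(tests_before: set[str],
                           tests_after: set[str],
                           suite_exit_code: int) -> int:
    """Return non-zero if the suite failed or a previously known test vanished."""
    missing = tests_before - tests_after
    if missing:
        for test_id in sorted(missing):
            print(f"deleted test: {test_id}")
        return 1  # a green suite does not count if tests were removed to get there
    return suite_exit_code


if __name__ == "__main__":
    before = {"test_cart.py::test_total", "test_cart.py::test_discount"}
    after = {"test_cart.py::test_total"}
    # The suite itself passed (0), but a test disappeared, so the gate fails.
    print(backpressure_exit_code(before, after, suite_exit_code=0))
```

A check this strict also flags legitimate renames, so in practice a rename would need an explicit human sign-off rather than an automatic pass.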

Without backpressure, "all tests pass" is a free variable the agent will optimize — it will reach green by whatever path is cheapest, including deleting the test. With backpressure, the same phrase is load-bearing information: green means the gates passed, not that the agent chose to stop. See guardrails 101 for the broader pattern of constraining agent action through environment structure, and evals 101 for how to make the test suite itself trustworthy in the first place — a test suite full of vacuous assertions is backpressure in name only.

Before turning on the AFK pipeline, confirm: (a) test coverage of every surface area the pipeline will touch, (b) strict types or an equivalent static check that fails the build on violations, (c) lint rules configured to fail the build — not emit warnings. Without all three, the loop is open-loop control of code generation: no signal, no correction. The pipeline is only as strong as the weakest gate.
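The three-gate checklist amounts to running each check and stopping at the first non-zero exit code. The sketch below uses placeholder commands built from `sys.executable` so it runs anywhere; in a real project each entry would be the team's own test, type-check, and lint command, which are not specified here.

```python
import subprocess
import sys


def run_gates(gates: list[tuple[str, list[str]]]) -> tuple[bool, str]:
    """Run gates in order; return (True, "") or (False, name_of_failed_gate)."""
    for name, command in gates:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return False, name  # any non-zero exit blocks the loop
    return True, ""


if __name__ == "__main__":
    # Placeholder commands (assumptions): each stands in for a real tool.
    gates = [
        ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
        ("types", [sys.executable, "-c", "raise SystemExit(0)"]),
        ("lint", [sys.executable, "-c", "raise SystemExit(1)"]),
    ]
    print(run_gates(gates))
```

Because the result depends only on exit codes, a lint tool configured to emit warnings but exit zero would pass silently, which is why item (c) insists on failing the build.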

Parallelism: git worktrees + agentic QA

Git worktrees give each agent an isolated checkout that shares one repository — multiple agents can work on multiple slices simultaneously without trampling each other's working tree, lock files, or in-progress builds. Per-agent log files surface what each one is doing without crowding a single terminal into unreadable noise. For the broader design space — when to run agents in parallel, how to partition work, and what topologies prevent interference — see multi-agent topologies.
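As a rough sketch of the setup, the helper below builds one `git worktree add` command and one log path per slice without executing anything. The directory layout, the `slice/` branch prefix, and the `main` base branch are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath


def worktree_plan(slices: list[str],
                  root: str = "../worktrees",
                  base: str = "main") -> list[dict]:
    """Build one isolated checkout command and one log path per slice."""
    plan = []
    for slug in slices:
        path = PurePosixPath(root) / slug
        plan.append({
            "slice": slug,
            # git worktree add -b <new-branch> <path> <commit-ish>
            "command": ["git", "worktree", "add", "-b", f"slice/{slug}", str(path), base],
            # One log file per agent keeps parallel output readable.
            "log": f"{path}.log",
        })
    return plan


if __name__ == "__main__":
    for entry in worktree_plan(["guest-checkout", "order-history"]):
        print(" ".join(entry["command"]), "->", entry["log"])
```

Each command could then be passed to `subprocess.run` from inside the repository, with the agent for that slice started in the resulting directory.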

Agentic QA closes the gap that backpressure (§7) does not. A unit test green-lights a function; it does not green-light the feature as a user experiences it. Browser-driving agents (computer-use category) exercise the actual user workflow through accessibility snapshots rather than brittle CSS selectors, which means they survive renamed classes and style refactors that would shatter selector-based suites. See computer-use for how to wire this up. The gap between "unit tests pass" and "the button does the right thing when clicked" closes here, not in §7.

The pipeline is harness-agnostic; choice of tooling is a matter of taste and existing team conventions. See multi-agent in the field guide for orchestration patterns, and the coding-agent comparison post for a working comparison of Claude Code, Codex CLI, Cursor, and Aider across the dimensions that matter most for a sustained pipeline workload.

The human stays the bottleneck

Parallelism is bounded by review capacity, not agent capacity. Stack ten parallel agents in front of one reviewer and the queue grows faster than it drains; ship rate is capped by the slowest gate — which is, by design, human review in phase 6. The temptation to scale parallel agents past review capacity is the most common shape of this failure: work accumulates faster than it can be validated, trust in the queue degrades, and the human either rubber-stamps or becomes the single blocking constraint for everything.
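The queue argument is simple arithmetic: the backlog grows by the difference between what agents produce and what the reviewer can clear. The figures in the example are made up for illustration, and the model assumes constant rates.

```python
def review_backlog(agents: int,
                   slices_per_agent_per_day: float,
                   reviews_per_day: float,
                   days: int) -> float:
    """Unreviewed slices after `days`, assuming constant rates and an empty start."""
    arrival_rate = agents * slices_per_agent_per_day
    growth_per_day = max(0.0, arrival_rate - reviews_per_day)
    return growth_per_day * days


if __name__ == "__main__":
    # Illustrative numbers only: 10 agents, 2 slices each per day, 6 reviews per day.
    print(review_backlog(10, 2, 6, days=5))
    # With 3 agents the reviewer keeps up and nothing accumulates.
    print(review_backlog(3, 2, 6, days=5))
```

In the first case 20 slices arrive per day against 6 reviews, so the queue grows by 14 per day; adding agents only steepens that slope.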

Simon Willison's standing caveat is worth repeating: AI tooling intensifies work rather than reducing it. The pipeline does not buy time; it shifts where time goes. Spec interviews and code reviews become the long, load-bearing phases; typing becomes negligible. The practical consequence is compulsive task-stacking: the pipeline makes it easy to launch more work than any one person can thoughtfully review, which is a real burnout path and a real quality risk. See autonomy levels for a framework of where humans must stay in the loop and why removing them at any particular level has predictable failure modes.

When not to reach for AFK at all: the right first question is the one covered in when to use an agent. If the work is small (§2 ticket-sizing rule: 1–3 story points), or the spec is genuinely ambiguous (no PRD means no clean slices), or the codebase lacks the backpressure of §7, the pipeline adds friction rather than leverage. The pipeline is a power tool: reach for it when the ticket size justifies it and the safety infrastructure is already in place, not as the default response to any coding task.

FAQ

What ticket size is too small for the AFK pipeline?

1–3 story points. The setup tax of slicing, parallel agents, and review queueing dwarfs the saving on a small task. One prompt, one session, ship it.

Do I need four agents running in parallel?

No. Start with one slice plus Ralph plus agentic QA on a single branch; add parallelism only when agent throughput, not your review capacity (§9), is genuinely the bottleneck.

What if I do not have eval coverage yet?

Build the tests first, then turn the pipeline on. Without backpressure (§7) the agent's "green" is uninformative; the pipeline amplifies that problem rather than fixing it.

Is this only for greenfield code?

No, but legacy code raises the cost of vertical slicing — finding clean end-to-end strips through a tangled codebase eats spec time. Budget for the spec phase; the rest of the pipeline runs the same.

How is this different from running an agent overnight?

Fresh contexts per iteration, backpressure at every gate, and agentic QA — the differences are at the joints, not the duration. Judgment stays at the ends; execution scales in the middle.