Browser agents.
Driving a real browser as a tool — DOM versus pixel observation, login + auth state, the well-trodden failure modes, and when to step up to a full GUI agent. The browser sits at a useful middle altitude: narrower than full computer-use, broader than raw HTTP. Most of the "agent that does things on the web" use cases live here, and so do most of the regressions.
Why "browser as a tool" is a distinct surface.
A browser agent drives Chrome, Firefox, or a headless equivalent through a programmatic harness (Playwright, Puppeteer, CDP). It is not the same problem as a general computer-use agent: the harness gives you the DOM, navigation events, network logs, and screenshot access in one place. It is also not the same as HTTP-scripting: cookies, JavaScript state, OAuth redirects, and SPA routing all happen for free.
That middle altitude is exactly why browser agents have such a wide productive surface — checkout flows, dashboard scraping, portal navigation, "log in and download my data" jobs. It is also why the failure modes are uniquely sneaky: the page looks like a human-readable page, but the agent is operating on a structure that drifts under it on every release.
DOM vs pixel observation — and the hybrid that wins.
Two ways to know what is on the page:
- Accessibility-tree / DOM observation — semantic, addressable, robust to a redesign that keeps element roles stable. Selectors that target ARIA roles ("button named 'Submit'") survive most cosmetic refreshes; selectors that target generated class names break with every deploy.
- Pixel / screenshot observation — works on any page, including canvas-rendered UIs, iframes the harness cannot reach into, and pages that aggressively obfuscate the DOM. Fragile to layout shifts, DPI, and ad-induced reflows.
Most production browser agents combine the two. The accessibility tree is the primary observation; a screenshot is the verification ("did the click actually navigate, or did I miss?"). The hybrid pattern is the same as in computer-use agents, but the DOM gives you a stable starting point pixel-only agents do not have.
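A minimal sketch of the "act, then verify with a screenshot" half of that hybrid. The function names and the two callables are hypothetical; with Playwright's sync API, `take_screenshot` could plausibly be `page.screenshot`, which returns bytes. Comparing whole-frame hashes is an assumption made for brevity, not a recommendation.

```python
import hashlib
from typing import Callable


def screenshot_changed(before: bytes, after: bytes) -> bool:
    """Cheap verification: did the rendered page change at all?"""
    return hashlib.sha256(before).digest() != hashlib.sha256(after).digest()


def click_with_verification(
    click: Callable[[], None],
    take_screenshot: Callable[[], bytes],
    max_attempts: int = 2,
) -> bool:
    """Run `click`, then confirm via screenshot that something happened.

    Returns True once a click visibly changed the page, False if every
    attempt left the pixels identical (likely a missed click).
    """
    for _ in range(max_attempts):
        before = take_screenshot()
        click()
        after = take_screenshot()
        if screenshot_changed(before, after):
            return True
    return False
```

A whole-frame hash will report "changed" on any page with animation or a ticking clock, so a real agent would more likely compare a cropped region or pair the screenshot with a URL or accessibility-tree check.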
Lean on accessibility-tree selectors over CSS-class selectors wherever you can. Selectors built on visible role + accessible name survive a redesign; selectors built on .btn-primary-3f1a2 do not. A browser agent whose selectors are tied to generated class names is a maintenance bill in waiting.
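As a rough illustration of the difference, the sketch below assumes `page` is a Playwright `Page` from the Python sync API and uses its `get_by_role` locator. The helper names `find_by_role` and `click_submit` are hypothetical, and the "Submit" button is just the example from the text.

```python
from typing import Any


def find_by_role(page: Any, role: str, name: str) -> Any:
    """Locate an element by ARIA role and accessible name.

    `page` is assumed to be a Playwright Page (sync API); `get_by_role`
    is Playwright's role-based locator.
    """
    return page.get_by_role(role, name=name)


def click_submit(page: Any) -> None:
    """Click the button whose accessible name is 'Submit'."""
    # Survives a restyle as long as the button keeps its role and name.
    find_by_role(page, "button", "Submit").click()
    # Brittle alternative, shown only for contrast (generated class name):
    # page.locator(".btn-primary-3f1a2").click()
```

Keeping the lookup behind one small helper also gives a single place to add logging or a fallback strategy when a role-based match fails.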
Auth + session state is half the job.
Most browser-agent tasks worth doing require being logged in. Three modes:
- Persistent profile. The agent uses a long-lived browser profile with stored cookies. Simple, but the session lives outside the agent's per-request scope — if the agent is compromised at any point, every authenticated surface in that profile is reachable.
- Ephemeral session with delegated credentials. The agent receives a short-lived credential (an on-behalf-of, or OBO, token, or a session cookie minted just in time) and runs in a fresh profile that is destroyed at the end of the run. Higher friction, smaller blast radius.
- Real-time human auth. For MFA-gated surfaces, the agent pauses and surfaces the challenge to a human, who completes it. The agent never sees the second factor.
The pattern is the same as in sandboxing & safe execution: scope the credentials, and assume the agent's runtime is the attack surface. A browser agent with admin-grade cookies in a long-lived profile is a credential-theft incident waiting to happen.
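The "fresh profile that is destroyed at the end of the run" part of the ephemeral mode can be sketched with the standard library alone. The name `ephemeral_profile` and the directory prefix are hypothetical; how the short-lived credential gets into the profile is left out.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def ephemeral_profile(prefix: str = "agent-profile-") -> Iterator[Path]:
    """Yield a fresh, empty profile directory and delete it afterwards."""
    profile_dir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield profile_dir
    finally:
        # Remove cookies, local storage, and cache even if the run failed.
        shutil.rmtree(profile_dir, ignore_errors=True)
```

With Playwright, the yielded path could be passed as `user_data_dir` to `launch_persistent_context`, with the browser context closed before the `with` block exits so the directory can be removed cleanly.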
The well-trodden failure modes.
If you ship browser agents, you will hit all of these. Plan for them rather than being surprised by them:
- Infinite-scroll traps. The agent thinks it is paginating; the page is loading the next chunk forever. Bound the scroll, set a "no new content" detector, and stop.
- CAPTCHA walls. The right answer is almost never "have the agent solve it." The right answer is "surface it to a human, or change the integration."
- Iframe and shadow-DOM boundaries. Many harnesses do not traverse them by default. A click that "should work" lands on the parent document and accomplishes nothing.
- Ad and consent pop-ups. They appear unpredictably, steal focus, and look just enough like the real UI to confuse the agent. Treat them as a known pre-step: detect, dismiss, then resume.
- Redirect loops and SPA routing. The URL changes; the DOM does not re-render; the accessibility tree snapshot is stale. Re-observe after navigation, not just on a fixed cadence.
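For the infinite-scroll case in particular, "bound the scroll and detect no new content" might look like the sketch below. `scroll_once` and `count_items` are hypothetical callables the harness would supply (for example, a wheel event and a count of list rows); the default limits are arbitrary.

```python
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    max_scrolls: int = 20,
    patience: int = 2,
) -> int:
    """Scroll until no new items appear or the scroll budget is spent.

    Stops after `patience` consecutive scrolls that add nothing, or after
    `max_scrolls` scrolls in total. Returns the final item count.
    """
    seen = count_items()
    idle = 0
    for _ in range(max_scrolls):
        scroll_once()
        now = count_items()
        if now > seen:
            # New content arrived, so reset the idle counter.
            seen, idle = now, 0
        else:
            idle += 1
            if idle >= patience:
                break
    return seen
```

The hard cap matters as much as the idle check: a feed that really is endless keeps producing new items, and only `max_scrolls` stops the agent there.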
Per-step reliability matters more here than in API-driven work: a 30-step browser flow at 97% per step succeeds only about 40% of the time (0.97^30 ≈ 0.40, assuming independent steps). Idempotent steps, checkpoints, and a "stuck-loop" detector (same DOM hash N turns in a row, then break out) are non-optional.
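Both the compounding arithmetic and the stuck-loop detector are small enough to sketch. The names below are hypothetical, the success-rate function assumes steps fail independently, and hashing the raw DOM string is a simplification (a real page may need volatile attributes stripped first).

```python
import hashlib
from collections import deque


def flow_success_rate(per_step: float, steps: int) -> float:
    """End-to-end success probability, assuming independent steps."""
    return per_step ** steps


class StuckLoopDetector:
    """Flags a run when the same DOM hash is seen `limit` turns in a row."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        # Only the last `limit` hashes are kept.
        self._recent = deque(maxlen=limit)

    def observe(self, dom_html: str) -> bool:
        """Record one turn's DOM; return True if the agent looks stuck."""
        digest = hashlib.sha256(dom_html.encode("utf-8")).hexdigest()
        self._recent.append(digest)
        return len(self._recent) == self.limit and len(set(self._recent)) == 1
```

Under the independence assumption, `flow_success_rate(0.97, 30)` comes out near 0.40. In the agent loop, `observe` would be called once per turn with the serialized DOM, and a `True` result would trigger a checkpoint restore or an escalation rather than another retry.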
When to step up to a full GUI agent.
The browser stops being enough in three situations:
- Native OS dialogs. A "Save As" dialog or a system permission prompt is outside the browser's sandbox; the browser-agent harness cannot reach it. A computer-use agent can.
- Multi-window orchestration across non-browser apps. If the workflow involves a desktop email client, a local PDF viewer, or a Citrix session, the browser is one window among several.
- Defended pages where the DOM is structurally unusable. Heavy obfuscation, canvas-only rendering, or aggressive anti-automation can push you below the DOM altitude entirely.
Even then, prefer the browser for the parts that work in the browser, and reach for full computer-use only for the steps that genuinely require it. The latency and reliability tax of pixel-only operation, covered in the computer-use essay, applies in full; the browser is cheaper precisely because it is narrower.