Give the same agent a Python snippet to run — fit a regression, plot it, summarize the result — and the four runtimes that can host it look identical on the surface: spin up a sandbox, send the code, stream stdout. Underneath, the question they answer differently is the one that bites in production: who owns the sandbox lifecycle. E2B hands your app raw Firecracker microVMs as an SDK primitive. Modal pools warm serverless containers and bills you by the call. Daytona treats the sandbox as a developer workspace that outlives any single run. Anthropic Code Execution buries the container inside the model loop so deeply that "starting a sandbox" is just another tool call on the assistant message. Pick the wrong owner and you are either operating an infrastructure layer you did not want to operate, or letting the platform decide things you needed control over.
At a glance
Four runtimes, four different homes for the container the agent's code runs in — and four different answers to who creates it, how long it lives, and what happens when the run ends.
| Runtime | Released / maintainer | Primary niche | Sandbox lifetime |
|---|---|---|---|
| E2B | 2023, E2B (open source) | Firecracker microVMs as an SDK primitive | Seconds to ~24h, app-controlled |
| Modal | 2022, Modal Labs | Serverless functions as per-call containers | Per call, warm pool reused |
| Daytona | 2024, Daytona Platforms | Dev-environment-as-code, IDE-shaped workspaces | Hours to days, IDE session |
| Anthropic Code Execution | May 2025, Anthropic | Server-side container tool inside the model loop | Per conversation, API-owned |
Snapshot: 2026-06-01. Sandbox runtimes change quickly; verify against current docs.
E2B — deep dive
Isolation model: Firecracker microVMs
E2B's sandboxes are Firecracker microVMs — the same KVM-based, minimal-attack-surface VM technology that AWS Lambda runs on. Each sandbox boots a fresh Linux kernel in roughly 200ms, with its own kernel, init, and filesystem. That is a meaningfully stronger isolation primitive than a shared-kernel container: a sandboxed process cannot exploit a Linux kernel bug to break out into a peer sandbox, because there is no shared kernel to exploit. The agent gets a real /, real /proc, and a real network stack, all behind a hypervisor boundary. That matters when the code being run came from a model that an attacker may have prompt-injected.
Sandbox lifecycle: app-owned
The lifecycle is entirely in your hands. e2b.Sandbox.create() spins up a microVM (default timeout 5 minutes, configurable up to ~24 hours); sandbox.kill() tears it down; sandbox.set_timeout() extends it mid-run; Sandbox.connect(sandbox_id) reattaches across processes so a long-running agent can survive its own restarts. You can keep one sandbox per conversation, pool them across users, or burn one per tool call — E2B does not assume a shape. Templates (preinstalled images) make boots cheap, so the "always create fresh" pattern is workable rather than wasteful. Persistent state lives inside the sandbox's filesystem for its lifetime; if you need it to outlive the VM, you upload, download, or mount it yourself.
Integration shape: SDK primitive
The integration is a library, not a service abstraction. You import the SDK (Python or JS), get a Sandbox object, and call methods on it — run_code, commands.run, files.write, files.read. The agent framework you are using (LangGraph, OpenAI Agents SDK, your own loop) wraps those calls as tools. E2B does not push a particular agent shape on you; it is the layer below your agent loop, just like a database client is. The repo is open source (Apache-2.0), and you can self-host the control plane if you must keep everything inside your own boundary.
Modal — deep dive
Isolation model: gVisor-style containers
Modal runs your code in containers rather than VMs, but with a userspace kernel (gVisor / runsc) sitting between the workload and the host kernel. That gives stronger isolation than a vanilla runc container — the syscall surface visible to the workload is intercepted by the gVisor sentry instead of going straight to the host — at a noticeable per-syscall cost. The trade is the inverse of E2B's: containers cold-start faster and pool warmer, but the isolation primitive is a userspace kernel, not a hypervisor. For most agent workloads (running model-generated Python on user data), that is a sensible point on the curve.
Sandbox lifecycle: platform-owned
Modal owns the lifecycle. A @app.function() is a serverless function: each call provisions a container, runs the function, returns the result, and the container goes back to a warm pool to wait for the next call. Scaling, queueing, and resource limits are the platform's job, not yours. The Sandbox primitive — the one agents actually use for arbitrary code — is a sibling abstraction: modal.Sandbox.create() opens a long-lived container the agent can shell into and execute against, with a configurable timeout and the ability to attach more processes mid-run. You still write a Modal app and deploy it; the platform decides which physical host the container lands on and when warm pools recycle.
Integration shape: serverless function + Sandbox
Modal is a compute platform first, sandbox-for-agents second. The headline ergonomics — define a function in Python, decorate it, run it as cloud compute with GPUs available on demand — were built for ML inference and batch jobs; the Sandbox primitive landed later for agent use cases. Practically that means Modal is the strongest of the four for "agent that needs to fine-tune a model, then plot the result" workflows where heavy GPU compute and code-execution-as-tool overlap. It is the weakest of the four if all you want is a one-line "give me a sandbox" SDK call without learning Modal's app / function model.
Daytona — deep dive
Isolation model: container/VM workspace
Daytona's sandbox is a workspace: a container (or, on some deployments, a full VM) provisioned from a workspace definition — devcontainer.json, a Dockerfile, or a Daytona-specific config — with the project's source already checked out and the dev toolchain already installed. The isolation primitive is plain container isolation by default; the security story is closer to "this is your developer environment, not untrusted code" than to E2B's hypervisor boundary. If your agent is doing operations on code your team owns (refactor this repo, run these tests), that matches the threat model; if the agent is running adversarial code from end users, it does not.
Sandbox lifecycle: workspace-shaped, longer-lived
The lifecycle assumption is the inverse of Modal's: workspaces are long-lived, measured in hours and days rather than per-call. A workspace boots once, holds the project state on its filesystem across many tool calls and IDE sessions, and only goes away when explicitly stopped or auto-suspended. That matches the original product — "stop fighting your dev environment, get a reproducible one in 30 seconds" — and it matches a coding agent that runs many edits and tests against the same checkout. The same workspace can be attached by a human developer over SSH/JetBrains/VS Code at the same time, so the agent and the human share state by design.
Integration shape: dev-environment API
Daytona ships as a platform + CLI + SDK. The Daytona platform is open source (AGPL); you can run it on your own Kubernetes or use the managed cloud. The agent integration is an SDK that creates workspaces, runs commands in them, reads and writes files, and tears them down — a thin layer over the same primitives the IDE uses. The shape is right for coding agents and developer-environment-style tasks; it is heavier than necessary for "give me a Python REPL for the next 30 seconds to run a model's plot."
Anthropic Code Execution — deep dive
Isolation model: server-side sandboxed container
Anthropic's Code Execution tool runs your model-generated Python in a sandboxed container on Anthropic infrastructure. The isolation primitive is not documented as deeply as Firecracker or gVisor — the official description is "secure, sandboxed environment" — but the relevant property is that the container is not yours: you do not provision it, you do not see the host, and you do not pick the region beyond the API endpoint. Network egress is restricted to a documented allowlist (PyPI for installs, a few approved endpoints); inbound network is not available. The agent gets a Jupyter-like Python environment with the scientific stack preinstalled.
Sandbox lifecycle: API-owned, conversation-scoped
The lifecycle is owned entirely by the API. You enable the tool by passing {"type": "code_execution_20250522", "name": "code_execution"} in the tools array of a Messages request; the model then decides when to call it, and Anthropic provisions, reuses, and tears down the container. Crucially, the container persists across turns within the same conversation for up to one hour and ~1 GB of state, so variables and files survive multiple model turns the way a Jupyter kernel does. You cannot extend it, reattach to it from a different conversation, or migrate the state out except by reading files back via the Files API. The container is gone when the conversation ages out.
Integration shape: model-loop tool
The integration is zero-SDK: this is a server-side tool baked into the model loop, exactly like web search or the computer-use tool. You make a Messages API call with the tool declared and your prompt; the response comes back with the model's plan, the code it ran, the stdout it saw, and the final answer interleaved as content blocks. No sandbox object lives in your code. The trade-off is symmetrical to Claude Managed Agents elsewhere in the ecosystem: you give up control of the runtime (region, kernel, package set beyond defaults, network policy) and gain a runtime you do not operate.
Cross-cutting comparison
Isolation primitive
This is the axis where the four sit furthest apart, and it is the axis adversarial code-execution scenarios live or die on. E2B is the strongest: Firecracker is a real hypervisor boundary, with no shared kernel between sandboxes — the same primitive AWS chose for Lambda and Fargate. Modal sits a step closer to the host: gVisor's userspace kernel intercepts syscalls, which is meaningfully harder to escape than vanilla containers but is still a software boundary in the host's kernel address space. Daytona is the most relaxed by default — standard container isolation suitable for trusted developer code, not for adversarial user-submitted code — though you can harden it with your own Kubernetes network policies. Anthropic Code Execution is the least transparent: the docs promise a "secure, sandboxed environment" but do not name the primitive, so you are trusting Anthropic's operational story rather than auditing the boundary. Choose the leftmost primitive that still meets your latency and ergonomic budget.
Lifecycle ownership
The lifecycle question is "who decides when this container is born and dies, and who is on the hook when it leaks." E2B puts that decision in your application code: you call create, you call kill, you reattach across processes, and a forgotten kill() is a bill you pay. Modal hands it to the platform: warm pools, per-call provisioning, autoscaling, and the leak risk is its problem, not yours. Daytona pushes the lifetime out to the workspace, meaning a sandbox is something more like a dev environment that an agent and a human both share for the duration of a project — a shape that is a great fit for coding agents and a terrible fit for one-shot model tool calls. Anthropic Code Execution collapses the question entirely: the container lives for a conversation, then is gone, and you do not get to argue with that. The right pick depends on whether your agent's natural unit of work is a tool call (Modal, Anthropic), a session (E2B), or a project (Daytona).
What the agent actually sees
From the agent's point of view — what filesystem, what network, what package manager — the four are very different rooms. E2B and Modal give the agent a full Linux environment with whatever you put in the template image: arbitrary apt install, arbitrary outbound network, GPUs on Modal. Daytona goes further on toolchain: the project source is already mounted, the language runtime is installed per the workspace definition, git and a shell are immediate. Anthropic Code Execution is the most curated: a fixed Python 3 environment with the scientific stack preinstalled (numpy, pandas, scipy, matplotlib, scikit-learn), pip install permitted but restricted to PyPI, no inbound network, outbound limited to a documented allowlist. The implication is direct: if the agent needs to curl a random API or run a TypeScript script, three of these can and Anthropic cannot. If the agent only ever needs to run model-suggested data-analysis Python, the curated environment is the safer default precisely because there is less surface to attack.
When to pick which
| Use case | Pick E2B if… | Pick Modal if… | Pick Daytona if… | Pick Anthropic Code Execution if… |
|---|---|---|---|---|
| Running untrusted model code | You want the strongest isolation primitive on the list — a real hypervisor boundary, per-sandbox. | gVisor isolation is enough for your threat model and you value the warm-pool latency story. | Not the natural fit — workspace defaults assume trusted developer code. | You want Anthropic to own the sandbox boundary, and a curated Python environment is enough. |
| Coding agent on a real repo | Workable but heavier than needed — you will rebuild what a workspace already gives you. | Workable for short-lived test runs; awkward for the long-lived dev-environment shape. | You want a checkout + toolchain + IDE-shaped workspace the agent can shell into for hours. | No — there is no project checkout, no shell, and the container vanishes per conversation. |
| One-shot Python in a chat | Workable — create, run, kill — but you pay for boot + integration code each time. | You like the warm-pool latency for repeated short calls inside your own infra. | Overkill — workspace boot is slower than a per-call sandbox. | You want it built into the model loop with zero infrastructure code on your side. |
| GPU-heavy compute alongside execution | Possible with a custom template, but not the strongest path. | Strongest fit — Modal's GPU story is its origin, and the Sandbox sits alongside it. | Possible but workspace-shaped; not a serverless GPU primitive. | No — no GPU access, no custom compute beyond the curated Python image. |
| Self-host inside your boundary | Yes — open source, you can run the control plane on your own infra. | No — Modal is a hosted platform. | Yes — Daytona is open source (AGPL) and runs on your Kubernetes. | No — by design, the container lives on Anthropic infrastructure. |
FAQ
What's the difference between E2B and Anthropic Code Execution?
They sit at opposite ends of the lifecycle-ownership spectrum. E2B is an SDK: your app calls e2b.Sandbox.create() to spin up a Firecracker microVM, drives it with method calls, and explicitly kills it — every lifecycle decision is in your code, and you can self-host the control plane. Anthropic Code Execution is a server-side tool inside the Messages API: declare it in the tools array and the model decides when to run code in a container that Anthropic provisions, persists for one conversation, and tears down. Reach for E2B when you want a strong isolation primitive on infrastructure you control; reach for Anthropic Code Execution when you want zero infrastructure code and a curated Python environment built into the model loop.
Is Modal a sandbox runtime or a serverless platform?
Both, but it started as the latter. Modal's original ergonomics — decorate a Python function, run it as serverless cloud compute with GPUs available on demand — were built for ML inference and batch jobs. The Sandbox primitive, which is the one agents actually use for arbitrary model-generated code, is a sibling abstraction added later. That matters when picking: Modal is the strongest of the four if your agent needs to fine-tune a model and then run code against the result; it is heavier than the others if all you want is a one-line "give me a Python REPL" SDK.
Can Daytona run untrusted code from end users?
By default, treat it as no. Daytona's threat model is the developer experience — give the agent and the developer a reproducible workspace with the project checked out and the toolchain installed — and the isolation primitive is standard container isolation. That is fine for code your team owns or that an agent generates against your own repo, but it is not the right boundary for adversarial user-submitted code. If you do need to run untrusted code through Daytona, lean on Kubernetes-level network policies, per-tenant clusters, or a stronger runtime like gVisor underneath, and audit the configuration carefully.
How long does the Anthropic Code Execution container persist?
Up to one hour and ~1 GB of state, scoped to a single conversation. Within those bounds the container behaves like a long-lived Jupyter kernel: variables, files, and imports survive across model turns, so the model can compute a dataframe in one turn and plot it in the next. The container is gone when the conversation ages out, and you cannot extend it or reattach from a different conversation — if you need long-lived per-user state, you must read files back via the Files API and re-upload them on the next conversation.
Why would I pick a microVM over a container?
Stronger isolation per sandbox, at the cost of slightly higher cold-start. A Firecracker microVM has its own kernel, init, and address space behind a hypervisor — a sandboxed process cannot exploit a Linux kernel bug to break out into a peer sandbox, because there is no shared kernel to exploit. A standard container shares the host kernel; even a gVisor-protected container is a software boundary in the host kernel's address space. For agent workloads where the code came from a model an attacker may have prompt-injected, the hypervisor boundary is the conservative choice. For trusted code, the container is faster and cheaper.
When should I write my own sandbox instead of using one of these?
Almost never, and the smaller the team the more never it is. The hard parts of a code-execution runtime — strong isolation, package install, file IO, stdout streaming, idle eviction, billing, multi-tenant security — are the entire product of these four projects, and they have absorbed years of failure modes you would otherwise rediscover. The realistic write-your-own threshold is "we have a regulatory boundary that none of these meet" or "our existing platform already gives us a container primitive we trust." See tools, actions, and environments for why the environment is usually where agents become dangerous — picking the right runtime is most of that defense.
Further reading
On this wiki:
- Tools, actions, and environments — why the environment, not the model, is usually where agents become dangerous, and the threat model these runtimes are answering.
- Tool calling explained — the wire-level shape of a tool call, which is what each of these four runtimes ultimately gets wired into.
- The Agent Loop — the perceive-decide-act cycle that wraps every code-execution sandbox, made explicit so you can see where the runtime sits.
- Agentic risks intro — the risk taxonomy that decides how strong an isolation primitive you actually need.
Project sources:
- E2B docs — Sandbox SDK, Firecracker microVMs, templates, and self-hosting (source at github.com/e2b-dev/E2B).
- Modal docs — apps, functions, the Sandbox primitive, GPUs, and warm pools.
- Daytona docs — workspaces, definitions, and the agent SDK (source at github.com/daytonaio/daytona).
- Anthropic Code Execution docs — the
code_execution_20250522tool, runtime environment, and limits.