AI Blog

E2B vs Modal vs Daytona vs Anthropic Code Execution: Four Owners of the Agent Sandbox

Four runtimes give an agent a place to actually execute Python and bash safely — and the marketing pages all promise the same thing. The thing that decides which one survives production is who owns the sandbox lifecycle.

By Agentic AI Wiki 24 min read

Give the same agent a Python snippet to run — fit a regression, plot it, summarize the result — and the four runtimes that can host it look identical on the surface: spin up a sandbox, send the code, stream stdout. Underneath, the question they answer differently is the one that bites in production: who owns the sandbox lifecycle. E2B hands your app raw Firecracker microVMs as an SDK primitive. Modal pools warm serverless containers and bills you by the call. Daytona treats the sandbox as a developer workspace that outlives any single run. Anthropic Code Execution buries the container inside the model loop so deeply that "starting a sandbox" is just another tool call on the assistant message. Pick the wrong owner and you are either operating an infrastructure layer you did not want to operate, or letting the platform decide things you needed control over.

At a glance

Four runtimes, four different homes for the container the agent's code runs in — and four different answers to who creates it, how long it lives, and what happens when the run ends.

Runtime Released / maintainer Primary niche Sandbox lifetime
E2B 2023, E2B (open source) Firecracker microVMs as an SDK primitive Seconds to ~24h, app-controlled
Modal 2022, Modal Labs Serverless functions as per-call containers Per call, warm pool reused
Daytona 2024, Daytona Platforms Dev-environment-as-code, IDE-shaped workspaces Hours to days, IDE session
Anthropic Code Execution May 2025, Anthropic Server-side container tool inside the model loop Per conversation, API-owned

Snapshot: 2026-06-01. Sandbox runtimes change quickly; verify against current docs.

Sandbox runtime feature matrix Heatmap comparing E2B, Modal, Daytona, and Anthropic Code Execution across six axes: isolation strength, lifecycle ownership, persistent filesystem, network reach, package install, and managed-vs-self. Strength indicated by fill color from light (weak) through medium to dark accent (strong). Sandbox runtime feature matrix Isolation strength Lifecycle owner Persistent FS Network reach Package install Self-host option E2B microVM (Firecracker) App-owned Within sandbox life Full apt + pip + anything Yes (OSS) Modal gVisor container Platform (warm pool) Volumes opt-in Full Modal Image layer Hosted only Daytona Container (default) Workspace (long-lived) Workspace FS persists Full devcontainer + runtime Yes (OSS, AGPL) Anthropic Code Exec. Server sandbox API-owned (per convo) 1 hr, ~1 GB PyPI + allowlist pip only No Weak Medium Strong
Where each runtime leans hardest. The marketing pages overlap; lifecycle ownership and isolation primitive are where they actually split.

E2B — deep dive

E2B architecture The agent process on the left holds the SDK; it calls create, run_code, and kill on a Firecracker microVM provisioned by E2B. Inside the VM, a runtime daemon hosts the Python kernel, a shell, and a filesystem. The app owns every lifecycle decision. YOUR AGENT APP Agent loop LangGraph / OpenAI SDK / yours E2B SDK Sandbox.create() sandbox.run_code() sandbox.kill() App owns lifecycle create · timeout · reattach forget kill() = your bill Firecracker microVM (hypervisor-isolated) Runtime daemon RPC over the SDK channel code / shell / files Python kernel long-running, stateful imports persist across calls Shell / bash apt install · curl · npm full Linux userspace VM filesystem /home/user · /tmp files.read / files.write Own kernel · own init · own /proc no shared kernel with peers — KVM boundary Template image (bring-your-own) pre-baked deps cut boot to ~200ms self-host control plane (Apache-2.0) SDK RPC stdout · files
E2B exposes a Firecracker microVM as an SDK object: your app creates it, drives it, and disposes of it.

Isolation model: Firecracker microVMs

E2B's sandboxes are Firecracker microVMs — the same KVM-based, minimal-attack-surface VM technology that AWS Lambda runs on. Each sandbox boots a fresh Linux kernel in roughly 200ms, with its own kernel, init, and filesystem. That is a meaningfully stronger isolation primitive than a shared-kernel container: a sandboxed process cannot exploit a Linux kernel bug to break out into a peer sandbox, because there is no shared kernel to exploit. The agent gets a real /, real /proc, and a real network stack, all behind a hypervisor boundary. That matters when the code being run came from a model that an attacker may have prompt-injected.

Sandbox lifecycle: app-owned

The lifecycle is entirely in your hands. e2b.Sandbox.create() spins up a microVM (default timeout 5 minutes, configurable up to ~24 hours); sandbox.kill() tears it down; sandbox.set_timeout() extends it mid-run; Sandbox.connect(sandbox_id) reattaches across processes so a long-running agent can survive its own restarts. You can keep one sandbox per conversation, pool them across users, or burn one per tool call — E2B does not assume a shape. Templates (preinstalled images) make boots cheap, so the "always create fresh" pattern is workable rather than wasteful. Persistent state lives inside the sandbox's filesystem for its lifetime; if you need it to outlive the VM, you upload, download, or mount it yourself.

Integration shape: SDK primitive

The integration is a library, not a service abstraction. You import the SDK (Python or JS), get a Sandbox object, and call methods on it — run_code, commands.run, files.write, files.read. The agent framework you are using (LangGraph, OpenAI Agents SDK, your own loop) wraps those calls as tools. E2B does not push a particular agent shape on you; it is the layer below your agent loop, just like a database client is. The repo is open source (Apache-2.0), and you can self-host the control plane if you must keep everything inside your own boundary.

Modal — deep dive

Modal architecture Your agent deploys a Modal app — a Python file with @app.function decorators and an optional Sandbox primitive. Modal's platform pools warm gVisor-isolated containers; calls hit a warm container if available, scale up otherwise. The platform owns provisioning, scheduling, and warm pool eviction. YOUR AGENT APP Agent loop tool call → modal function / Sandbox Modal app (Python) @app.function(image=...) def run(code): exec(code) modal.Sandbox.create() deploy once · call many Platform owns lifecycle you write code · Modal runs it Modal platform (managed) Scheduler · queue routes the call Image builder layered Modal Image Warm container pool (gVisor-isolated) Container warm Container warm Container scaling up Sandbox primitive long-lived shell + exec GPU pool A10 / A100 / H100 on demand gVisor (runsc) — userspace kernel intercepts syscalls function call result
Modal serves Python functions as containers and exposes a Sandbox primitive for arbitrary agent-driven code execution.

Isolation model: gVisor-style containers

Modal runs your code in containers rather than VMs, but with a userspace kernel (gVisor / runsc) sitting between the workload and the host kernel. That gives stronger isolation than a vanilla runc container — the syscall surface visible to the workload is intercepted by the gVisor sentry instead of going straight to the host — at a noticeable per-syscall cost. The trade is the inverse of E2B's: containers cold-start faster and pool warmer, but the isolation primitive is a userspace kernel, not a hypervisor. For most agent workloads (running model-generated Python on user data), that is a sensible point on the curve.

Sandbox lifecycle: platform-owned

Modal owns the lifecycle. A @app.function() is a serverless function: each call provisions a container, runs the function, returns the result, and the container goes back to a warm pool to wait for the next call. Scaling, queueing, and resource limits are the platform's job, not yours. The Sandbox primitive — the one agents actually use for arbitrary code — is a sibling abstraction: modal.Sandbox.create() opens a long-lived container the agent can shell into and execute against, with a configurable timeout and the ability to attach more processes mid-run. You still write a Modal app and deploy it; the platform decides which physical host the container lands on and when warm pools recycle.

Integration shape: serverless function + Sandbox

Modal is a compute platform first, sandbox-for-agents second. The headline ergonomics — define a function in Python, decorate it, run it as cloud compute with GPUs available on demand — were built for ML inference and batch jobs; the Sandbox primitive landed later for agent use cases. Practically that means Modal is the strongest of the four for "agent that needs to fine-tune a model, then plot the result" workflows where heavy GPU compute and code-execution-as-tool overlap. It is the weakest of the four if all you want is a one-line "give me a sandbox" SDK call without learning Modal's app / function model.

Daytona — deep dive

Daytona architecture A workspace definition is provisioned as a long-lived container or VM with the project source mounted and the language toolchain installed. The same workspace is exposed to a developer over SSH or the IDE, and to an agent over the Daytona SDK; the workspace filesystem persists across many calls and many sessions. DEVELOPER IDE / SSH VS Code · JetBrains · terminal Live editing edit while agent works AGENT Daytona SDK create · exec · files · stop Definition (yaml) devcontainer / Dockerfile Workspace (long-lived, hours to days) Project source already checked out git history available Toolchain installed node · python · go · cargo per workspace definition Shell · tests · build agent runs commands like a developer would Persistent FS state survives many calls survives many sessions Container or VM (workspace-defined) standard container isolation by default — harden with k8s policies Daytona platform (open source, AGPL) self-host on Kubernetes or use managed cloud workspace = sandbox = dev env (shared design) SSH · IDE SDK calls
Daytona provisions IDE-shaped workspaces from a definition, then exposes the same container to both humans and agents.

Isolation model: container/VM workspace

Daytona's sandbox is a workspace: a container (or, on some deployments, a full VM) provisioned from a workspace definition — devcontainer.json, a Dockerfile, or a Daytona-specific config — with the project's source already checked out and the dev toolchain already installed. The isolation primitive is plain container isolation by default; the security story is closer to "this is your developer environment, not untrusted code" than to E2B's hypervisor boundary. If your agent is doing operations on code your team owns (refactor this repo, run these tests), that matches the threat model; if the agent is running adversarial code from end users, it does not.

Sandbox lifecycle: workspace-shaped, longer-lived

The lifecycle assumption is the inverse of Modal's: workspaces are long-lived, measured in hours and days rather than per-call. A workspace boots once, holds the project state on its filesystem across many tool calls and IDE sessions, and only goes away when explicitly stopped or auto-suspended. That matches the original product — "stop fighting your dev environment, get a reproducible one in 30 seconds" — and it matches a coding agent that runs many edits and tests against the same checkout. The same workspace can be attached by a human developer over SSH/JetBrains/VS Code at the same time, so the agent and the human share state by design.

Integration shape: dev-environment API

Daytona ships as a platform + CLI + SDK. The Daytona platform is open source (AGPL); you can run it on your own Kubernetes or use the managed cloud. The agent integration is an SDK that creates workspaces, runs commands in them, reads and writes files, and tears them down — a thin layer over the same primitives the IDE uses. The shape is right for coding agents and developer-environment-style tasks; it is heavier than necessary for "give me a Python REPL for the next 30 seconds to run a model's plot."

Anthropic Code Execution — deep dive

Anthropic Code Execution architecture A Messages API call declares the code_execution tool in the tools array. The model emits code_execution_tool_use blocks; Anthropic's server runs them in a sandboxed container that persists across turns within a single conversation, up to one hour or about a gigabyte of state, and returns stdout, files, and final text in the response. No sandbox object exists in the client. CLIENT (thin) Messages API call tools: [{ type: "code_execution_20250522" }] No SDK · No sandbox obj declare the tool · send prompt parse content blocks back Receives interleaved text · tool_use · tool_result model plan + code + stdout no infrastructure code Anthropic server (API-owned) Model loop Claude decides when to call code_execution Sandboxed container internals not documented opaque to caller Python 3 + scipy stack numpy · pandas · matplotlib pip install (PyPI) Network policy outbound: docs allowlist inbound: none Container persists across turns within one conversation up to ~1 hr · ~1 GB · Jupyter-kernel-like state vanishes when conversation ages out Outputs flow back as response content blocks stdout · stderr · files (via Files API) POST /v1/messages response · SSE
The container lives inside the Messages API: declare the tool, and the model decides when to run code in it.

Isolation model: server-side sandboxed container

Anthropic's Code Execution tool runs your model-generated Python in a sandboxed container on Anthropic infrastructure. The isolation primitive is not documented as deeply as Firecracker or gVisor — the official description is "secure, sandboxed environment" — but the relevant property is that the container is not yours: you do not provision it, you do not see the host, and you do not pick the region beyond the API endpoint. Network egress is restricted to a documented allowlist (PyPI for installs, a few approved endpoints); inbound network is not available. The agent gets a Jupyter-like Python environment with the scientific stack preinstalled.

Sandbox lifecycle: API-owned, conversation-scoped

The lifecycle is owned entirely by the API. You enable the tool by passing {"type": "code_execution_20250522", "name": "code_execution"} in the tools array of a Messages request; the model then decides when to call it, and Anthropic provisions, reuses, and tears down the container. Crucially, the container persists across turns within the same conversation for up to one hour and ~1 GB of state, so variables and files survive multiple model turns the way a Jupyter kernel does. You cannot extend it, reattach to it from a different conversation, or migrate the state out except by reading files back via the Files API. The container is gone when the conversation ages out.

Integration shape: model-loop tool

The integration is zero-SDK: this is a server-side tool baked into the model loop, exactly like web search or the computer-use tool. You make a Messages API call with the tool declared and your prompt; the response comes back with the model's plan, the code it ran, the stdout it saw, and the final answer interleaved as content blocks. No sandbox object lives in your code. The trade-off is symmetrical to Claude Managed Agents elsewhere in the ecosystem: you give up control of the runtime (region, kernel, package set beyond defaults, network policy) and gain a runtime you do not operate.

Cross-cutting comparison

Isolation primitive

Isolation primitive Four-column comparison of the isolation primitive each sandbox runtime relies on. E2B uses Firecracker microVMs — a hypervisor boundary with no shared kernel. Modal uses gVisor-protected containers, a userspace kernel that intercepts syscalls. Daytona uses standard container or VM workspace isolation, configured for trusted developer code by default. Anthropic Code Execution uses an undocumented server-side sandboxed container. Isolation primitive E2B Firecracker microVM (hypervisor boundary, own kernel per sandbox) strongest Modal gVisor (runsc) userspace kernel intercepts syscalls software boundary Daytona Standard container (or VM) workspace trusted dev code by default harden with policies Anthropic Code Execution "Secure, sandboxed" internals undocumented trust the vendor story opaque
From hypervisor boundary to undocumented server sandbox — the four isolation primitives sit on a spectrum.

This is the axis where the four sit furthest apart, and it is the axis adversarial code-execution scenarios live or die on. E2B is the strongest: Firecracker is a real hypervisor boundary, with no shared kernel between sandboxes — the same primitive AWS chose for Lambda and Fargate. Modal sits a step closer to the host: gVisor's userspace kernel intercepts syscalls, which is meaningfully harder to escape than vanilla containers but is still a software boundary in the host's kernel address space. Daytona is the most relaxed by default — standard container isolation suitable for trusted developer code, not for adversarial user-submitted code — though you can harden it with your own Kubernetes network policies. Anthropic Code Execution is the least transparent: the docs promise a "secure, sandboxed environment" but do not name the primitive, so you are trusting Anthropic's operational story rather than auditing the boundary. Choose the leftmost primitive that still meets your latency and ergonomic budget.

Lifecycle ownership

Lifecycle ownership Four-column comparison of who owns the sandbox lifecycle. E2B is app-owned — your application code calls create, set_timeout, kill, and reattach. Modal is platform-owned — warm pools and per-call provisioning are the platform's job. Daytona is workspace-owned with long-lived sessions measured in hours and days. Anthropic Code Execution is API-owned — the container exists for the duration of a single conversation, then is gone. Lifecycle ownership E2B App-owned create() · kill() set_timeout · reattach forget kill = your bill Modal Platform-owned warm pool · autoscale per-call provisioning leaks are platform's problem Daytona Workspace-owned long-lived (hours, days) survives across sessions shaped like a dev env Anthropic Code Execution API-owned per conversation ~1 hr · ~1 GB zero infra code
Who creates the sandbox, when it dies, and where the state goes — four different owners.

The lifecycle question is "who decides when this container is born and dies, and who is on the hook when it leaks." E2B puts that decision in your application code: you call create, you call kill, you reattach across processes, and a forgotten kill() is a bill you pay. Modal hands it to the platform: warm pools, per-call provisioning, autoscaling, and the leak risk is its problem, not yours. Daytona pushes the lifetime out to the workspace, meaning a sandbox is something more like a dev environment that an agent and a human both share for the duration of a project — a shape that is a great fit for coding agents and a terrible fit for one-shot model tool calls. Anthropic Code Execution collapses the question entirely: the container lives for a conversation, then is gone, and you do not get to argue with that. The right pick depends on whether your agent's natural unit of work is a tool call (Modal, Anthropic), a session (E2B), or a project (Daytona).

What the agent actually sees

What the agent actually sees Four-column comparison of the runtime surface presented to the agent. E2B and Modal expose a full Linux environment with arbitrary network and package install. Daytona exposes a developer workspace with the project source mounted and the language toolchain installed. Anthropic Code Execution exposes a curated Jupyter-like Python environment with the scientific stack preinstalled, pip restricted to PyPI, and outbound network limited to an allowlist. What the agent actually sees E2B Full Linux VM arbitrary apt + pip arbitrary network any language Modal Full container Modal Image layers arbitrary network + GPU any language Daytona Project + toolchain git, shell, tests ready arbitrary network IDE-shaped Anthropic Code Execution Python 3 only scipy stack preinstalled pip → PyPI · no inbound curated · narrow
The surface the model sees: full Linux VM, full container, IDE workspace, or a curated Python environment.

From the agent's point of view — what filesystem, what network, what package manager — the four are very different rooms. E2B and Modal give the agent a full Linux environment with whatever you put in the template image: arbitrary apt install, arbitrary outbound network, GPUs on Modal. Daytona goes further on toolchain: the project source is already mounted, the language runtime is installed per the workspace definition, git and a shell are immediate. Anthropic Code Execution is the most curated: a fixed Python 3 environment with the scientific stack preinstalled (numpy, pandas, scipy, matplotlib, scikit-learn), pip install permitted but restricted to PyPI, no inbound network, outbound limited to a documented allowlist. The implication is direct: if the agent needs to curl a random API or run a TypeScript script, three of these can and Anthropic cannot. If the agent only ever needs to run model-suggested data-analysis Python, the curated environment is the safer default precisely because there is less surface to attack.

When to pick which

Use case Pick E2B if… Pick Modal if… Pick Daytona if… Pick Anthropic Code Execution if…
Running untrusted model code You want the strongest isolation primitive on the list — a real hypervisor boundary, per-sandbox. gVisor isolation is enough for your threat model and you value the warm-pool latency story. Not the natural fit — workspace defaults assume trusted developer code. You want Anthropic to own the sandbox boundary, and a curated Python environment is enough.
Coding agent on a real repo Workable but heavier than needed — you will rebuild what a workspace already gives you. Workable for short-lived test runs; awkward for the long-lived dev-environment shape. You want a checkout + toolchain + IDE-shaped workspace the agent can shell into for hours. No — there is no project checkout, no shell, and the container vanishes per conversation.
One-shot Python in a chat Workable — create, run, kill — but you pay for boot + integration code each time. You like the warm-pool latency for repeated short calls inside your own infra. Overkill — workspace boot is slower than a per-call sandbox. You want it built into the model loop with zero infrastructure code on your side.
GPU-heavy compute alongside execution Possible with a custom template, but not the strongest path. Strongest fit — Modal's GPU story is its origin, and the Sandbox sits alongside it. Possible but workspace-shaped; not a serverless GPU primitive. No — no GPU access, no custom compute beyond the curated Python image.
Self-host inside your boundary Yes — open source, you can run the control plane on your own infra. No — Modal is a hosted platform. Yes — Daytona is open source (AGPL) and runs on your Kubernetes. No — by design, the container lives on Anthropic infrastructure.

FAQ

What's the difference between E2B and Anthropic Code Execution?

They sit at opposite ends of the lifecycle-ownership spectrum. E2B is an SDK: your app calls e2b.Sandbox.create() to spin up a Firecracker microVM, drives it with method calls, and explicitly kills it — every lifecycle decision is in your code, and you can self-host the control plane. Anthropic Code Execution is a server-side tool inside the Messages API: declare it in the tools array and the model decides when to run code in a container that Anthropic provisions, persists for one conversation, and tears down. Reach for E2B when you want a strong isolation primitive on infrastructure you control; reach for Anthropic Code Execution when you want zero infrastructure code and a curated Python environment built into the model loop.

Is Modal a sandbox runtime or a serverless platform?

Both, but it started as the latter. Modal's original ergonomics — decorate a Python function, run it as serverless cloud compute with GPUs available on demand — were built for ML inference and batch jobs. The Sandbox primitive, which is the one agents actually use for arbitrary model-generated code, is a sibling abstraction added later. That matters when picking: Modal is the strongest of the four if your agent needs to fine-tune a model and then run code against the result; it is heavier than the others if all you want is a one-line "give me a Python REPL" SDK.

Can Daytona run untrusted code from end users?

By default, treat it as no. Daytona's threat model is the developer experience — give the agent and the developer a reproducible workspace with the project checked out and the toolchain installed — and the isolation primitive is standard container isolation. That is fine for code your team owns or that an agent generates against your own repo, but it is not the right boundary for adversarial user-submitted code. If you do need to run untrusted code through Daytona, lean on Kubernetes-level network policies, per-tenant clusters, or a stronger runtime like gVisor underneath, and audit the configuration carefully.

How long does the Anthropic Code Execution container persist?

Up to one hour and ~1 GB of state, scoped to a single conversation. Within those bounds the container behaves like a long-lived Jupyter kernel: variables, files, and imports survive across model turns, so the model can compute a dataframe in one turn and plot it in the next. The container is gone when the conversation ages out, and you cannot extend it or reattach from a different conversation — if you need long-lived per-user state, you must read files back via the Files API and re-upload them on the next conversation.

Why would I pick a microVM over a container?

Stronger isolation per sandbox, at the cost of slightly higher cold-start. A Firecracker microVM has its own kernel, init, and address space behind a hypervisor — a sandboxed process cannot exploit a Linux kernel bug to break out into a peer sandbox, because there is no shared kernel to exploit. A standard container shares the host kernel; even a gVisor-protected container is a software boundary in the host kernel's address space. For agent workloads where the code came from a model an attacker may have prompt-injected, the hypervisor boundary is the conservative choice. For trusted code, the container is faster and cheaper.

When should I write my own sandbox instead of using one of these?

Almost never, and the smaller the team the more never it is. The hard parts of a code-execution runtime — strong isolation, package install, file IO, stdout streaming, idle eviction, billing, multi-tenant security — are the entire product of these four projects, and they have absorbed years of failure modes you would otherwise rediscover. The realistic write-your-own threshold is "we have a regulatory boundary that none of these meet" or "our existing platform already gives us a container primitive we trust." See tools, actions, and environments for why the environment is usually where agents become dangerous — picking the right runtime is most of that defense.

Further reading

On this wiki:

  • Tools, actions, and environments — why the environment, not the model, is usually where agents become dangerous, and the threat model these runtimes are answering.
  • Tool calling explained — the wire-level shape of a tool call, which is what each of these four runtimes ultimately gets wired into.
  • The Agent Loop — the perceive-decide-act cycle that wraps every code-execution sandbox, made explicit so you can see where the runtime sits.
  • Agentic risks intro — the risk taxonomy that decides how strong an isolation primitive you actually need.

Project sources: