LangSmith vs Braintrust vs Helicone vs Arize Phoenix: Four Loops the Eval/Observability Stack Was Built to Close

Ship an agent without measuring it and you are flying with the instrument panel taped over: the first sign that quality has cratered is a user complaint, the first sign that cost has tripled is the invoice. All four of these platforms ship the same three primitives — traces, datasets, evaluators — and their feature lists nearly match. What separates them is invisible on those lists: which feedback loop each one was designed to close. LangSmith closes the LangChain/LangGraph dev loop. Braintrust closes the CI loop with Eval-as-code. Helicone closes the production gateway loop without touching your SDK. Arize Phoenix closes the OpenTelemetry-native monitoring loop and brings the ML-observability drift tradition with it.

At a glance

Four products, four answers to the same question — at which point in the lifecycle does the team actually look at this data and act on it. The table below sets the basics; the matrix that follows shows where each one leans hardest across the axes that actually differ.

Platform	Released / maintainer	Primary niche	OSS vs SaaS
LangSmith	2023, LangChain Inc.	LangChain/LangGraph dev loop — prompts, datasets, online evals	SaaS-first (paid self-host tier)
Braintrust	2023, Braintrust Data Inc.	Evals as a first-class CI artifact, regression diffs	SaaS (enterprise self-host)
Helicone	2023, Helicone Inc.	Gateway-first production observability + cost	Apache-2.0 OSS (hosted available)
Arize Phoenix	2023, Arize AI	OpenTelemetry-native LLM observability + drift	Elastic-2.0 OSS (Phoenix Cloud / Arize AX)

Snapshot: 2026-06-01. These platforms ship frequently; verify against current docs.

[Figure: matrix of where each platform leans hardest. Each platform has exactly one column in solid accent; that column is the loop it was designed to close.]

LangSmith — deep dive

LangSmith centers the dev loop: traces and runs feed datasets, datasets feed evals, the Playground pushes a winning prompt back to the app.

Data model — runs, traces, and prompt-versioned datasets

A LangSmith run is a tree of nested child runs — one node per LLM call, tool call, retriever invocation, or chain step — with inputs, outputs, latency, token usage, cost, and free-form metadata on each node. The same schema describes a unit test, a CI evaluation, and a production request, which is what lets a flagged production trace get promoted into a dataset row without remodelling anything. Datasets are versioned and forkable, so "the golden 200 examples we run prompt changes against" is a first-class object you can pin to a commit.
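
To make the tree shape concrete, here is a minimal sketch of a run tree and of promoting a root run into a dataset row. The `Run` class, its field names, and `to_dataset_row` are hypothetical illustrations built on the standard library; they are not LangSmith's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Run:
    """One node in a trace tree (hypothetical shape, not LangSmith's schema)."""
    name: str
    run_type: str                       # e.g. "llm", "tool", "retriever", "chain"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    latency_ms: float = 0.0
    total_tokens: int = 0
    metadata: dict[str, Any] = field(default_factory=dict)
    children: list["Run"] = field(default_factory=list)

    def tokens_in_subtree(self) -> int:
        """Sum token usage over this node and all of its descendants."""
        return self.total_tokens + sum(c.tokens_in_subtree() for c in self.children)


def to_dataset_row(run: Run) -> dict[str, Any]:
    """Promote a root run into a dataset example: inputs plus a reference output."""
    return {"inputs": run.inputs, "reference_outputs": run.outputs, "source_run": run.name}


trace = Run(
    name="answer_question",
    run_type="chain",
    inputs={"question": "What is OTLP?"},
    outputs={"answer": "The OpenTelemetry wire protocol."},
    children=[
        Run("retrieve_docs", "retriever", {"query": "OTLP"}, {"doc_count": 3}),
        Run("generate", "llm", {"prompt": "Answer using the docs."},
            {"text": "The OpenTelemetry wire protocol."}, total_tokens=412),
    ],
)
row = to_dataset_row(trace)
```

Because a test run, a CI run, and a production request would all share this one shape, the promotion step is a field copy rather than a remodelling exercise.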

The loop it closes — prompt hub → dataset → online eval → prompt hub

LangSmith's centre of gravity is the prompt-iteration loop for teams building on LangChain or LangGraph. You write a prompt in the Prompt Hub, run it against a dataset in the Playground, eyeball per-example diffs, score with built-in or custom evaluators, then promote the winning version into the app. Online evaluators close the back half of the loop: sampled production traces get scored asynchronously, and regressions surface as new dataset rows. The same shape supports a managed LangGraph runtime under the LangSmith Deployment umbrella, so the dev loop and the prod loop share a control plane. The pull is strongest if your app is already framework-shaped — graphs, chains, agents wired through the LangChain stack get instrumented for free.
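
The back half of that loop (sample production traces, score them, turn failures into dataset rows) can be sketched without any SDK. Everything here is an assumption for illustration: the hash-based sampler, the toy `non_empty_answer` evaluator, and the trace dictionary layout are hypothetical, not how LangSmith implements online evaluators.

```python
import hashlib


def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministically pick about `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


def non_empty_answer(output: str) -> float:
    """Toy evaluator: 1.0 if the model returned any text, else 0.0."""
    return 1.0 if output.strip() else 0.0


def score_sampled(traces: list[dict], rate: float, threshold: float = 0.5) -> list[dict]:
    """Score a sample of traces and return the failures as candidate dataset rows."""
    failures = []
    for t in traces:
        if is_sampled(t["id"], rate) and non_empty_answer(t["output"]) < threshold:
            failures.append({"inputs": t["input"], "bad_output": t["output"]})
    return failures


production = [
    {"id": "t-1", "input": "Summarise the ticket", "output": "Customer wants a refund."},
    {"id": "t-2", "input": "Summarise the ticket", "output": "   "},
]
new_rows = score_sampled(production, rate=1.0)
```

Hashing the trace ID rather than calling a random number generator means the same trace is always in or out of the sample, which keeps re-scoring reproducible.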

Integration shape — SDK callbacks plus OTel support

The primary integration is the LangChain tracer callback: if you already use LangChain or LangGraph, instrumentation is one env var. Outside that, the @traceable decorator wraps arbitrary Python or JS functions into LangSmith spans, and the platform also ingests OpenTelemetry data via OTLP, so a team standardising on OTel is not locked out. The trade-off is real though: outside the LangChain world you get traces, but you give up some of the framework-aware UI affordances — agent-step grouping, prompt-version linkage — that are the reason to pick LangSmith in the first place.
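
For readers who have not used a tracing decorator, the sketch below shows the general mechanism of wrapping plain functions so that nested calls become nested spans. It is a standard-library stand-in: `traced_sketch`, the span dictionary layout, and `finished_roots` are hypothetical names, and this is not the LangSmith SDK or its @traceable implementation.

```python
import contextvars
import functools
import time

# The span that is currently open in this execution context, if any.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_roots: list[dict] = []


def traced_sketch(fn):
    """Record each call as a span and attach it to whichever span is active."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = {"name": fn.__name__, "children": [], "error": None}
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["latency_s"] = time.perf_counter() - start
            _current_span.reset(token)
            # Child spans hang off their parent; top-level spans become roots.
            (parent["children"] if parent is not None else finished_roots).append(span)
    return wrapper


@traced_sketch
def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]


@traced_sketch
def answer(question: str) -> str:
    docs = retrieve(question)
    return f"Based on {len(docs)} document(s)."


result = answer("OTLP")
```

A real SDK would also capture inputs and outputs and export the finished tree to a backend; the nesting logic is the part this sketch is meant to show.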

Braintrust — deep dive

Braintrust treats evals as a CI artifact: Eval() in code, run on every PR, diffed against the baseline experiment.

Data model — Eval-as-code and immutable experiments

The unit you author in Braintrust is an Eval(): a function that pairs a dataset, a task (your prompt or chain), and a list of scorers, all written in TypeScript or Python and checked into your repo. A run of that Eval() produces an experiment: an immutable build artifact that stitches every example, model output, scorer score, and aggregate metric into one object you can permalink, compare, and store next to the commit that produced it. Datasets and scorers are first-class, and both are versioned. The autoevals library ships a battery of LLM-judge, classification, and similarity scorers that you import like any other dependency.
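
The dataset, task, and scorers triple is easy to picture as plain code. The following is a standard-library sketch of that shape, not the Braintrust SDK: `run_eval`, `Row`, `Experiment`, `exact_match`, and `capital_task` are all hypothetical names, and the task is a lookup table standing in for a model call.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Row:
    input: str
    output: str
    expected: str
    scores: tuple[tuple[str, float], ...]   # (scorer name, score) pairs


@dataclass(frozen=True)
class Experiment:
    """Immutable record of one eval run (hypothetical shape)."""
    name: str
    rows: tuple[Row, ...]

    def mean_score(self, scorer_name: str) -> float:
        return mean(dict(r.scores)[scorer_name] for r in self.rows)


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(name, data, task, scorers) -> Experiment:
    """Run `task` on every example and score each output with every scorer."""
    rows = []
    for example in data:
        output = task(example["input"])
        scores = tuple((s.__name__, s(output, example["expected"])) for s in scorers)
        rows.append(Row(example["input"], output, example["expected"], scores))
    return Experiment(name, tuple(rows))


def capital_task(question: str) -> str:
    """Stand-in for a prompt or chain; a real task would call a model."""
    answers = {"Capital of France?": "Paris", "Capital of Japan?": "Kyoto"}
    return answers.get(question, "")


golden = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Capital of Japan?", "expected": "Tokyo"},
]
experiment = run_eval("capitals-baseline", golden, capital_task, [exact_match])
```

Freezing the dataclasses mirrors the "immutable build artifact" idea: once a run is recorded, later code can compare it but cannot quietly edit it.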

The loop it closes — every pull request is an eval run

This is the CI loop. braintrust eval ./eval.ts runs in GitHub Actions on every PR, the resulting experiment is diffed against the baseline experiment on main, and the per-example regressions show up as a PR comment with side-by-side outputs. A drop in accuracy on the golden set, a spike in cost-per-example, a latency regression — any of them can fail the check and block the merge. That is a different muscle than browsing a trace dashboard after the fact: it forces a per-PR answer to "did this change help or hurt," exactly the muscle that Evals 101 calls the bare minimum for non-toy LLM work. The principle scales down to RAG too — see Evaluating RAG for what an eval set for retrieval/grounding/answer-quality actually looks like.
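
The gating step described above reduces to comparing per-example scores against a baseline and returning a non-zero exit code on regression. This is a minimal sketch under assumed names and thresholds (`find_regressions`, `gate`, the 0.02 mean-drop budget, and the score dictionaries are all hypothetical); it does not reproduce how Braintrust computes its diffs.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    return sorted(
        ex_id for ex_id, base in baseline.items()
        if ex_id in candidate and candidate[ex_id] < base - tolerance
    )


def gate(baseline: dict[str, float], candidate: dict[str, float],
         max_mean_drop: float = 0.02) -> int:
    """Return a process exit code: 0 passes the check, 1 blocks the merge."""
    shared = [k for k in baseline if k in candidate]
    if not shared:
        return 1  # nothing comparable, so fail closed
    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)
    return 1 if base_mean - cand_mean > max_mean_drop else 0


main_scores = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0}   # baseline experiment on main
pr_scores = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 0.0}     # experiment from the PR branch

regressed = find_regressions(main_scores, pr_scores)
exit_code = gate(main_scores, pr_scores)
```

In a CI job the last step would be `sys.exit(exit_code)`, and the `regressed` list is what a PR comment would expand into side-by-side outputs.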

Integration shape — wrappers and online logging

Instrumentation is opt-in code rather than auto-magic: wrapOpenAI() wraps a client, @traced wraps a function, and spans nest naturally. There is online logging too — production calls stream into Braintrust and can seed new dataset rows, so a thumbs-down in the UI becomes the next regression test on the next PR. But the centre of gravity is local-first: evals run on your laptop and in CI before they run in prod, which is the inverse of a gateway-first product.

Helicone — deep dive

Helicone sits in front of the model provider as a transparent proxy: change base_url, get traces, cost, caching, and replay.

Data model — proxied requests as the unit of observation

The Helicone unit is an HTTP request to the model provider, captured at the gateway. Each row carries the full prompt, the full response, the streaming chunks, latency, token counts, computed cost, model, provider, user/session/request tags, and any custom headers you set. Multi-step agents and chains group via the session ID — a header you set per logical "conversation" or "agent run" — so a multi-tool agent loop shows up as a session tree instead of forty unrelated requests. There is no graph-of-nodes abstraction baked in; the gateway sees what the gateway sees.
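
Since the session ID is just a header, grouping is a key lookup over captured rows. The sketch below assumes the session header is named `Helicone-Session-Id` (check the current Helicone docs) and uses a hypothetical row layout with `headers`, `started_at`, and `model` fields; it is an illustration of the grouping idea, not Helicone's storage model.

```python
from collections import defaultdict


def group_by_session(requests: list[dict]) -> dict[str, list[dict]]:
    """Bucket captured requests by session ID, oldest first within each session."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for req in requests:
        session_id = req["headers"].get("Helicone-Session-Id", "unsessioned")
        sessions[session_id].append(req)
    for rows in sessions.values():
        rows.sort(key=lambda r: r["started_at"])
    return dict(sessions)


captured = [
    {"headers": {"Helicone-Session-Id": "run-42"}, "started_at": 2.0, "model": "model-b"},
    {"headers": {"Helicone-Session-Id": "run-42"}, "started_at": 1.0, "model": "model-a"},
    {"headers": {}, "started_at": 3.0, "model": "model-a"},
]
sessions = group_by_session(captured)
```

Requests sent without the header land in an "unsessioned" bucket here, which is the forty-unrelated-requests view the paragraph warns about.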

The loop it closes — production observability without an SDK migration

Helicone's pitch is the smallest possible incremental cost to get a real dashboard onto a system that is already in production. Point your OpenAI or Anthropic client at oai.helicone.ai/v1 (or the Anthropic equivalent), add a Helicone-Auth header, ship. Now cost-per-user, p95 latency, error rates, prompt-level slow queries, and cache-hit ratio land in a dashboard that day. Caching, retry, rate limit per user/key, PII filtering, and the AI Gateway's cross-provider routing all sit at the same layer — they are gateway features, not observability features, but they share the proxy. The trade-off is the inverse of LangSmith's: you give up framework-awareness (no node spans, no prompt-version linkage) and get back zero migration cost. Replay and experiments let you re-run a captured prompt with a new model or new template, scored by an evaluator, but the gravitational centre is "what is my prod doing right now."
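
As a rough sketch of what "point your client at the proxy" amounts to, the helper below builds the base URL and headers. The `Bearer` prefix on Helicone-Auth and the `Helicone-User-Id` header reflect my understanding of Helicone's documented conventions and should be verified against current docs; `helicone_proxy_settings` and the placeholder key are hypothetical.

```python
from typing import Optional


def helicone_proxy_settings(helicone_api_key: str, user_id: Optional[str] = None) -> dict:
    """Build the base URL and extra headers for routing OpenAI calls via the proxy."""
    headers = {"Helicone-Auth": f"Bearer {helicone_api_key}"}
    if user_id is not None:
        headers["Helicone-User-Id"] = user_id   # lets the dashboard attribute cost per user
    return {"base_url": "https://oai.helicone.ai/v1", "default_headers": headers}


# "HELICONE_KEY_PLACEHOLDER" is a dummy value; read the real key from your secret store.
settings = helicone_proxy_settings("HELICONE_KEY_PLACEHOLDER", user_id="user-123")
```

With the OpenAI Python client, these two values would be passed as the `base_url` and `default_headers` constructor arguments; the application code that makes the calls stays unchanged.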

Integration shape — gateway proxy first, OTel optional

The proxy is the wire-shape, full stop. Async logging via SDK is supported for cases where you cannot reroute traffic, and the project does emit and accept some OTel signals, but the canonical install is "change one URL." For teams whose cost, quality, and latency conversations happen at the level of the platform bill (finance asks "where did the $40k go," ops asks "why did the 99th percentile blow up at 14:00"), Helicone is the lowest-friction way to answer those questions without changing application code.
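
Both of those questions are aggregations over request rows, which is why a gateway log is enough to answer them. The sketch below runs the two aggregations on a hypothetical row layout (`user_id`, `cost_usd`, `latency_ms`) with made-up numbers; a hosted dashboard would compute the same thing server-side.

```python
from collections import defaultdict
from statistics import quantiles


def cost_by_user(rows: list[dict]) -> dict[str, float]:
    """Total spend per user, largest spender first."""
    totals: dict[str, float] = defaultdict(float)
    for r in rows:
        totals[r["user_id"]] += r["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def latency_percentile(rows: list[dict], pct: int) -> float:
    """Return the pct-th latency percentile (pct in 1..99, needs at least two rows)."""
    cuts = quantiles([r["latency_ms"] for r in rows], n=100, method="inclusive")
    return cuts[pct - 1]


rows = [
    {"user_id": "alice", "cost_usd": 0.25, "latency_ms": 100},
    {"user_id": "bob", "cost_usd": 0.125, "latency_ms": 200},
    {"user_id": "alice", "cost_usd": 0.5, "latency_ms": 300},
    {"user_id": "bob", "cost_usd": 0.125, "latency_ms": 400},
    {"user_id": "alice", "cost_usd": 0.25, "latency_ms": 5000},
]
spend = cost_by_user(rows)
p99 = latency_percentile(rows, 99)
```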

Arize Phoenix — deep dive

Phoenix is OTel-native: any framework that emits OpenInference spans lands in the same server, alongside drift and embedding views inherited from Arize's ML-obs lineage.

Data model — OpenInference spans on top of OpenTelemetry

Phoenix does not invent a proprietary trace schema. The data model is OpenInference, an open spec layered on OpenTelemetry that defines span attributes for LLMs, retrievers, embeddings, tool calls, and agent steps. The instrumentors are OTel auto-instrumentors — point an OTel SDK at Phoenix's OTLP endpoint and a LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, or Bedrock call shows up as a structured LLM span without you writing wrappers. The implication is portability: the same instrumented app can ship spans to Phoenix today, to Jaeger or Tempo or Datadog tomorrow, without re-instrumenting. That property does not exist in the other three.
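
Concretely, an OpenInference LLM span is an ordinary OTel span carrying a conventional set of attribute keys. The helper below assembles such an attribute dictionary. The key names follow the OpenInference semantic conventions as I understand them and should be checked against the spec; `llm_span_attributes` and the model name are hypothetical.

```python
def llm_span_attributes(model: str, prompt: str, completion: str,
                        prompt_tokens: int, completion_tokens: int) -> dict:
    """Attribute dict for one LLM call, using OpenInference-style keys."""
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


attrs = llm_span_attributes(
    model="example-model",                # placeholder, not a real model ID
    prompt="What is OTLP?",
    completion="The OpenTelemetry wire protocol.",
    prompt_tokens=12,
    completion_tokens=18,
)
```

Because these are plain key-value attributes on a standard span, any OTel backend can store them; a backend that understands the convention can additionally render them as an LLM call.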

The loop it closes — OTel-native monitoring with ML-obs drift

Phoenix inherits the ML-observability tradition from its parent Arize. That means drift is not an afterthought: embedding-similarity drift on RAG inputs, retrieval-quality drift over time, UMAP clusters of model outputs, and per-cohort regression are first-class views — the kind of monitoring ML teams have done on tabular models for a decade, ported to the LLM era. phoenix.evals is a Python library for running offline evaluators (LLM-judge, hallucination, retrieval quality, custom) over stored traces, and it is comfortable in a notebook. The loop closed is the "monitor in prod, notice drift, run a notebook eval, file a ticket" loop, not the "block the PR" loop. For RAG specifically, the embedding-drift view is the one piece Evaluating RAG describes as hard to bolt on later.
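
One simple way to put a number on embedding drift is the distance between the mean embedding of a baseline window and that of a current window. The sketch below shows that proxy on toy two-dimensional vectors. It is an illustration of the idea under that assumption, not a description of the metric Phoenix computes, and `centroid` and `centroid_drift` are hypothetical names.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def centroid_drift(baseline: list[list[float]], current: list[list[float]]) -> float:
    """Euclidean distance between the mean embeddings of two windows."""
    return math.dist(centroid(baseline), centroid(current))


last_month = [[0.0, 0.0], [2.0, 0.0]]   # toy query embeddings, centroid (1, 0)
this_week = [[3.0, 4.0], [5.0, 4.0]]    # toy query embeddings, centroid (4, 4)
drift = centroid_drift(last_month, this_week)
```

A monitor would track this value over time and alert when it crosses a threshold chosen from historical variation; real embeddings have hundreds of dimensions, but the arithmetic is the same.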

Integration shape — OpenTelemetry, OSS, self-host first

Arize Phoenix is the most portable of the four, though its licence needs a footnote: Elastic License 2.0 is source-available rather than an OSI-approved open-source licence, and it restricts offering the software to others as a managed service. Within those terms it is free to self-host. It runs from pip install arize-phoenix in a notebook and scales to a Docker/Kubernetes deployment, with Phoenix Cloud as a hosted convenience and Arize AX as the paid production tier when you outgrow the OSS server. The OTel-native design is the lock-in story in reverse: your instrumentation is portable by construction, and Phoenix's value lands on top of that portability rather than under it. The trade-off is workflow density: Phoenix gives you the primitives, but it does not push you toward a CI workflow the way Braintrust does or a Prompt Hub workflow the way LangSmith does. You assemble the loop yourself.

Cross-cutting comparison

Instrumentation shape — SDK vs proxy vs OTel

How the data physically arrives differs more than the dashboards on top of it.

Four products, three architectures for how the trace data leaves your process. LangSmith and Braintrust both hand you an SDK, but they sit on opposite ends of "auto vs explicit": LangSmith is framework-coupled (LangChain's tracer callback fires for free; the @traceable decorator covers the rest), while Braintrust is wrapper-coupled (you opt in with wrapOpenAI or @traced, and the win is that an eval is just another function with those wrappers). Helicone takes the opposite path entirely: the proxy sits in front of every request, so instrumentation is a base_url change rather than a code change, and the trade-off is that you only see what crosses the wire. Phoenix moves outside the proprietary-SDK frame altogether by speaking OpenTelemetry/OpenInference, which is the only one of the four where the instrumentation outlives the vendor — re-point the OTel exporter and the same spans go anywhere. If your team standardises on OTel for the rest of the stack, that asymmetry is decisive.

Evaluation model — offline vs online vs CI

Where in the lifecycle "did this get worse?" is supposed to be answered.

All four can run an LLM-judge against a dataset. They diverge on which moment of the lifecycle the eval is meant to live in. Braintrust is the strongest opinion: an eval is code, it runs on every PR, and a regression blocks merge — the answer to "did this get worse" arrives before the change ships. LangSmith straddles dev and prod through the Prompt Hub plus online evaluators that sample production traces and surface regressions after the fact. Helicone anchors its evals to captured production requests — its replay/experiments flow re-runs a real prod prompt with a new template, which is closer to a post-hoc what-if than a gating test. Phoenix runs evaluators as a Python library over OTel traces (comfortable in a notebook, less so as a CI gate) and adds the embedding-drift view as a continuous monitor rather than a discrete test. None of these is wrong; they answer the question at different points, and the right pick depends on whether your bug is "we regressed on a known dataset" (Braintrust), "production looks weird and I want to know why" (Helicone, Phoenix), or "the prompt change I'm about to ship is risky" (LangSmith).

Open-source vs SaaS — and who holds the data

OSS posture also decides who holds your prompts, your outputs, and your eval datasets.

This axis splits cleanly into two pairs. LangSmith and Braintrust are SaaS-first products with paid self-host tiers for enterprises that cannot send prompts to a vendor; the OSS components are SDKs, not the backend that stores your data. Helicone and Phoenix both publish their server source and can be self-hosted at no charge: Helicone's gateway is Apache-2.0 and you can run the full stack on your own boxes; Phoenix is under Elastic License 2.0, which is source-available rather than OSI-approved open source (it restricts offering the software as a managed service), and is designed to run anywhere, from a notebook cell to a Kubernetes deployment. For a regulated workload (health, finance, government) that pair is the natural starting point, with Phoenix's OTel-native design and Helicone's gateway shape covering different halves of the problem. For everyone else the trade is convenience: SaaS-first products move faster on UI polish and online-eval features; self-hostable products move faster on portability and "we own the data plane."

When to pick which

Use case	Pick LangSmith if…	Pick Braintrust if…	Pick Helicone if…	Pick Arize Phoenix if…
Tightening a prompt-iteration loop	You live in LangChain/LangGraph and want a Prompt Hub + Playground + online evals as one product.	You want every prompt change to ship through CI with a regression diff before it merges.	Not the natural fit — Helicone watches what production does, not what dev is about to ship.	You will assemble the loop yourself in a notebook over OTel traces.
Making evals fail PRs	Possible via SDK + CI, but the workflow is not as opinionated.	This is the entire pitch — `Eval()` in TS/Py, regression diff, blocking PR check.	Replay/experiments are post-hoc, not a CI gate.	You can wire `phoenix.evals` into CI, but it is a library, not a workflow.
Getting cost + latency telemetry on prod today	Possible, but instrumentation requires SDK changes.	Possible via wrappers, but the centre of gravity is offline runs.	Yes — point `base_url` at the gateway and the dashboards light up in minutes.	Yes if you already run OTel; the OpenInference instrumentors give you cost + latency out of the box.
Monitoring embedding/RAG drift	Limited — not the niche.	Limited — not the niche.	Limited — gateway-level metrics, not embedding-space metrics.	This is the inherited Arize muscle — UMAP, embedding drift, retrieval drift.
Self-host + own the data plane	Available only on the paid enterprise tier.	Available only on the paid enterprise tier.	Yes — Apache-2.0, run docker-compose anywhere.	Yes — Elastic-2.0, runs in a notebook through to Kubernetes.
Vendor-neutral instrumentation	OTel supported, but the value is LangChain-shaped.	SDK-coupled.	Proxy is portable, but data lives in Helicone-shaped tables.	OpenInference + OTel by construction — re-point the exporter, keep the spans.

FAQ

Do I have to use LangChain to use LangSmith?

No, but you give up most of the LangChain-shaped UI affordances. The platform ingests traces from arbitrary Python or JavaScript code via the @traceable decorator and also accepts OpenTelemetry data over OTLP, so a non-LangChain stack can use LangSmith for traces, datasets, and evaluators. The reason to pick LangSmith over the alternatives, though, is the framework-aware view — agent-step grouping, prompt-version linkage in the Prompt Hub, graph trace visualisations — which are most valuable when your app is already a LangChain or LangGraph graph.

Can I run Braintrust evals locally without paying for the SaaS?

Yes — the SDK and the autoevals scorer library are open source, and braintrust eval ./eval.ts will run your eval suite and print scores from a laptop or CI runner. What you lose without the SaaS backend is the persisted experiment store, the side-by-side regression diff UI, the playground for prompt iteration, and the online logging path that feeds production traces back into your dataset. The pitch is that those persisted artifacts are what make the CI loop close — local-only is fine for the first eval, less fine for the tenth pull request.

Is Helicone really just a proxy? What if I cannot route traffic through a vendor?

The canonical install is the proxy, but it is not the only one. Helicone also offers an async logging SDK that captures requests in your process and streams them to the platform without rerouting traffic, which suits cases where the vendor URL is fixed or compliance forbids the redirect. The full Helicone stack including the gateway is Apache-2.0, so the strongest version of "I cannot send this to a vendor" is to self-host the gateway entirely — same wire shape, your own cluster.

What is the difference between Arize Phoenix and Arize AX?

Phoenix is the open-source LLM-observability project — Elastic-2.0, free, self-hostable, with Phoenix Cloud as a managed hosted version. Arize AX is the paid production tier from the same company: it inherits the OpenInference data model and the embedding-drift heritage, and adds the scale, alerting, role-based access control, and the broader ML-observability surface for non-LLM models that production teams typically buy. The graduation path is intentional — start on OSS Phoenix in dev, migrate to AX when production scale or governance demands it, without re-instrumenting because both speak OpenTelemetry.

Which of these supports OpenTelemetry?

All four claim some support; the depth differs sharply. Arize Phoenix is OTel-native by construction — OpenInference is the data model and the instrumentors are OTel auto-instrumentors, so the same spans can ship to any OTel backend. LangSmith and Braintrust both accept OTLP data, which is enough to use them as the trace sink in an OTel pipeline, but neither was designed OTel-first and their richest features assume their own SDK shape. Helicone is gateway-first; OTel is a secondary path. If "instrumentation must survive a future vendor switch" is on your checklist, Phoenix is the only answer that does not require an asterisk.

Do I need an evals platform at all if I have a few unit tests?

Unit tests cover deterministic code; LLM behaviour is not deterministic. Even a tiny LLM-judge over a hand-curated golden set will catch prompt regressions that unit tests cannot see, and the cost of running one is hours, not weeks — see Evals 101 for the minimum-viable setup. The case for picking a platform over rolling your own grows with the number of evaluators you maintain, the number of people changing prompts, and the need to compare runs side-by-side rather than as scrollback. Below that bar, a CSV of test cases and a Python script is honest work.
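
For scale, the "CSV of test cases and a Python script" option can be this small. The sketch below is self-contained and runs offline: the golden set is inlined as a string, `fake_model` is a stand-in for a real model call, and the substring check is one assumed scoring rule among many.

```python
import csv
import io

GOLDEN_CSV = """question,expected
What is 2 + 2?,4
What colour is the sky on a clear day?,blue
"""


def fake_model(question: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return {"What is 2 + 2?": "4"}.get(question, "I am not sure")


def run_golden_set(csv_text: str, model) -> tuple[int, int, list[str]]:
    """Return (passed, total, failed questions) using a case-insensitive substring match."""
    passed, failures = 0, []
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["expected"].lower() in model(row["question"]).lower():
            passed += 1
        else:
            failures.append(row["question"])
    return passed, len(rows), failures


passed, total, failures = run_golden_set(GOLDEN_CSV, fake_model)
```

Swapping `fake_model` for a real client call and the inline string for a file on disk is the whole migration path; the platform question only starts to matter once the list of failures needs history, diffs, and more than one reader.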