
LangSmith vs Braintrust vs Helicone vs Arize Phoenix: Four Loops the Eval/Observability Stack Was Built to Close

All four ship traces, datasets, and evaluators — the feature lists nearly match. What separates them is which feedback loop they were built to close: the dev loop, CI, the production gateway, or model-monitoring drift.

By Agentic AI Wiki

Ship an agent without measuring it and you are flying with the instrument panel taped over: the first sign that quality has cratered is a user complaint, the first sign that cost has tripled is the invoice. All four of these platforms ship the same three primitives — traces, datasets, evaluators — and their feature lists nearly match. What separates them is invisible on those lists: which feedback loop each one was designed to close. LangSmith closes the LangChain/LangGraph dev loop. Braintrust closes the CI loop with Eval-as-code. Helicone closes the production gateway loop without touching your SDK. Arize Phoenix closes the OpenTelemetry-native monitoring loop and brings the ML-observability drift tradition with it.

At a glance

Four products, four answers to the same question: at which point in the lifecycle does the team look at this data and act on it? The table below sets out the basics; the matrix that follows shows where each one leans hardest on the axes that actually differ.

Platform | Released / maintainer | Primary niche | OSS vs SaaS
LangSmith | 2023, LangChain Inc. | LangChain/LangGraph dev loop: prompts, datasets, online evals | SaaS-first (paid self-host tier)
Braintrust | 2023, Braintrust Data Inc. | Evals as a first-class CI artifact, regression diffs | SaaS (enterprise self-host)
Helicone | 2023, Helicone Inc. | Gateway-first production observability + cost | Apache-2.0 OSS (hosted available)
Arize Phoenix | 2023, Arize AI | OpenTelemetry-native LLM observability + drift | Elastic-2.0 OSS (Phoenix Cloud / Arize AX)

Snapshot: 2026-06-01. These platforms ship frequently; verify against current docs.

[Figure: Evals + observability feature matrix. Heatmap comparing LangSmith, Braintrust, Helicone, and Arize Phoenix across six axes; fill runs from light (weak) to dark accent (strong). Cell labels from the figure:]
Platform | Dev-loop fit | CI eval workflow | Prod telemetry | Drift / monitoring | OTel-native | OSS self-host
LangSmith | Hero | SDK + UI | Online evals | Limited | Supports | SaaS-first
Braintrust | Playground | Hero | Online logs | Limited | OTLP in | SaaS-first
Helicone | Replay | Experiments | Hero | Cost + latency | Gateway-first | Apache-2
Arize Phoenix | Notebook | Local evals | Self-host | Hero (Arize) | OpenInference | Elastic-2
Where each platform leans hardest. Each one has exactly one column where it sits in solid accent — that column is the loop it was designed to close.

LangSmith — deep dive

[Figure: LangSmith architecture. A LangChain or LangGraph app sends traces and runs to LangSmith, where the dev loop closes: prompts and datasets, offline evals, online evals on production traces, and a Playground that pushes prompt changes back into the app.]
LangSmith centers the dev loop: traces and runs feed datasets, datasets feed evals, the Playground pushes a winning prompt back to the app.

Data model — runs, traces, and prompt-versioned datasets

A LangSmith run is a tree of nested child runs — one node per LLM call, tool call, retriever invocation, or chain step — with inputs, outputs, latency, token usage, cost, and free-form metadata on each node. The same schema describes a unit test, a CI evaluation, and a production request, which is what lets a flagged production trace get promoted into a dataset row without remodelling anything. Datasets are versioned and forkable, so "the golden 200 examples we run prompt changes against" is a first-class object you can pin to a commit.
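The run-tree idea can be sketched in a few lines of plain Python. This is an illustration of the shape described above, not the LangSmith schema: the `Run` class, its field names, and `promote_to_example` are hypothetical names chosen for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Run:
    """One node in a trace tree (hypothetical shape, not the LangSmith schema)."""
    name: str
    run_type: str                      # e.g. "chain", "llm", "tool", "retriever"
    inputs: dict
    outputs: dict
    latency_ms: float = 0.0            # time spent in this node
    total_tokens: int = 0
    cost_usd: float = 0.0
    metadata: dict = field(default_factory=dict)
    children: list[Run] = field(default_factory=list)

    def walk(self):
        """Yield this run and every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def total_cost(self) -> float:
        return sum(run.cost_usd for run in self.walk())

    def total_token_count(self) -> int:
        return sum(run.total_tokens for run in self.walk())


def promote_to_example(root: Run) -> dict:
    """Turn a flagged trace into a dataset row without remodelling it."""
    return {
        "inputs": root.inputs,
        "reference_outputs": root.outputs,
        "source_run": root.name,
    }


trace = Run(
    name="answer_question",
    run_type="chain",
    inputs={"question": "What is OTLP?"},
    outputs={"answer": "The OpenTelemetry wire protocol."},
    children=[
        Run("retrieve", "retriever", {"query": "OTLP"}, {"docs": ["spec"]}, latency_ms=40.0),
        Run("generate", "llm", {"prompt": "..."}, {"text": "..."},
            latency_ms=900.0, total_tokens=412, cost_usd=0.003),
    ],
)
example = promote_to_example(trace)
```

Because a test run, a CI evaluation, and a production request would all share this one shape, promoting a trace to a dataset row is a field copy rather than a migration.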

The loop it closes — prompt hub → dataset → online eval → prompt hub

LangSmith's centre of gravity is the prompt-iteration loop for teams building on LangChain or LangGraph. You write a prompt in the Prompt Hub, run it against a dataset in the Playground, eyeball per-example diffs, score with built-in or custom evaluators, then promote the winning version into the app. Online evaluators close the back half of the loop: sampled production traces get scored asynchronously, and regressions surface as new dataset rows. The same shape supports a managed LangGraph runtime under the LangSmith Deployment umbrella, so the dev loop and the prod loop share a control plane. The pull is strongest if your app is already framework-shaped — graphs, chains, agents wired through the LangChain stack get instrumented for free.
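The back half of that loop, sampling production traces and turning low scores into dataset rows, can be sketched as follows. This is a stdlib-only illustration of the technique, not LangSmith's implementation; `should_score`, `length_evaluator`, and `score_sampled` are hypothetical helpers, and the trace dictionaries are a made-up shape.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")        # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def length_evaluator(trace: dict) -> dict:
    """Toy heuristic evaluator: an answer must be non-empty and not enormous."""
    answer = trace["outputs"].get("answer", "")
    ok = 0 < len(answer) <= 2000
    return {"key": "answer_length_ok", "score": 1.0 if ok else 0.0}


def score_sampled(traces, sample_rate, evaluator):
    """Score the sampled traces and return failures shaped as new dataset rows."""
    regressions = []
    for trace in traces:
        if not should_score(trace["id"], sample_rate):
            continue
        feedback = evaluator(trace)
        if feedback["score"] < 1.0:
            regressions.append({
                "inputs": trace["inputs"],
                "outputs": trace["outputs"],
                "feedback": feedback,
            })
    return regressions


production = [
    {"id": "t-1", "inputs": {"question": "q1"}, "outputs": {"answer": "fine"}},
    {"id": "t-2", "inputs": {"question": "q2"}, "outputs": {"answer": ""}},
]
flagged = score_sampled(production, sample_rate=1.0, evaluator=length_evaluator)
```

Hashing the trace ID, rather than calling a random number generator, means the same trace is always in or out of the sample, which keeps re-runs reproducible.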

Integration shape — SDK callbacks plus OTel support

The primary integration is the LangChain tracer callback: if you already use LangChain or LangGraph, instrumentation is environment configuration (a tracing flag plus an API key) rather than a code change. Outside that, the @traceable decorator in Python (the traceable() wrapper in JS) turns arbitrary functions into LangSmith spans, and the platform also ingests OpenTelemetry data via OTLP, so a team standardising on OTel is not locked out. The trade-off is real, though: outside the LangChain world you get traces, but you give up some of the framework-aware UI affordances, such as agent-step grouping and prompt-version linkage, that are the reason to pick LangSmith in the first place.
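To make the decorator approach concrete, here is a minimal sketch of how a tracing decorator can build nested spans. It shows the general technique only and is not how LangSmith implements @traceable; `traced_sketch`, `finished_roots`, and the span dictionary layout are hypothetical.

```python
import contextvars
import functools
import time

# The span currently being recorded in this execution context, if any.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_roots = []  # stand-in for an exporter that would ship spans somewhere


def traced_sketch(fn):
    """Record each call as a span and attach it to the enclosing span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = {
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "children": [],
        }
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            span["output"] = fn(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["latency_s"] = time.perf_counter() - start
            _current_span.reset(token)
            if parent is None:
                finished_roots.append(span)      # top-level call: export it
            else:
                parent["children"].append(span)  # nested call: attach to parent
    return wrapper


@traced_sketch
def retrieve(query):
    return [f"doc about {query}"]


@traced_sketch
def answer(query):
    docs = retrieve(query)
    return f"{len(docs)} source(s) consulted for: {query}"


answer("OTLP")
```

A decorator like this sees function boundaries but knows nothing about graphs, agents, or prompt versions, which is the gap the framework-aware callback fills.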

Braintrust — deep dive

[Figure: Braintrust architecture. Braintrust treats evals as code: developers write Eval() definitions in TypeScript or Python, CI runs them on every PR, results are diffed against the baseline, and a regression blocks the merge. The CI loop is the hero.]
Braintrust treats evals as a CI artifact: Eval() in code, run on every PR, diffed against the baseline experiment.

Data model — Eval-as-code and immutable experiments

The unit you author in Braintrust is an Eval() — a function that pairs a dataset, a task (your prompt or chain), and a list of scorers, all written in TypeScript or Python and checked into your repo. A run of that Eval() produces an experiment: an immutable build artifact stitching every example, every model output, every scorer score, and aggregate metrics into one object you can permalink, compare, and store next to the commit that produced it. Datasets and scorers are first-class; both are versioned. The autoevals library ships a battery of LLM-judges, classification, and similarity scorers that you import like any other dependency.
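The dataset × task × scorers shape is easy to see in a toy harness. The sketch below is not the Braintrust SDK: `run_eval_sketch` is a hypothetical stand-in for what an Eval() run does, and the task is a lookup table so the example runs without a model.

```python
from statistics import mean


def run_eval_sketch(data, task, scorers):
    """Run `task` on every example, apply every scorer, and aggregate."""
    rows = []
    for example in data:
        output = task(example["input"])
        scores = {
            scorer.__name__: scorer(output=output, expected=example["expected"])
            for scorer in scorers
        }
        rows.append({
            "input": example["input"],
            "expected": example["expected"],
            "output": output,
            "scores": scores,
        })
    summary = {
        scorer.__name__: mean(row["scores"][scorer.__name__] for row in rows)
        for scorer in scorers
    } if rows else {}
    return {"rows": rows, "summary": summary}


def exact_match(output, expected):
    """Scorer: 1.0 when the normalised strings agree, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def capital_task(country):
    """Stand-in for a prompt or chain; a real task would call a model."""
    return {"France": "Paris", "Japan": "Tokyo"}.get(country, "unknown")


data = [
    {"input": "France", "expected": "Paris"},
    {"input": "Japan", "expected": "Tokyo"},
    {"input": "Peru", "expected": "Lima"},
]
experiment = run_eval_sketch(data, capital_task, [exact_match])
```

The returned object plays the role of the experiment: every example, output, and score in one place, plus an aggregate you can compare across runs.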

The loop it closes — every pull request is an eval run

This is the CI loop. braintrust eval ./eval.ts runs in GitHub Actions on every PR, the resulting experiment is diffed against the baseline experiment on main, and the per-example regressions show up as a PR comment with side-by-side outputs. A drop in accuracy on the golden set, a spike in cost-per-example, or a latency regression can each fail the check and block the merge. That is a different discipline from browsing a trace dashboard after the fact: it forces a per-PR answer to "did this change help or hurt," which is the practice Evals 101 calls the bare minimum for non-toy LLM work. The principle scales down to RAG too; see Evaluating RAG for what an eval set for retrieval, grounding, and answer quality actually looks like.
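The gating logic itself is small. The following is a minimal sketch of a baseline-versus-candidate diff, assuming each experiment has been reduced to a mapping of example ID to score; `diff_experiments` and the `max_drop` threshold are hypothetical, not part of any vendor CLI.

```python
def diff_experiments(baseline, candidate, max_drop=0.02):
    """Compare per-example scores; fail when the mean drops by more than max_drop."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared example ids to compare")
    regressions = [
        {"id": ex, "baseline": baseline[ex], "candidate": candidate[ex]}
        for ex in shared
        if candidate[ex] < baseline[ex]
    ]
    base_mean = sum(baseline[ex] for ex in shared) / len(shared)
    cand_mean = sum(candidate[ex] for ex in shared) / len(shared)
    passed = (base_mean - cand_mean) <= max_drop
    report = {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "regressions": regressions,
    }
    return passed, report


baseline_scores = {"q1": 1.0, "q2": 1.0, "q3": 0.0, "q4": 1.0}
candidate_scores = {"q1": 1.0, "q2": 0.0, "q3": 1.0, "q4": 0.0}
passed, report = diff_experiments(baseline_scores, candidate_scores)
```

In a CI job the last step would be to exit non-zero when `passed` is false, for example with `sys.exit(0 if passed else 1)`, so that the PR check turns red. Note that the per-example list still surfaces q2 and q4 even in cases where an improvement elsewhere keeps the mean flat.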

Integration shape — wrappers and online logging

Instrumentation is opt-in code rather than auto-magic: wrapOpenAI() wraps a client, @traced wraps a function, and spans nest naturally. There is online logging too — production calls stream into Braintrust and can seed new dataset rows, so a thumbs-down in the UI becomes the next regression test on the next PR. But the centre of gravity is local-first: evals run on your laptop and in CI before they run in prod, which is the inverse of a gateway-first product.

Helicone — deep dive

[Figure: Helicone architecture. Helicone is gateway-first: your app points its OpenAI/Anthropic base URL at Helicone, every request flows through as a transparent proxy, and traces, cost, latency, caching, and rate limits land in the dashboard with zero SDK changes.]
Helicone sits in front of the model provider as a transparent proxy: change base_url, get traces, cost, caching, and replay.

Data model — proxied requests as the unit of observation

The Helicone unit is an HTTP request to the model provider, captured at the gateway. Each row carries the full prompt, the full response, the streaming chunks, latency, token counts, computed cost, model, provider, user/session/request tags, and any custom headers you set. Multi-step agents and chains group via the session ID — a header you set per logical "conversation" or "agent run" — so a multi-tool agent loop shows up as a session tree instead of forty unrelated requests. There is no graph-of-nodes abstraction baked in; the gateway sees what the gateway sees.

The loop it closes — production observability without an SDK migration

Helicone's pitch is the smallest possible incremental cost to get a real dashboard onto a system that is already in production. Point your OpenAI or Anthropic client at oai.helicone.ai/v1 (or the Anthropic equivalent), add a Helicone-Auth header, ship. Now cost-per-user, p95 latency, error rates, prompt-level slow queries, and cache-hit ratio land in a dashboard that day. Caching, retry, rate limit per user/key, PII filtering, and the AI Gateway's cross-provider routing all sit at the same layer — they are gateway features, not observability features, but they share the proxy. The trade-off is the inverse of LangSmith's: you give up framework-awareness (no node spans, no prompt-version linkage) and get back zero migration cost. Replay and experiments let you re-run a captured prompt with a new model or new template, scored by an evaluator, but the gravitational centre is "what is my prod doing right now."
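As a rough sketch of what "change one URL, add a header" amounts to, the helper below assembles the client configuration. The base URL and the Helicone-Auth header come from the text above; the `https://` scheme, the `Bearer` prefix, and the Helicone-Session-Id, Helicone-Session-Path, and Helicone-User-Id header names are written from memory of Helicone's documentation and should be checked against the current docs. `helicone_headers` is a hypothetical helper, and the key is a placeholder.

```python
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"  # scheme assumed; host/path from the text


def helicone_headers(helicone_api_key, session_id=None, session_path=None, user_id=None):
    """Build the extra HTTP headers a proxied request carries (names assumed, see above)."""
    headers = {"Helicone-Auth": f"Bearer {helicone_api_key}"}
    if session_id:
        headers["Helicone-Session-Id"] = session_id      # groups one agent run
    if session_path:
        headers["Helicone-Session-Path"] = session_path  # position inside the run
    if user_id:
        headers["Helicone-User-Id"] = user_id            # enables per-user cost views
    return headers


headers = helicone_headers(
    "sk-helicone-placeholder",
    session_id="run-42",
    session_path="/plan/search",
    user_id="user-7",
)
```

With the OpenAI Python SDK, these would typically be passed as `base_url=HELICONE_BASE_URL` and `default_headers=headers` when constructing the client; the application code that makes the calls stays unchanged.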

Integration shape — gateway proxy first, OTel optional

The proxy is the wire-shape, full stop. Async logging via SDK is supported for cases where you cannot reroute traffic, and the project does emit and accept some OTel signals, but the canonical install is "change one URL." For teams whose cost, quality, latency conversation is happening at the platform-bills level — finance asks "where did the $40k go," ops asks "why did the 99th percentile blow up at 14:00" — Helicone is the lowest-friction way to answer those without changing application code.
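The finance and ops questions above reduce to simple aggregations over gateway rows. Here is a minimal sketch, assuming each captured request has been exported as a dictionary with `user_id`, `cost_usd`, and `latency_ms` keys; that row shape and both helper names are hypothetical, and the numbers are invented.

```python
import math
from collections import defaultdict


def cost_per_user(rows):
    """Sum computed cost by user: the 'where did the money go' view."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user_id"]] += row["cost_usd"]
    return dict(totals)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


rows = [
    {"user_id": "a", "cost_usd": 0.25, "latency_ms": 120},
    {"user_id": "b", "cost_usd": 0.125, "latency_ms": 340},
    {"user_id": "a", "cost_usd": 0.5, "latency_ms": 95},
    {"user_id": "b", "cost_usd": 1.0, "latency_ms": 2200},
    {"user_id": "b", "cost_usd": 0.125, "latency_ms": 410},
]
spend = cost_per_user(rows)
p95 = percentile([row["latency_ms"] for row in rows], 95)
p50 = percentile([row["latency_ms"] for row in rows], 50)
```

A hosted dashboard computes these continuously and at scale; the point of the sketch is that every input it needs is already visible at the proxy.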

Arize Phoenix — deep dive

[Figure: Arize Phoenix architecture. Arize Phoenix is OpenTelemetry-native: instrumentors emit OpenInference spans from any framework (LangChain, LlamaIndex, OpenAI), an OTLP collector ingests them into Phoenix, where traces, datasets, evaluators, and embedding drift sit alongside one another in a self-hostable OSS server.]
Phoenix is OTel-native: any framework that emits OpenInference spans lands in the same server, alongside drift and embedding views inherited from Arize's ML-obs lineage.

Data model — OpenInference spans on top of OpenTelemetry

Phoenix does not invent a proprietary trace schema. The data model is OpenInference, an open spec layered on OpenTelemetry that defines span attributes for LLMs, retrievers, embeddings, tool calls, and agent steps. The instrumentors are OTel auto-instrumentors: point an OTel SDK at Phoenix's OTLP endpoint and a LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, or Bedrock call shows up as a structured LLM span without you writing wrappers. The implication is portability: the same instrumented app can ship spans to Phoenix today and to Jaeger, Tempo, or Datadog tomorrow without re-instrumenting. LangSmith and Braintrust can ingest OTLP, but none of the other three is built around that property.
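To give a feel for what a structured LLM span carries, the sketch below builds the attribute dictionary by hand. The attribute keys are written from memory of the OpenInference semantic conventions and should be verified against the spec; `llm_span_attributes` and `exporter_env` are hypothetical helpers, both endpoint URLs are placeholders, and in practice the auto-instrumentors set these attributes for you.

```python
def llm_span_attributes(model, prompt, completion, prompt_tokens, completion_tokens):
    """Attributes for one LLM call (key names assumed from the OpenInference spec)."""
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


def exporter_env(endpoint):
    """Re-pointing the exporter is a configuration change, not a code change."""
    return {"OTEL_EXPORTER_OTLP_ENDPOINT": endpoint}


attrs = llm_span_attributes(
    model="example-model",
    prompt="Summarise the incident report.",
    completion="Two services timed out.",
    prompt_tokens=120,
    completion_tokens=40,
)
today = exporter_env("http://phoenix.internal.example:6006")      # placeholder URL
tomorrow = exporter_env("http://collector.internal.example:4318")  # placeholder URL
```

`OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry environment variable for the OTLP exporter; the attribute dictionary is identical in both cases, which is the portability argument in miniature.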

The loop it closes — OTel-native monitoring with ML-obs drift

Phoenix inherits the ML-observability tradition from its parent Arize. That means drift is not an afterthought: embedding-similarity drift on RAG inputs, retrieval-quality drift over time, UMAP clusters of model outputs, and per-cohort regression are first-class views — the kind of monitoring ML teams have done on tabular models for a decade, ported to the LLM era. phoenix.evals is a Python library for running offline evaluators (LLM-judge, hallucination, retrieval quality, custom) over stored traces, and it is comfortable in a notebook. The loop closed is the "monitor in prod, notice drift, run a notebook eval, file a ticket" loop, not the "block the PR" loop. For RAG specifically, the embedding-drift view is the one piece Evaluating RAG describes as hard to bolt on later.
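One simple way to quantify embedding drift is the distance between the centroid of a baseline window and the centroid of a production window. The sketch below uses Euclidean distance on toy two-dimensional vectors; it illustrates the idea only and is not a description of how Phoenix computes its drift views. The function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def embedding_drift(baseline, production):
    """Euclidean distance between the two centroids; larger means more drift."""
    return math.dist(centroid(baseline), centroid(production))


# Toy 2-D "embeddings"; real ones have hundreds or thousands of dimensions.
baseline_window = [[0.0, 0.0], [2.0, 0.0]]
production_window = [[4.0, 3.0], [4.0, 5.0]]
drift = embedding_drift(baseline_window, production_window)
```

A monitor would compute this per time window and alert when it crosses a threshold tuned on historical data. A single centroid distance misses changes in spread or cluster structure, which is where the clustering views earn their place.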

Integration shape — OpenTelemetry, OSS, self-host first

Arize Phoenix is the most self-host-oriented of the four. It is licensed under the Elastic License 2.0 (source-available rather than an OSI-approved open-source license, so read its terms before offering it to others as a hosted service), runs from pip install arize-phoenix in a notebook, and scales to a Docker/Kubernetes deployment, with Phoenix Cloud as a hosted convenience and Arize AX as the paid production tier when you outgrow the self-hosted server. The OTel-native design is the lock-in story in reverse: your instrumentation is portable by construction, and Phoenix's value lands on top of that portability rather than under it. The trade-off is workflow density: Phoenix gives you the primitives, but it does not push you toward a CI workflow the way Braintrust does or a Prompt Hub workflow the way LangSmith does. You assemble the loop yourself.

Cross-cutting comparison

Instrumentation shape — SDK vs proxy vs OTel

[Figure: Instrumentation shape. Four-column comparison of how each platform gets data: LangSmith via LangChain-native SDK callbacks, Braintrust via TypeScript/Python wrappers around model calls, Helicone via a transparent HTTP proxy (no SDK change), and Arize Phoenix via OpenTelemetry/OpenInference instrumentors.]
How the data physically arrives differs more than the dashboards on top of it.

Four products, three architectures for how the trace data leaves your process. LangSmith and Braintrust both hand you an SDK, but they sit on opposite ends of "auto vs explicit": LangSmith is framework-coupled (LangChain's tracer callback fires for free; the @traceable decorator covers the rest), while Braintrust is wrapper-coupled (you opt in with wrapOpenAI or @traced, and the win is that an eval is just another function with those wrappers). Helicone takes a different path entirely: the proxy sits in front of every request, so instrumentation is a base_url change rather than a code change, and the trade-off is that you only see what crosses the wire. Phoenix moves outside the proprietary-SDK frame altogether by speaking OpenTelemetry/OpenInference, which makes it the only one of the four whose instrumentation outlives the vendor: re-point the OTel exporter and the same spans go anywhere. If your team standardises on OTel for the rest of the stack, that asymmetry is decisive.

Evaluation model — offline vs online vs CI

[Figure: Evaluation model. Four-column comparison of how each platform models evaluation: LangSmith centers offline + online evals around the Prompt Hub dev loop, Braintrust treats Eval() as a first-class CI artifact with regression diffs, Helicone offers replay + experiments anchored to captured production traces, and Arize Phoenix runs notebook-style phoenix.evals over OTel traces with embedding-drift extensions.]
Where in the lifecycle "did this get worse?" is supposed to be answered.

All four can run an LLM-judge against a dataset. They diverge on which moment of the lifecycle the eval is meant to live in. Braintrust is the strongest opinion: an eval is code, it runs on every PR, and a regression blocks merge — the answer to "did this get worse" arrives before the change ships. LangSmith straddles dev and prod through the Prompt Hub plus online evaluators that sample production traces and surface regressions after the fact. Helicone anchors its evals to captured production requests — its replay/experiments flow re-runs a real prod prompt with a new template, which is closer to a post-hoc what-if than a gating test. Phoenix runs evaluators as a Python library over OTel traces (comfortable in a notebook, less so as a CI gate) and adds the embedding-drift view as a continuous monitor rather than a discrete test. None of these is wrong; they answer the question at different points, and the right pick depends on whether your bug is "we regressed on a known dataset" (Braintrust), "production looks weird and I want to know why" (Helicone, Phoenix), or "the prompt change I'm about to ship is risky" (LangSmith).

Open-source vs SaaS — and who holds the data

[Figure: Open-source vs SaaS deployment. Four-column comparison of deployment posture: LangSmith is SaaS-first with a paid self-hosted tier (closed source; SDK is OSS), Braintrust is SaaS with an enterprise self-host (open-source SDK + autoevals; backend is closed), Helicone is Apache-2.0 open source with a hosted offering, and Arize Phoenix is Elastic-2.0 and designed to run anywhere, with an optional Phoenix Cloud / Arize AX upgrade.]
OSS posture also decides who holds your prompts, your outputs, and your eval datasets.

This axis splits cleanly into two pairs. LangSmith and Braintrust are SaaS-first products with paid self-host tiers for enterprises that cannot send prompts to a vendor; the OSS components are SDKs, not the backend that stores your data. Helicone and Phoenix publish the full stack: Helicone's gateway is Apache-2.0 and you can run everything on your own boxes; Phoenix is under the Elastic License 2.0 (source-available rather than OSI-approved open source) and is designed to run anywhere, from a notebook cell to a Kubernetes deployment. For a regulated workload in health, finance, or government, that pair is the natural starting point, with Phoenix's OTel-native design and Helicone's gateway shape covering different halves of the problem. For everyone else the trade is convenience: SaaS-first products move faster on UI polish and online-eval features; self-hostable products move faster on portability and "we own the data plane."

When to pick which

Use case | Pick LangSmith if… | Pick Braintrust if… | Pick Helicone if… | Pick Arize Phoenix if…
Tightening a prompt-iteration loop | You live in LangChain/LangGraph and want a Prompt Hub + Playground + online evals as one product. | You want every prompt change to ship through CI with a regression diff before it merges. | Not the natural fit: Helicone watches what production does, not what dev is about to ship. | You will assemble the loop yourself in a notebook over OTel traces.
Making evals fail PRs | Possible via SDK + CI, but the workflow is not as opinionated. | This is the entire pitch: Eval() in TS/Py, regression diff, blocking PR check. | Replay/experiments are post-hoc, not a CI gate. | You can wire phoenix.evals into CI, but it is a library, not a workflow.
Getting cost + latency telemetry on prod today | Possible, but instrumentation requires SDK changes. | Possible via wrappers, but the centre of gravity is offline runs. | Yes: point base_url at the gateway and the dashboards light up in minutes. | Yes if you already run OTel; the OpenInference instrumentors give you cost + latency out of the box.
Monitoring embedding/RAG drift | Limited; not the niche. | Limited; not the niche. | Limited: gateway-level metrics, not embedding-space metrics. | This is the inherited Arize muscle: UMAP, embedding drift, retrieval drift.
Self-host + own the data plane | Available only on the paid enterprise tier. | Available only on the paid enterprise tier. | Yes: Apache-2.0, run docker-compose anywhere. | Yes: Elastic-2.0, runs in a notebook through to Kubernetes.
Vendor-neutral instrumentation | OTel supported, but the value is LangChain-shaped. | SDK-coupled. | Proxy is portable, but data lives in Helicone-shaped tables. | OpenInference + OTel by construction: re-point the exporter, keep the spans.

FAQ

Do I have to use LangChain to use LangSmith?

No, but you give up most of the LangChain-shaped UI affordances. The platform ingests traces from arbitrary Python or JavaScript code via the @traceable decorator and also accepts OpenTelemetry data over OTLP, so a non-LangChain stack can use LangSmith for traces, datasets, and evaluators. The reason to pick LangSmith over the alternatives, though, is the framework-aware view — agent-step grouping, prompt-version linkage in the Prompt Hub, graph trace visualisations — which are most valuable when your app is already a LangChain or LangGraph graph.

Can I run Braintrust evals locally without paying for the SaaS?

Yes — the SDK and the autoevals scorer library are open source, and braintrust eval ./eval.ts will run your eval suite and print scores from a laptop or CI runner. What you lose without the SaaS backend is the persisted experiment store, the side-by-side regression diff UI, the playground for prompt iteration, and the online logging path that feeds production traces back into your dataset. The pitch is that those persisted artifacts are what make the CI loop close — local-only is fine for the first eval, less fine for the tenth pull request.

Is Helicone really just a proxy? What if I cannot route traffic through a vendor?

The canonical install is the proxy, but it is not the only one. Helicone also offers an async logging SDK that captures requests in your process and streams them to the platform without rerouting traffic, which suits cases where the vendor URL is fixed or compliance forbids the redirect. The full Helicone stack including the gateway is Apache-2.0, so the strongest version of "I cannot send this to a vendor" is to self-host the gateway entirely — same wire shape, your own cluster.

What is the difference between Arize Phoenix and Arize AX?

Phoenix is the open-source LLM-observability project — Elastic-2.0, free, self-hostable, with Phoenix Cloud as a managed hosted version. Arize AX is the paid production tier from the same company: it inherits the OpenInference data model and the embedding-drift heritage, and adds the scale, alerting, role-based access control, and the broader ML-observability surface for non-LLM models that production teams typically buy. The graduation path is intentional — start on OSS Phoenix in dev, migrate to AX when production scale or governance demands it, without re-instrumenting because both speak OpenTelemetry.

Which of these supports OpenTelemetry?

All four claim some support; the depth differs sharply. Arize Phoenix is OTel-native by construction — OpenInference is the data model and the instrumentors are OTel auto-instrumentors, so the same spans can ship to any OTel backend. LangSmith and Braintrust both accept OTLP data, which is enough to use them as the trace sink in an OTel pipeline, but neither was designed OTel-first and their richest features assume their own SDK shape. Helicone is gateway-first; OTel is a secondary path. If "instrumentation must survive a future vendor switch" is on your checklist, Phoenix is the only answer that does not require an asterisk.

Do I need an evals platform at all if I have a few unit tests?

Unit tests cover deterministic code; LLM behaviour is not deterministic. Even a tiny LLM-judge over a hand-curated golden set will catch prompt regressions that unit tests cannot see, and the cost of running one is hours, not weeks — see Evals 101 for the minimum-viable setup. The case for picking a platform over rolling your own grows with the number of evaluators you maintain, the number of people changing prompts, and the need to compare runs side-by-side rather than as scrollback. Below that bar, a CSV of test cases and a Python script is honest work.

Further reading

On this wiki:

  • Evals 101 — what an eval set actually contains, why LLM-judges are not free, and the minimum-viable harness before any platform.
  • Agent Frameworks — what a framework adds over raw model calls, since LangSmith's value is tightest when your app is already framework-shaped.
  • Cost, Quality, Latency — the three-way trade these platforms instrument, and why gateway-level numbers (Helicone's home turf) and quality scores (Braintrust's) are not interchangeable.
  • Evaluating RAG — the retrieval/grounding/answer-quality split, useful before you pick a platform to score it on.
  • Evaluating Memory Quality — memory-specific metrics (recall@k, staleness, drift) that are easier to monitor on a platform that already speaks embeddings.

Project sources: