Agentic AI for Trading Research: When the LLM Sits in the Loop

The hype says AI agents run the fund; the reality in 2026 is that LLM agents run the research desk — fundamentals, sentiment, bull-bear debate, risk sign-off — while rule-based code still pulls the trigger. Knowing where the line sits is the difference between deploying the pattern and over-trusting it.

By Agentic AI Wiki

LiveTradeBench ran twenty-one frontier LLMs through fifty days of live U.S. equities and Polymarket trading between August and October 2025 and reached a verdict that should reshape how anyone evaluates "AI fund" claims: a model's rank on LMArena predicts almost nothing about its rank on P&L. The agentic layer of a trading shop is real and is being deployed, but it lives on the research desk, not the execution venue, and the patterns that work look more like a junior-analyst team than a robo-trader. Read through the lens of the agent-firm pattern, the "autonomous AI trader" narrative resolves into a much narrower (and more defensible) claim.

At a glance

Four projects bracket the 2026 agentic-trading landscape — TradingAgents as the multi-agent research framework, BloombergGPT as the closed domain LLM, FinGPT as the open one, and LiveTradeBench as the live evaluation harness. Together they describe the design space of "an LLM in the trading loop."

| Project | Released | Primary role | Deployment shape |
| --- | --- | --- | --- |
| TradingAgents | 2024-12 → v7 2025-06 | Multi-agent research framework | Open-source Python; analyst → researcher debate → trader → risk supervisor. |
| BloombergGPT | 2023-03 | Closed domain LLM (50B params, finance-pretrained) | Internal-only at Bloomberg; specialised for sentiment, QA, NER on finance text. |
| FinGPT | 2023 → ongoing | Open-source finance LLM family | Fine-tunes on top of open base models; outperforms BloombergGPT on several public benchmarks. |
| LiveTradeBench | 2025-11 | Live trading benchmark | 50-day live eval (Aug–Oct 2025) across U.S. equities + Polymarket, 21 LLMs. |

The LiveTradeBench finding worth highlighting: static-benchmark rank (LMArena, MMLU) does not predict live trading P&L. Some models that placed mid-pack on chat benchmarks topped the trading evaluation; some chat-benchmark leaders placed last. The implication for anyone building an agentic trading system: do not pick the model by leaderboard; evaluate it on the actual task.

The agent-firm architecture

[Figure: Trading agent firm architecture. Three specialist analyst agents (fundamentals, sentiment, technicals) each write a memo; bull and bear researchers argue the long case from those memos; a trader agent synthesises memos and debate into a proposal; a risk supervisor approves, modifies, or rejects it against the portfolio. The approved order then crosses from the LLM region (research desk) to deterministic execution: a broker using SOR, TWAP, or RL under guards.]
The agent firm. Specialists feed researchers, researchers debate, the trader synthesises, the risk supervisor gates. Order placement leaves the LLM region.

The reference design, clearest in the TradingAgents paper but mirrored in a half-dozen 2025 follow-ups, is a small specialised firm rather than a single super-agent. Three specialist analyst agents read different evidence: a fundamentals analyst reads financials and filings; a sentiment analyst aggregates news, StockTwits, and Reddit into a mood signal; and a technicals analyst reads chart patterns and order-book imbalance. None of them is asked to make the call alone: each writes a short analytical memo and hands it up the chain.

The next layer is a bull researcher and a bear researcher who argue the trade. The two researchers are instances of the same model with different prompts that push them toward opposite conclusions; their debate produces a transcript that surfaces the weakest part of each thesis. The "Debate, voting and ensembles" page covers the general result: the gain from debate depends entirely on engineered diversity. Unless the bull and bear are forced to defend opposite positions, they collapse to the initial majority view and the debate is theatre.

A trader agent reads the analyst memos and the debate transcript and proposes an action. A risk supervisor reads the proposal against current portfolio state and approves, modifies, or rejects it. This is exactly the pattern the supervisor–worker deep-dive describes: workers do the local reasoning, and the supervisor enforces global constraints. The risk supervisor is also where the human-in-the-loop hook lives in production: at retail platforms the sign-off is automated; at institutional shops it is gated behind human review.
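The approve / modify / reject step can be sketched as a plain function over the proposal and the portfolio state. This is a minimal illustration under stated assumptions, not the TradingAgents implementation: the `Proposal` and `Portfolio` types, the `review` function, the per-name exposure cap, and the no-shorting rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    APPROVE = "approve"
    MODIFY = "modify"
    REJECT = "reject"


@dataclass(frozen=True)
class Proposal:
    ticker: str
    side: str       # "buy" or "sell"
    quantity: int   # shares
    price: float    # reference price used for sizing


@dataclass(frozen=True)
class Portfolio:
    cash: float
    positions: dict             # ticker -> shares held
    max_position_value: float   # hypothetical per-name exposure cap


def review(p: Proposal, pf: Portfolio) -> tuple[Verdict, Optional[Proposal]]:
    """Check a proposal against portfolio state; return the verdict and the order to pass on."""
    if p.side not in ("buy", "sell") or p.quantity <= 0 or p.price <= 0:
        return Verdict.REJECT, None
    held = pf.positions.get(p.ticker, 0)
    if p.side == "sell":
        # Assumption: no shorting in this sketch, so sell at most what is held.
        allowed = held
    else:
        # Largest buy that respects both cash on hand and the per-name cap.
        room = pf.max_position_value - held * p.price
        allowed = int(min(pf.cash, room) // p.price)
    if allowed <= 0:
        return Verdict.REJECT, None
    if p.quantity > allowed:
        # Resize rather than reject when a smaller order would satisfy the limits.
        return Verdict.MODIFY, Proposal(p.ticker, p.side, allowed, p.price)
    return Verdict.APPROVE, p
```

The supervisor here is deliberately deterministic; an LLM-based supervisor could sit in front of it, but the hard limits are easier to audit when they live in code like this.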

The execution path leaves the LLM region. The risk-approved order goes to a deterministic execution system: a SOR (smart order router), a TWAP or VWAP scheduler, or an RL-derived execution policy under hard rule-based guards. Press accounts of agent-firm deployments in 2026 all describe the same line — the LLM is in research, the rule-based code is in execution. Crossing the line is the failure mode the production teams design hardest against.
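To make "deterministic execution under hard guards" concrete, here is a minimal sketch of a TWAP-style slicer. The function name `twap_schedule`, the `ChildOrder` type, and the `max_child_qty` guard are assumptions for illustration; a production scheduler would also react to fills, volume, and venue state, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildOrder:
    offset_seconds: int   # when to send, relative to the start of the window
    quantity: int         # shares in this slice


def twap_schedule(total_qty: int, horizon_seconds: int, n_slices: int,
                  max_child_qty: int) -> list[ChildOrder]:
    """Split a parent order into evenly spaced, near-equal child orders."""
    if total_qty <= 0 or horizon_seconds <= 0 or n_slices <= 0:
        raise ValueError("total_qty, horizon_seconds and n_slices must be positive")
    base, remainder = divmod(total_qty, n_slices)
    # Spread the remainder over the first slices so sizes differ by at most one share.
    quantities = [base + (1 if i < remainder else 0) for i in range(n_slices)]
    # Hard guard: refuse the whole schedule rather than send an oversized slice.
    if max(quantities) > max_child_qty:
        raise ValueError("child order exceeds the hard size limit")
    interval = horizon_seconds // n_slices
    return [ChildOrder(i * interval, q) for i, q in enumerate(quantities) if q > 0]
```

The same inputs always produce the same schedule, which is the property that makes this side of the boundary easy to reconstruct after the fact.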

Memory and tools the trading agent needs

[Figure: Trading agent tools and memory surface. A central LLM agent (read, reason, act) connects to four tools: market data (read: price, LOB, greeks, vol), news/transcripts (RAG: filings, analyst notes), portfolio state (read: positions, cash, exposures), and a guarded broker write tool (human approval at institutional shops). Two memory blocks: a short-term scratchpad for the current decision and a long-term store of prior theses and outcomes.]
A trading agent's tool surface and memory. Read tools are commodity; the broker write tool is where the guard goes.

A trading agent's tool surface is narrower than you might expect: a market-data read tool, a news/transcript retrieval tool (essentially agentic retrieval over a private corpus of filings and analyst notes), a portfolio-state read tool, and a broker-write tool. The first three are commodity reads; the fourth is where the production stack puts its guard. At retail platforms, the broker tool is auto-confirmed under tight per-user limits. At institutional shops, either the broker tool does not exist for the LLM agent at all (the LLM writes a structured proposal that a separate deterministic system executes) or it is gated behind human approval. The "Tool-design principles" page covers why the boundaries on the broker tool are load-bearing, and the "Structured tool I/O" page covers the schema discipline that keeps the proposal machine-readable.
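One way to picture the guard on the broker-write tool is a thin wrapper that the agent calls instead of the broker itself. The sketch below is illustrative only: `GuardedBroker`, `OrderRequest`, the `approve` callback, and the notional limit are hypothetical names, and the `submit` callable stands in for whatever deterministic system actually places orders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OrderRequest:
    ticker: str
    side: str        # "buy" or "sell"
    quantity: int
    notional: float  # quantity times reference price


class GuardedBroker:
    """Write tool exposed to the agent; every order passes a limit and an approver."""

    def __init__(self, submit: Callable[[OrderRequest], None],
                 approve: Callable[[OrderRequest], bool],
                 max_notional: float) -> None:
        self._submit = submit              # downstream deterministic execution
        self._approve = approve            # human reviewer or automated policy
        self._max_notional = max_notional  # hard per-order limit

    def place(self, order: OrderRequest) -> str:
        # The limit is checked first so the approver never sees an out-of-bounds order.
        if order.notional > self._max_notional:
            return "blocked: notional limit"
        if not self._approve(order):
            return "blocked: approval denied"
        self._submit(order)
        return "submitted"
```

At a retail platform the `approve` callback might be an automated policy; at an institutional shop it might block until a human signs off. The agent sees the same tool in both cases.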

Memory splits into two distinct stores. Short-term memory is the scratchpad for the current decision: the analyst memos, the debate transcript, the supervisor's response. Long-term memory holds prior theses and how they played out. A thesis that called a beat correctly is high-value context for the next earnings cycle; a thesis that called a beat and was wrong is even more valuable, because it shows what the agent missed. See the "Memory stores" page for the backend choice and the "Short vs long-term memory" page for the boundary discipline. A common failure in early agent-firm deployments is having no long-term store at all: every decision is made from a fresh context, so the agent learns nothing across decisions.
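The two stores can be sketched as two small data structures. This is an in-memory illustration under assumptions, not a recommended backend: `Scratchpad`, `ThesisRecord`, and `ThesisStore` are hypothetical names, and matching on exact ticker and event type stands in for whatever similarity search a real store would use.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """Short-term memory: lives only for the current decision."""
    memos: list = field(default_factory=list)
    debate: list = field(default_factory=list)
    supervisor_reply: str = ""


@dataclass(frozen=True)
class ThesisRecord:
    """Long-term memory entry: a past thesis and how it played out."""
    ticker: str
    event: str     # e.g. "earnings"
    thesis: str
    outcome: str   # "correct" or "wrong"
    lesson: str = ""


class ThesisStore:
    def __init__(self) -> None:
        self._records: list[ThesisRecord] = []

    def add(self, record: ThesisRecord) -> None:
        self._records.append(record)

    def recall(self, ticker: str, event: str, limit: int = 3) -> list[ThesisRecord]:
        """Return past theses for the same ticker and event, wrong calls first."""
        matches = [r for r in self._records if r.ticker == ticker and r.event == event]
        # Stable sort: misses come first because they carry the most to learn from.
        matches.sort(key=lambda r: r.outcome != "wrong")
        return matches[:limit]
```

A new `Scratchpad` is created per decision and discarded afterwards; the `ThesisStore` persists, and its `recall` output is what gets added to the context of the next similar decision.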

Domain LLMs vs prompted general LLMs

[Figure: Domain LLM vs open finance LLM vs prompted general LLM, compared across four axes. Cells are shaded low/weak, medium, or high/strong.]

| Axis | BloombergGPT | FinGPT | Prompted general |
| --- | --- | --- | --- |
| Finance recall (sentiment / NER / QA) | High | High | Medium |
| Generality (non-finance reasoning) | Low | Medium | High |
| Openness (deploy / fine-tune) | Closed | Open | API-only |
| Cost per call (at production scale) | Medium (internal) | Low (self-host) | High |

The trade space is finance-recall vs generality vs openness vs cost. None of the three dominates the others; the right pick depends on the use case.

BloombergGPT is the canonical closed domain LLM — 50B parameters pre-trained on Bloomberg's proprietary finance corpus. It excels at sentiment classification, financial NER, and finance-specific QA, and it lives inside Bloomberg's products. The trade-off is that it is closed: outside Bloomberg, you cannot deploy or fine-tune it, and you cannot integrate it into your own agent firm. The 2026 reality is that several open finance LLMs — most prominently FinGPT — have caught up or pulled ahead on public finance benchmarks at a fraction of the training cost, because the data moat narrowed faster than the compute moat.

FinGPT fine-tunes on top of open base models (LLaMA, Mistral, and successors), trains cheaply, and outperforms BloombergGPT on several published finance NLP tasks. The trade-off is the inverse: FinGPT specialises hard, and its generality outside finance is worse than that of the base model it was trained from. If your agent firm only needs to read financial text and pick stocks, FinGPT is a strong fit; if any agent in the firm needs to reason about something other than finance (a geopolitical event, a regulatory change, a supply-chain failure), the specialisation hurts.

Prompted frontier general models (Claude, GPT-class, Gemini) sit at the third corner. They are the default for the analyst and researcher roles in TradingAgents-style frameworks because they bring broad reasoning and tool-use capability; they pay for it in cost per call and in finance recall that is good but not best-in-class. The 2025–2026 trend is hybrid: use a finance-pretrained model as a sentiment-classifier tool inside an agent firm whose reasoning loop runs on a general frontier model. The "Prompt, fine-tune, or RL" page covers the general decision rule for when to escalate from prompting to specialisation.

What live benchmarks reveal

[Figure: LiveTradeBench static-benchmark rank vs live P&L rank. Stylised schematic with five frontier families over the 50-day window, Aug 18 – Oct 24, 2025. LMArena (static) rank order: Family A, B, C, D, E. LiveTradeBench P&L rank order: Family D, A, E, B, C. The connectors between the two columns cross.]
LiveTradeBench's headline finding, schematically. Static-benchmark rank (LMArena) and live trading-P&L rank do not line up; the connection lines cross.

LiveTradeBench ran 21 LLMs across U.S. equities and Polymarket prediction markets in a fifty-day live window from August 18 to October 24, 2025. The paper's three headline findings cut against several intuitions practitioners walk in with. First, high LMArena scores do not imply superior trading outcomes — some chat-benchmark leaders ranked at the bottom of the P&L table, and some mid-pack chat models topped it. Second, distinct portfolio styles emerged that reflect each model's risk appetite and reasoning dynamics — some models concentrate in conviction trades, some diversify aggressively, and the same prompt elicits structurally different styles from different families. Third, only some LLMs effectively leverage live signals to adapt their decisions — others appear to anchor on their training-era priors and miss live regime changes entirely.
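A simple way to run the same rank-stability check on your own candidates is a Spearman rank correlation between the leaderboard ordering and the live P&L ordering. The sketch below uses the stylised five-family ordering from the schematic, which is illustrative placeholder data and not the paper's results; `spearman_rho` is a hypothetical helper that assumes no tied ranks.

```python
def spearman_rho(rank_a: dict, rank_b: dict) -> float:
    """Spearman rank correlation for two untied rankings over the same items."""
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, then the closed form for untied ranks.
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Stylised placeholder ranks (1 = best), not real benchmark results.
lmarena_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
pnl_rank = {"D": 1, "A": 2, "E": 3, "B": 4, "C": 5}

rho = spearman_rho(lmarena_rank, pnl_rank)
print(f"Spearman rho = {rho:.2f}")
```

For the stylised ordering the coefficient comes out at roughly -0.1, which is the numerical version of "the connection lines cross": a value near zero means the leaderboard rank carries essentially no information about the P&L rank.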

The methodological detail to take seriously is that LiveTradeBench runs against actual market data streams with portfolio-level control; it is not a backtest. Backtest-only evaluations are exactly the regime in which overfitting hides. The "Evals 101" page covers the general failure: a benchmark that does not vary the input distribution between train and eval tells you nothing about deployment behaviour. The trading-research community is reaching for live benchmarks now because paper-based backtest results have lost credibility in 2026, and the live benchmark numbers are humbler.

An adjacent benchmark, TraderBench, evaluates AI agents under adversarial market conditions and finds that agents that ranked well on cooperative benchmarks degrade sharply under adversarial pressure. The "Multi-agent failure modes" page covers the general result; the trading-specific implication is that an agent firm tuned in a cooperative simulator will not generalise to a market where other participants are trying to detect and trade against it.

When to pick which

| Use case | Pick TradingAgents-style firm if… | Pick BloombergGPT (or domain LLM) if… | Pick prompted general LLM if… |
| --- | --- | --- | --- |
| Equity research desk | You want bull/bear debate and a synth memo, run nightly across coverage list. | You only need sentiment + NER on filings, and Bloomberg is already your stack. | You need the agent to also reason about non-finance context (geo, supply chain). |
| Sentiment signal | You need diverse opinions cross-checked before the signal fires. | You need best-in-class accuracy on financial sentiment text only. | You need quick prototyping and can tune cost later. |
| Trade idea generation | You want the auditable thesis chain a regulator can read. | You are inside a closed corporate stack and need integration over flexibility. | You want to spin up multiple specialist agents quickly without training. |
| Execution | Do not; execution stays rule-based. | Do not; execution stays rule-based. | Do not; execution stays rule-based. |

[Figure: Feature matrix for agentic-trading systems. Heatmap with four systems as rows and five capabilities as columns; each cell is shaded weak, medium, or strong.]

| System | Finance recall | Multi-agent orchestration | Openness / deployability | Production boundaries | Cost efficiency |
| --- | --- | --- | --- | --- | --- |
| TradingAgents | Medium | Strong (firm pattern) | Strong (OSS) | Medium | Medium |
| BloombergGPT | Strong | Weak (single model) | Weak (closed) | Strong (internal) | Medium |
| FinGPT | Strong | Medium | Strong (OSS) | Medium | Strong (self-host) |
| Prompted general LLM | Medium | Strong (broad) | API-only | Strong (mature) | Weak |

Where each system leans hardest. No row dominates; the right pick is per use case.

FAQ

Can an LLM agent actually trade autonomously?

At retail scale, yes: several 2026 retail platforms place small orders without human review, under tight per-user dollar and concentration limits. At institutional scale, almost never: the LLM writes a structured proposal and a deterministic execution system places the order. The institutional reluctance is not technical conservatism; it is auditability. A non-deterministic agent's order is harder to reconstruct after the fact than a rule-based router's.

Why do production stacks still use rule-based execution?

Latency, auditability, and failure-mode containment. An LLM call costs hundreds of milliseconds; an execution decision costs microseconds. An LLM call is non-deterministic and hard to explain after a loss; a rule-based router is deterministic and auditable. An LLM hallucination at the broker tool is a market-impact event; a rule-based router cannot hallucinate an order. The line between research and execution is where these three concerns are containable.

How do you handle hallucination in agent research output?

Three mechanisms in combination. First, structured tool I/O: the analyst memos and trader proposal are typed objects, not free text, so a hallucinated ticker symbol fails schema validation before it reaches the trader (see the "Structured tool I/O" page). Second, the risk supervisor reads the proposal against portfolio state and rejects anything inconsistent with cash on hand, exposure limits, or position direction. Third, the broker tool itself validates that the ticker, side, and size are well-formed before placing the order; a hallucination that survives the first two gates dies at the third.
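The first gate can be sketched as a strict parser that turns the model's raw JSON into a typed object or fails loudly. This is a minimal illustration using only the standard library; `TradeProposal`, `ProposalError`, `parse_proposal`, and the `KNOWN_TICKERS` universe are hypothetical names, and a real stack might use a schema library instead.

```python
import json
from dataclasses import dataclass

# Hypothetical coverage universe; in practice this comes from a reference-data service.
KNOWN_TICKERS = {"AAPL", "MSFT", "NVDA"}


@dataclass(frozen=True)
class TradeProposal:
    ticker: str
    side: str
    quantity: int


class ProposalError(ValueError):
    """Raised when model output does not match the proposal schema."""


def parse_proposal(raw: str) -> TradeProposal:
    """Parse raw model output into a typed proposal, rejecting anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ProposalError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"ticker", "side", "quantity"}:
        raise ProposalError("expected exactly the fields ticker, side, quantity")
    ticker, side, quantity = data["ticker"], data["side"], data["quantity"]
    # A hallucinated symbol fails here, before any downstream agent reads it.
    if not isinstance(ticker, str) or ticker not in KNOWN_TICKERS:
        raise ProposalError(f"unknown ticker: {ticker!r}")
    if side not in ("buy", "sell"):
        raise ProposalError(f"invalid side: {side!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        raise ProposalError(f"invalid quantity: {quantity!r}")
    return TradeProposal(ticker, side, quantity)
```

For example, `parse_proposal('{"ticker": "APPL", "side": "buy", "quantity": 10}')` raises `ProposalError` because the misspelled symbol is not in the universe, while the same payload with `"AAPL"` returns a `TradeProposal`.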

Does fine-tuning on finance data beat a prompted frontier model?

On narrow finance tasks — sentiment, NER, financial QA — yes, often by a meaningful margin. On agent-firm tasks that require general reasoning, planning, and tool use, the prompted frontier model usually wins. The 2026 hybrid pattern is to use a fine-tuned model as a tool inside an agent firm whose reasoning loop runs on a frontier model, getting both the specialisation and the generality.
