
FinRL vs TensorTrade vs ABIDES-Gym vs ElegantRL: Who Controls the Simulation Contract


By Agentic AI Wiki

Read the README of any of these four projects and the feature lists look almost identical: a Gymnasium env, OHLCV ingest, PPO/SAC/A2C/DQN out of the box, a backtest curve. The thing that decides which one survives a serious research-or-prod loop is invisible there: who controls the simulation contract — the action shape, the cost model, the slippage assumption, the reward, where an episode ends. FinRL bakes opinionated finance assumptions into the env so you trade off control for convenience. TensorTrade keeps the env data-source-agnostic and asks you to assemble the contract from plug-ins. ABIDES-Gym derives the contract from a discrete-event LOB simulator that other agents trade inside. ElegantRL is an RL-engineering library that treats trading as one application — the env is your problem, the trainer's job is to be fast. Pick on that axis first; everything else follows.

At a glance

Four RL-for-trading projects, four answers to the same question — what does the env decide for you, and what do you have to bring yourself. The table lists the basics; the matrix below it shows where each one leans hardest across the axes that actually differ.

| Project | Released / maintainer | Primary niche | Where it runs |
| --- | --- | --- | --- |
| FinRL | 2020, AI4Finance Foundation | End-to-end financial RL framework with baked-in finance assumptions | Local Python / notebook; demos for cloud GPUs |
| TensorTrade | 2019, tensortrade-org (community) | Composable Gymnasium env assembled from action / reward / exchange plug-ins | Local Python; Ray Tune for scale-out |
| ABIDES-Gym | 2021, J.P. Morgan AI Research | Gym wrapper around the ABIDES multi-agent LOB simulator | Local Python; single-threaded simulator process |
| ElegantRL | 2021, AI4Finance Foundation | Cloud-native, massively parallel deep-RL library (finance is one app) | Single GPU; multi-pod cluster via Podracer |

Snapshot: 2026-06-02. The "MarketGym" name circulating in survey papers is not a single canonical project — we substitute ABIDES-Gym as the LOB-microstructure peer because it is the actively maintained reference implementation. Frameworks move quickly; verify against current docs.

RL-for-trading feature matrix (heatmap; in the original, fill color runs from light for weak to dark orange-red for strong)

| Axis | FinRL | TensorTrade | ABIDES-Gym | ElegantRL |
| --- | --- | --- | --- | --- |
| Ready-made trading env | StockTradingEnv (default) | Generic env, you assemble | Execution + daily-investor envs | FinRL-Meta stock env demos |
| Modular composability | Inheritance, not plug-in | Action / Reward / Exchange plug-ins | Custom agent classes in sim | Env plug-in contract |
| LOB / microstructure | OHLCV only, no LOB | Plug-in dependent | Full LOB + latency model | OHLCV only |
| Bundled RL algos | SB3 / RLlib / ElegantRL | None bundled, use any trainer | BYO trainer | Own PPO / SAC / TD3 / A2C / DQN |
| Parallel-env throughput | vec-env via SB3 | SubprocVecEnv (CPU only) | Single-thread discrete-event | 4 k – 16 k envs on one GPU |
| Broker / live hooks | Alpaca paper trade demo | CCXT exchange plug-in | Sim-only | Sim-only |
Where each project leans hardest. The axes converge on the surface and pull apart where the simulation contract is decided.

FinRL — deep dive

[Figure: FinRL architecture. A vertically integrated stack: market-data layer (Yahoo, Alpaca, CCXT), preprocessed environment layer with finance-specific defaults (transaction cost, turbulence, reward shape), agent layer wrapping Stable-Baselines3 / RLlib / ElegantRL, and a backtest / paper-trade layer with built-in baselines. Panels: Data layer (DataProcessor); Environment layer (StockTradingEnv, PortfolioOptimizationEnv, CryptoEnv); Agent layer (DRLAgent); Backtest / paper trade.]
FinRL is a vertically integrated stack: data → finance-opinionated env → DRLAgent facade → backtest, with the simulation contract baked into the env layer.

The simulation contract

FinRL ships StockTradingEnv, PortfolioOptimizationEnv, and a small zoo of crypto / paper-trading envs whose defaults are not neutral: they are finance opinions written into code. The action space is Box([-1, 1], n_stocks), where each component is a signed trade size that the env scales by hmax to get the number of shares to buy or sell; the reward is the change in portfolio value with a bps transaction cost subtracted; fills happen at the close price of the bar; an episode ends when cash runs out or the price series does; and an optional turbulence index flattens positions during volatile regimes. The numeric values (hmax, the cost rate, the turbulence threshold) are constructor arguments, but the shape of the contract is not negotiable through configuration alone: to change it you subclass the env and override the relevant methods. That is the FinRL deal: the contract is the framework's answer to "how a portfolio agent trades a basket of US stocks at daily frequency," and most of the work is already done.
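To make that contract concrete, here is a minimal pure-Python sketch of a single env step under the same assumptions: actions in [-1, 1] scaled by hmax, fills at the close, a proportional cost, and reward equal to the change in portfolio value. It is not FinRL's code; the class name `DailyBarPortfolio`, the no-shorting rule, the sells-before-buys ordering, and the default values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DailyBarPortfolio:
    """Toy one-step model of a FinRL-style contract (hypothetical, not FinRL's code)."""
    cash: float
    shares: list[int]
    hmax: int = 100           # max shares traded per asset per step (assumed value)
    cost_pct: float = 0.001   # proportional cost of 10 bps per trade (assumed value)

    def value(self, close: list[float]) -> float:
        return self.cash + sum(n * p for n, p in zip(self.shares, close))

    def step(self, action: list[float], close: list[float],
             next_close: list[float]) -> float:
        """Trade at today's close; reward is the change in portfolio value."""
        before = self.value(close)
        # Process the most negative actions first so sale proceeds can fund buys.
        for i, a in sorted(enumerate(action), key=lambda pair: pair[1]):
            qty = int(max(-1.0, min(1.0, a)) * self.hmax)
            if qty < 0:
                qty = -min(-qty, self.shares[i])          # assumption: no shorting
                self.cash += -qty * close[i] * (1 - self.cost_pct)
            elif qty > 0:
                affordable = int(self.cash // (close[i] * (1 + self.cost_pct)))
                qty = min(qty, affordable)                # assumption: no leverage
                self.cash -= qty * close[i] * (1 + self.cost_pct)
            self.shares[i] += qty
        return self.value(next_close) - before
```

Starting with 10,000 in cash and 10 shares of the second asset, the action [0.5, -1.0] at closes [100, 50] buys 50 shares of the first asset and sells all 10 of the second. If the first asset then closes at 101, the reward is 50 of price gain minus 5.50 of costs, or 44.50. Every simplification in this sketch (no impact, no partial fills, no spread) is one the agent can learn to exploit.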

Agent and trainer

The agent layer is a thin DRLAgent facade that delegates to Stable-Baselines3 (PPO, A2C, SAC, TD3, DDPG), RLlib, or ElegantRL, depending on which import path you load. A single agent.train(...) / agent.predict(...) pair hides the trainer API, which is the whole reason FinRL is a popular on-ramp: copy the notebook, plug your tickers in, and a working PPO portfolio agent runs end-to-end in under an hour. The cost of that on-ramp is that the trainer is somebody else's: when SB3 deprecates an arg or RLlib changes its config shape, you inherit the churn. FinRL-Meta and the newer FinRL-X repo modernize the data layer and the deployment story, but the core "agent over baked env" pattern is the same.
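The facade pattern itself is small enough to sketch. The version below is a hypothetical illustration, not FinRL's DRLAgent: `AgentFacade`, `StubBackend`, and every method signature are invented here, and the stub stands in for a real trainer so the example runs without SB3, RLlib, or ElegantRL installed.

```python
class StubBackend:
    """Stand-in trainer (hypothetical) so the sketch has no third-party dependencies."""

    def __init__(self, name):
        self.name = name

    def fit(self, env, total_steps, **backend_kwargs):
        # A real backend would train here; backend_kwargs are its library-specific knobs.
        return {"backend": self.name, "steps": total_steps, "extras": backend_kwargs}

    def act(self, model, observation):
        return 0  # placeholder action


class AgentFacade:
    """Hypothetical facade: one train()/predict() pair in front of several trainers."""

    def __init__(self, backends):
        self._backends = backends
        self._active = None

    def train(self, backend, env, total_steps):
        # Only arguments shared by every backend are exposed, so library-specific
        # options (backend_kwargs above) cannot be reached through this method.
        if backend not in self._backends:
            raise ValueError(f"unknown backend: {backend!r}")
        self._active = self._backends[backend]
        return self._active.fit(env, total_steps)

    def predict(self, model, observation):
        return self._active.act(model, observation)


facade = AgentFacade({"sb3": StubBackend("sb3"), "elegantrl": StubBackend("elegantrl")})
model = facade.train("elegantrl", env=None, total_steps=100)
```

The design choice to notice is in `train`: keeping one signature across backends is what makes the on-ramp short, and it is also why backend-specific features stay out of reach until you call the underlying library directly.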

What the runtime makes hard

Two things. First, escaping the contract once you outgrow it. If you want intra-day fills with realistic slippage, market impact, or a non-Box action space (place a limit order at a price level, cancel a working order), you are either subclassing the env or rewriting it — the framework helps you less the further you stray. Second, validating that the trained policy survives a real exchange's microstructure. The default close-price fill is a generous assumption; PnL curves in the FinRL backtest can be optimistic relative to anything with an order book, and FinRL does not warn you about that — the burden is on you to know it. The pragmatic move when these bite is to use FinRL for data prep and baselines and run final evaluation on an LOB simulator like ABIDES-Gym.
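A little arithmetic shows how much the close-price fill can flatter a backtest. The helper below is a sketch with made-up numbers, not a calibrated slippage model: `round_trip_pnl` and its parameters are hypothetical, and slippage is modeled as a fixed number of basis points paid on each side of the trade.

```python
def round_trip_pnl(entry_close, exit_close, shares, cost_bps=0.0, slippage_bps=0.0):
    """PnL of a buy-then-sell round trip with per-side cost and slippage in basis points."""
    buy_px = entry_close * (1 + slippage_bps / 1e4)   # pay up on the way in
    sell_px = exit_close * (1 - slippage_bps / 1e4)   # give up on the way out
    gross = (sell_px - buy_px) * shares
    fees = (buy_px + sell_px) * shares * cost_bps / 1e4
    return gross - fees


# Illustrative numbers only: a 20 bps favourable move on 1,000 shares, 1 bp of fees.
ideal = round_trip_pnl(100.0, 100.2, 1000, cost_bps=1.0)                        # fills at close
realistic = round_trip_pnl(100.0, 100.2, 1000, cost_bps=1.0, slippage_bps=5.0)  # 5 bps per side
```

With these inputs the close-price fill reports a profit of about 180, while 5 bps of slippage per side leaves about 80. The shorter the holding period and the smaller the expected move, the larger the share of backtest PnL that is really just the fill assumption.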

TensorTrade — deep dive

[Figure: TensorTrade architecture. Data feeds, an exchange abstraction, action schemes, reward schemes, and observers all plug into a TradingEnv at runtime. The environment is data-source agnostic; you assemble the simulation contract rather than inherit one. Panels: Pluggable components (DataFeed, Exchange & Instruments, ActionScheme, RewardScheme, Observer / Renderer); TradingEnv (Gymnasium) with Wallet, Portfolio, Orders, and the step() loop; Your agent / trainer (Ray Tune, SB3, bring your own); Backtest.]
TensorTrade is composable: the env is a thin shell that wires together the plug-ins you pass in, so the simulation contract is whatever you assembled.

The simulation contract

TensorTrade's TradingEnv is intentionally hollow. The action contract comes from an ActionScheme object you supply (the bundled ones are BSH, a binary buy-sell-hold, and ManagedRiskOrders for stop-loss / take-profit; everything else is custom). The reward contract comes from a RewardScheme (SimpleProfit, RiskAdjustedReturns for Sharpe-ish, or your own). The exchange contract comes from an Exchange object that defines the slippage and commission models. Asset universe and currency are typed objects you instantiate before passing them in. This means there is no "default trading env" in the FinRL sense — the env is the composition of choices you made at construction time, and two TensorTrade users with the same library are usually running two materially different simulations.
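The composition idea can be sketched without the library. Nothing below is TensorTrade's API: `ComposedEnv`, the three plug-in signatures, and the example plug-ins are hypothetical stand-ins that only illustrate how the same env shell yields different simulations depending on what is passed in.

```python
from dataclasses import dataclass
from typing import Callable

# Plug-in signatures (all hypothetical, for illustration only):
ActionScheme = Callable[[int, float], float]     # (action, position) -> target position
FillModel = Callable[[float, float], float]      # (price, signed qty) -> fill price
RewardScheme = Callable[[float, float], float]   # (old equity, new equity) -> reward


@dataclass
class ComposedEnv:
    prices: list[float]
    action_scheme: ActionScheme
    fill_model: FillModel
    reward_scheme: RewardScheme
    cash: float = 1_000.0
    position: float = 0.0
    t: int = 0

    def equity(self) -> float:
        return self.cash + self.position * self.prices[self.t]

    def step(self, action: int):
        old = self.equity()
        qty = self.action_scheme(action, self.position) - self.position
        if qty:
            self.cash -= qty * self.fill_model(self.prices[self.t], qty)
            self.position += qty
        self.t += 1
        done = self.t == len(self.prices) - 1
        return self.reward_scheme(old, self.equity()), done


def buy_sell_hold(action, position):
    """0 = hold, 1 = go long one unit, 2 = go flat."""
    return {0: position, 1: 1.0, 2: 0.0}[action]

def fill_at_price(price, qty):
    return price

def fill_with_slippage(price, qty, bps=10):
    return price * (1 + bps / 1e4 * (1 if qty > 0 else -1))

def simple_profit(old, new):
    return new - old

def run(env, actions):
    total = 0.0
    for a in actions:
        reward, done = env.step(a)
        total += reward
        if done:
            break
    return total
```

Running the same prices and the same action sequence through `ComposedEnv(prices, buy_sell_hold, fill_at_price, simple_profit)` and through the variant built with `fill_with_slippage` gives different episode returns. That is the sense in which two users of the same library can be running two different simulations.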

Agent and trainer

TensorTrade does not bundle an RL implementation. The env is a Gymnasium-compatible object, so any external trainer plugs in directly: the project's tutorials lean on Ray RLlib through Ray Tune for hyper-parameter search, but Stable-Baselines3 and CleanRL work without modification because the contract is just step / reset / observation_space / action_space. This is a deliberate division of labor — TensorTrade is the env library, not the agent library, which means when SB3 ships a new PPO variant or CleanRL adds a recurrent-policy fix you inherit it for free, and when you want to swap PPO for SAC you change one import line, not the framework.
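Because the contract is only reset and step, a rollout loop written against the Gymnasium call signatures works on any conforming env. The sketch below avoids importing gymnasium so it stays self-contained; `CountdownEnv` is a hypothetical toy, and a real env would subclass `gymnasium.Env` and declare `observation_space` and `action_space`.

```python
class CountdownEnv:
    """Toy env that follows the Gymnasium reset/step signatures (hypothetical example)."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return self.t, {}                      # (observation, info)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        terminated = self.t >= self.horizon
        return self.t, reward, terminated, False, {}   # obs, reward, terminated, truncated, info


def rollout(env, policy, max_steps=1_000):
    """Run one episode with any callable policy; returns the undiscounted return."""
    obs, _info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            break
    return total


always_act = rollout(CountdownEnv(horizon=3), policy=lambda obs: 1)
```

`rollout` never inspects the env's internals, which is the property that lets a trainer be swapped without touching the env.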

What the runtime makes hard

Composability is a tax. Spinning up a first TensorTrade env is meaningfully more work than spinning up a first FinRL env because there is no opinionated default — you choose the action scheme, the reward shape, the slippage model, and the data feed before anything runs. For teams that know what shape they want this is a feature; for newcomers without a strong opinion it is a blank canvas that delays the first learning curve. The other sharp edge is that fidelity is bounded by the plug-in. The default Exchange uses commission percentages and an OHLCV feed; if you want queue position and realistic latency you have to bring your own simulator behind the Exchange interface, which is doable but not what the library hands you. The honest framing: TensorTrade gives you the contract you asked for, not the contract the market gives you.

ABIDES-Gym — deep dive

[Figure: ABIDES-Gym architecture. ABIDES-Gym wraps the ABIDES discrete-event multi-agent limit-order-book simulator: a synthetic exchange runs against background trader agents while a Gym wrapper exposes one focal RL agent's observation and action. Episodes are measured in microseconds, observations are LOB snapshots, and actions are real order types. Panels: ABIDES kernel (exchange agent with LOB, background trader agents, message bus); RL Gym wrapper (obs space, action space, reward); RL agent (yours); What you get.]
ABIDES-Gym wraps a discrete-event multi-agent LOB simulator: the agent trades inside a synthetic exchange populated by other agents that generate realistic order flow.

The simulation contract

ABIDES-Gym is not an RL framework in the FinRL or TensorTrade sense — it is a Gym wrapper bolted onto the ABIDES discrete-event simulator. The simulation contract is whatever the simulator says happens: the exchange agent maintains a real limit order book with price-time priority and discrete tick sizes; background trader agents (noise, value, momentum, reference market makers) act on their own arrival schedules and place real LIMIT / MARKET / CANCEL messages that move the book; latency is modeled at every hop; partial fills and queue position fall out of the kernel rather than being approximated. Your RL agent is one more agent inside the kernel — the "experimental" agent — driven by a Gymnasium step() that resumes the simulator until the agent's next decision moment. Actions are real order types (place, cancel, modify); observations are LOB snapshots; rewards are realized PnL slices or an execution-quality benchmark such as Almgren-Chriss.
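Price-time priority is what makes queue position and partial fills "fall out" of the simulation rather than being approximated. The sketch below is a toy, not ABIDES code: `AskSide` and `RestingOrder` are hypothetical names, prices are integer ticks, and only a market buy against resting asks is modeled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RestingOrder:
    owner: str
    qty: int


class AskSide:
    """Ask side of a toy limit order book with price-time priority (hypothetical)."""

    def __init__(self):
        self.levels = {}   # price in ticks -> FIFO deque of RestingOrder

    def add(self, owner, price, qty):
        self.levels.setdefault(price, deque()).append(RestingOrder(owner, qty))

    def queue_ahead(self, owner, price):
        """Shares resting ahead of the owner's first order at this price."""
        ahead = 0
        for order in self.levels.get(price, ()):
            if order.owner == owner:
                return ahead
            ahead += order.qty
        raise KeyError(f"{owner} has no order at {price}")

    def match_market_buy(self, qty):
        """Fill a market buy; returns (owner, price, filled_qty) in execution order."""
        fills = []
        for price in sorted(self.levels):        # best (lowest) ask first
            queue = self.levels[price]
            while queue and qty > 0:
                head = queue[0]                  # earliest arrival at this price
                take = min(qty, head.qty)
                fills.append((head.owner, price, take))
                head.qty -= take
                qty -= take
                if head.qty == 0:
                    queue.popleft()
            if qty == 0:
                break
        for price in [p for p, q in self.levels.items() if not q]:
            del self.levels[price]               # drop emptied price levels
        return fills


book = AskSide()
book.add("market_maker", 101, 50)
book.add("agent", 101, 30)       # same price, later arrival: 50 shares are ahead
book.add("other", 102, 100)
fills = book.match_market_buy(60)
```

Here the agent's 30-share order is only partially filled (10 shares) because 50 shares at the same price arrived first. An OHLCV env that fills at the close has no way to represent that outcome.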

Agent and trainer

The repo ships two benchmark envs — a daily-investor env and an execution / liquidation env — plus the simulator and example training scripts. The agent and trainer themselves are your problem: ABIDES-Gym exposes a Gym interface, so SB3 PPO, RLlib, or custom PyTorch all plug in directly. The reference ABIDES-Gym paper uses Deep Dueling Double Q-learning with the APEX architecture; recent extensions like ABIDES-MARL adapt the kernel for multi-agent RL where several adaptive agents learn simultaneously inside the same book. The trainer side is intentionally thin: the value here is the simulator, not the policy code.

What the runtime makes hard

Throughput. ABIDES is single-threaded discrete-event Python: the message bus processes one event at a time, with priority-queue ordering by arrival timestamp. A day of simulated NASDAQ trading takes meaningful wall-clock time, and you cannot trivially vectorize the simulator the way you would a stateless OHLCV env, because each event can change the book that the next event sees. The standard workaround is process-level parallelism (multiple simulator processes through SubprocVecEnv), which scales roughly linearly with CPU cores but gets no help from GPUs. The second sharp edge is calibration: the background-trader populations have parameters (noise-trader arrival rate, value-trader signal noise, market-maker spread) that you have to tune to get order flow that resembles a real venue, and "resembles" is a judgement call. The honest position: ABIDES-Gym gives you the most realistic simulation contract in this set, and the price is single-process throughput plus a calibration project.
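The event loop described above can be sketched in a few lines, which also shows why it resists vectorization. This is a toy built on the standard library, not the ABIDES kernel: `Kernel`, its method names, and the unit-free integer timestamps are assumptions made for illustration.

```python
import heapq
import itertools


class Kernel:
    """Toy discrete-event loop: one event at a time, ordered by delivery time."""

    def __init__(self, latency=0):
        self.now = 0
        self.latency = latency            # fixed per-hop latency (arbitrary integer units)
        self._queue = []
        self._seq = itertools.count()     # tie-breaker: equal timestamps keep send order
        self.log = []

    def send(self, recipient, message, delay=0):
        deliver_at = self.now + delay + self.latency
        heapq.heappush(self._queue, (deliver_at, next(self._seq), recipient, message))

    def run(self, handlers, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, recipient, message = heapq.heappop(self._queue)
            self.log.append((self.now, recipient, message))
            handlers[recipient](self, message)   # a handler may send further messages


def exchange(kernel, message):
    kernel.send("trader", "ACK:" + message)      # the reply also pays the hop latency

def trader(kernel, message):
    pass

kernel = Kernel(latency=5)
kernel.send("exchange", "LIMIT_ORDER", delay=10)
kernel.send("exchange", "CANCEL")
kernel.run({"exchange": exchange, "trader": trader}, until=100)
```

The cancel was sent second but is delivered first (time 5 versus 15), and each acknowledgement is scheduled only once its triggering event has been handled. That dependency of later events on earlier handlers is what forces sequential execution inside one simulator process.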

ElegantRL — deep dive

[Figure: ElegantRL architecture. ElegantRL is an RL engineering library first: thousands of vectorized environments on a single GPU through Isaac-Gym-style parallel sim, an Actor/Critic agent in PyTorch, and an Evaluator-driven training loop. Finance is one of many application domains; the trading env is a plug-in like any other. Panels: Single GPU with massively parallel rollouts (VecEnv, replay / rollout buffer on GPU); Actor network; Critic network; Env plug-in (Isaac Gym tasks, FinRL-Meta StockTradingEnv, Atari / MuJoCo / custom); Evaluator + Worker (Podracer).]
ElegantRL is RL-engineering-first: massively parallel env rollouts on one GPU, clean Actor/Critic PyTorch primitives, and an env plug-in slot where finance is one task among many.

The simulation contract

ElegantRL is the odd one out: it is an RL library, not a trading framework. The simulation contract is whatever the env you bolt on says it is. The shipped finance demos use a FinRL-Meta StockTradingEnv running as a vectorized tensor on the GPU — share-prices indexed as one dimension of a batch, instead of robot state — so the contract there is an OHLCV-shaped one inherited from the FinRL-Meta repo. The library's center of gravity is somewhere else: an Isaac-Gym-style vectorized env loop that runs 4 k–16 k parallel envs on a single GPU with no CPU-side copy, a clean Actor/Critic separation, and a Podracer cloud-native layer that scales the same code to hundreds of GPUs via a tournament ensemble of pods.

Agent and trainer

This is the half ElegantRL takes most seriously. The repo ships its own clean-room PyTorch implementations of DQN, Double DQN, D3QN, REDQ, A2C, PPO, DDPG, TD3, and SAC, all conforming to a shared Agent.update_net(buffer) contract, with twin-Q targets, GAE, KL leashes, and the rest of the engineering primitives. The point is not "we re-implemented PPO"; it is "we re-implemented PPO so the rollouts can stay on-device and the buffer can be a GPU tensor, not a CPU queue." Compared with SB3 (which prioritizes algorithmic clarity over raw throughput) and RLlib (which prioritizes cluster scale-out over per-GPU efficiency), ElegantRL chose throughput per GPU as its design point.
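Of the primitives named above, GAE is the easiest to show in isolation. The function below is the standard generalized-advantage-estimation recursion written in plain Python for readability; it is not ElegantRL's implementation, which per the description above operates on batched GPU tensors. The function name and default values are assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one rollout segment with no terminal step.

    rewards[t] and values[t] cover steps 0..T-1; last_value is the critic's
    estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running                # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `lam=0` the result reduces to the one-step TD errors, and with `gamma=1, lam=1` it becomes the Monte Carlo return minus the value estimate. The parameter `lam` moves between those two ends of the bias-variance trade-off.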

What the runtime makes hard

Two things. First, the finance env you get is not the library's center of gravity — it is a port of FinRL-Meta running as a vec-env, which means the simulation contract is still OHLCV with the same close-price-fill limitations FinRL has. ElegantRL does not improve the realism of trading simulation; it improves how fast you can train against whatever realism the env provides. Second, the parallel-env story is glorious on a workstation GPU and noticeably more complex when the env is not vectorizable on the device — an ABIDES-style LOB simulator does not vectorize on the GPU at all, so ElegantRL's throughput advantage collapses for high-fidelity simulators. Pick ElegantRL when the trainer is your bottleneck and the env is cheap to vectorize; pick something else when the env is the expensive part.

Cross-cutting comparison

Who owns the simulation contract

[Figure: Who owns the simulation contract. Four-column comparison: FinRL hardcodes finance assumptions inside the env (action shape, cost model, episode); TensorTrade leaves the contract for you to assemble through ActionScheme / RewardScheme plug-ins; ABIDES-Gym derives the contract from a discrete-event LOB simulator; ElegantRL inherits whatever the env plug-in defines and focuses on the algorithm side.]
The headline axis the feature lists hide. Four answers to "where do the action shape, fill model, slippage, reward, and episode boundary actually come from?"

Strip everything else away and this is the axis that decides whether the right project is the one you started with. FinRL writes the contract for you (the action is one continuous value per stock, fills happen at close, costs are bps, each step is one daily bar), which is exactly right when "trade a US-equity basket on daily bars" is the actual task, and exactly wrong when it is not. TensorTrade hands you the contract as a checklist of plug-ins: pick the ActionScheme, pick the RewardScheme, pick the Exchange, and the env is the composition of those choices, which is liberating once you know what you want and exhausting before then. ABIDES-Gym does not let you choose the contract; the simulator chooses it for you, and the simulator chose "what an order book actually does," so your action space is real order types and your fills come from queue position. ElegantRL is the meta-position: the contract is whatever env plug-in you pass to its trainer, so the question for ElegantRL users is whose env you adopted, not what ElegantRL itself believes. If your research question is about the algorithm (better credit assignment, better exploration), ElegantRL's neutrality is the point; if your research question is about the market, the contract has to come from somewhere with an opinion.

RL machinery — which algorithms are first-class

[Figure: RL machinery, first-class algorithms and trainer integration. Four-column comparison: FinRL wraps Stable-Baselines3 / RLlib / ElegantRL behind a DRLAgent facade; TensorTrade leans on Ray RLlib and supports any Gymnasium-compatible trainer; ABIDES-Gym is a Gym wrapper so trainer choice is yours; ElegantRL ships its own PyTorch implementations of PPO / SAC / A2C / TD3 designed for massively parallel rollouts.]
Who implements the agent loop — and what that implementation is optimized for.

All four projects nominally support the same canonical algorithms (PPO, SAC, A2C, DQN, often DDPG and TD3) — what differs is who implements them and what those implementations are tuned for. FinRL does not implement any of them itself; DRLAgent is a facade that delegates to whichever of SB3, RLlib, or ElegantRL you import, which means your algorithm is really their algorithm and you inherit their churn. TensorTrade also implements nothing — the env is a Gymnasium contract and you bring SB3 or Ray RLlib or CleanRL — which keeps the project small but means upgrades happen out-of-band. ABIDES-Gym is again a Gym wrapper, so SB3 or RLlib is the typical pairing; the reference paper used a custom Deep Dueling Double Q-learning agent with APEX prioritized replay because that combination matched the execution-quality task. ElegantRL is the one that wrote its own implementations end-to-end, optimized for keeping rollouts on the GPU as device tensors rather than CPU queues — the throughput delta against SB3 is large when the env vectorizes on-device, and irrelevant when it does not. If the trainer is your bottleneck, ElegantRL is the answer; if integration with the rest of your team's stack matters more, SB3-backed (which FinRL, TensorTrade, and ABIDES-Gym all support) is the safer choice.

Realism vs throughput

[Figure: Realism vs throughput, what each library trades. Four-column comparison: FinRL uses OHLCV bars and fills at close (fast but coarse); TensorTrade lets the user pick fidelity by exchange plug-in; ABIDES-Gym simulates a tick-by-tick LOB with realistic latency (highest fidelity, slowest); ElegantRL hits the highest training throughput by running thousands of envs on a GPU, but its trading env is OHLCV-shaped, not microstructure-aware.]
The trade is real: every project pays for fidelity in throughput or vice versa.

Every choice here is a position on the same trade-off curve. FinRL accepts that fills at close-price are unrealistic in exchange for daily-bar simulations that run in seconds — fine for portfolio allocation research, dangerous for short-horizon trading where slippage is the whole story. TensorTrade puts the choice in your hands: fidelity is whatever the Exchange plug-in you wrote provides, with realistic options requiring real engineering. ABIDES-Gym accepts that the simulator is single-threaded discrete-event Python in exchange for queue-position-accurate fills, modeled latency, and endogenous market impact from background traders — for execution-style research where the only honest answer is "trade against an order book," nothing else in this set is in the same league. ElegantRL inverts the question entirely: assume the env is cheap and ask how many parallel rollouts a single GPU can sustain, then scale that across pods. The result is glorious training throughput when the env is vectorizable on-device (Isaac Gym, OHLCV stock env) and a hard wall when it is not (ABIDES, anything message-driven). The pragmatic two-phase pattern is to use a fast, lower-fidelity env for hyper-parameter search and policy class selection, then validate the chosen policy on a high-fidelity env before believing the result — the four projects in this set neatly cover the two halves of that pattern.
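The two-phase pattern is simple to express as code. The sketch below is an illustration under stated assumptions: `two_phase_select` is a hypothetical helper, the two scoring callables stand in for "evaluate on the fast env" and "evaluate on the high-fidelity env", and the scores in the usage example are invented numbers, not results.

```python
def two_phase_select(candidates, fast_score, slow_score, keep=2):
    """Rank every candidate on the cheap env, then re-score only the best few on the costly one."""
    shortlist = sorted(candidates, key=fast_score, reverse=True)[:keep]
    validated = {name: slow_score(name) for name in shortlist}
    best = max(validated, key=validated.get)
    return best, validated


# Invented scores for illustration: the fast env flatters "ppo_a".
fast = {"ppo_a": 1.8, "ppo_b": 1.5, "sac_a": 0.9}
slow = {"ppo_a": 0.2, "ppo_b": 0.7, "sac_a": 0.8}
best, validated = two_phase_select(fast, fast.get, slow.get, keep=2)
```

In this made-up case the winner on the fast env loses on the high-fidelity env, and the candidate that would have done best under validation never made the shortlist. That is the known weakness of the pattern: the fast env decides what gets validated, so `keep` should be generous when the two envs are expected to disagree.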

When to pick which

| Use case | Pick FinRL if… | Pick TensorTrade if… | Pick ABIDES-Gym if… | Pick ElegantRL if… |
| --- | --- | --- | --- | --- |
| Daily-bar portfolio research | Yes: the default env is exactly this; one notebook gets you running. | Workable, but you assemble the contract yourself. | Overkill: the LOB simulator is wasted at daily frequency. | Use it as the trainer behind FinRL-Meta's vec-env stock task. |
| Custom action / reward / cost model | You subclass the env; possible, but it fights the framework. | Yes: plug in your own ActionScheme / RewardScheme / Exchange. | Action shape is the simulator's; reward is yours to define. | Env is yours to design; the trainer does not care. |
| Execution / market-making research | Wrong abstraction: fills at close hide the problem. | Possible with a custom LOB Exchange plug-in, but real work. | Yes: designed for this; queue position and latency are modeled. | Not on its own; ElegantRL trainer is fine but it needs an LOB env. |
| Trainer throughput is the bottleneck | Switch the FinRL backend to ElegantRL; it is supported. | SB3 with SubprocVecEnv; CPU-bound. | Single-thread simulator is the bottleneck, not the trainer. | Yes: vectorized GPU rollouts are the design point. |
| Newcomer wanting the quickest first agent | Yes: the on-ramp is the shortest of the four. | Longer ramp; you assemble before you train. | Steepest learning curve; you are also learning ABIDES. | RL-engineering knowledge assumed; not the gentlest entry. |

FAQ

If FinRL already wraps ElegantRL, why pick ElegantRL directly?

FinRL's DRLAgent facade is convenient but lossy — to keep one shared interface across SB3 / RLlib / ElegantRL it has to expose the lowest common denominator of their APIs. Going to ElegantRL directly buys you access to the vectorized-env trainer loop, the on-device replay buffer, and the Podracer multi-pod scale-out — none of which is reachable through the facade. The right rule is: start with FinRL when the question is "does this policy class learn anything?" and drop to ElegantRL when the question is "how fast can I sweep hyper-parameters?"

Is "MarketGym" a real project? Why is this post about ABIDES-Gym instead?

"MarketGym" appears in some survey papers as a generic label for Gym-style market environments, but there is no single canonical project under that name that is actively maintained in 2026; the closest matches, such as Yvictor/TradingGym, thedimlebowski/Trading-Gym, and hackthemarket/gym-trading, have been quiet for years. The role such a project would play in this comparison is the LOB / microstructure-realism slot, and ABIDES-Gym (J.P. Morgan AI Research, on top of the ABIDES simulator) is the live, well-cited reference implementation for that slot. We swap it in explicitly rather than silently to keep the comparison honest. If you saw "MarketGym" in a 2022-era paper and were chasing the same idea, ABIDES-Gym is what you actually want.

Can I just use Stable-Baselines3 directly with a custom trading env and skip all four?

Yes, and many teams do — once you understand what the simulation contract should be, "SB3 + your env" is the smallest dependency footprint of the lot. The reason these four projects exist is that writing a good trading env is hard, and each one front-loads a different chunk of that work: FinRL writes the env for you, TensorTrade writes the env's plumbing for you, ABIDES-Gym writes the simulator for you, and ElegantRL writes the trainer for you. You can absolutely skip them; you will just write more of it yourself.

How does RL-for-trading relate to RL-for-tool-use, which the rest of the wiki talks about?

They share the deep machinery (policy gradients, value functions, exploration) and split sharply on the credit-assignment story. RL-for-trading has a relatively dense reward, since every step produces a PnL delta, and the hard part is whether the env's slippage and cost assumptions match the real venue. RL for agentic tool use has a sparse, often-terminal reward, and the hard part is whether the verifier you reward against is trustworthy. The intuitions covered in "RL for tool use and multi-step tasks" and "Reward design and reward hacking" transfer directly, and the trading-specific failure mode (a policy that "wins" because the slippage model is too kind) is exactly the reward-hacking pattern in a financial costume.

Which of these is closest to a "real" trading agent in production?

None of the four is a production trading system on its own — they are research and prototyping platforms. The honest pipeline is: prototype with FinRL or TensorTrade to validate the policy class, train at scale with ElegantRL once the env is vectorizable, validate execution behavior on ABIDES-Gym against background-trader populations before believing any backtest, then port the chosen policy out to whatever execution gateway your venue exposes. Treating any single one as the whole stack is exactly the failure mode this post is written against.

Does it matter that FinRL and ElegantRL are both AI4Finance projects?

It matters in a good way: they are designed to compose. FinRL imports ElegantRL as one of its backend options, FinRL-Meta supplies vectorized envs that ElegantRL trains efficiently, and the FinRL-Podracer paper shows the cloud-native scale-out story end-to-end. The downside is the obvious one — if you are betting on an organization, you are betting on one organization across two libraries. Diversifying upstream is exactly what TensorTrade (community-maintained) and ABIDES-Gym (J.P. Morgan) buy you for half of the stack.

Further reading

On this wiki:

  • The agent loop — the observe → decide → act → reward cycle that every Gymnasium trading env implements, and the same cycle an agentic AI runs at a different timescale.
  • Tools, actions, and environments — why the env contract is the most consequential design choice in any RL system, and what makes one env contract more honest than another.
  • Prompt, fine-tune, or RL? — the decision tree that puts RL in context: pick the cheapest lever that closes the gap, and only reach for RL when the others run out.
  • RL for tool use and multi-step tasks — why RL over multi-step trajectories is hard, and why a trustworthy verifier (or slippage model, or LOB simulator) is the whole game.
  • Reward design and reward hacking — the reward is always a proxy; in trading, the slippage and cost assumptions are the proxy, and a backtest that looks too good is reward hacking in a financial costume.
  • RLHF and RLAIF — the algorithmic family (PPO, GRPO, DPO) these trading libraries also draw from, and why the same engineering primitives reappear across domains.
  • AI in the trading stack — the four-layer landscape (signal · sizing · execution · risk) that locates these RL libraries inside a working trading shop: they live almost entirely in the execution layer.
  • Agentic AI for trading research — the companion post on LLM-agent firms upstream of the RL execution layer; together with this comparison, the two posts cover the LLM-and-RL split of the agentic trading stack.

Project sources:

  • FinRL repo — the original framework, with StockTradingEnv, DRLAgent, and bundled tutorials.
  • FinRL-Meta — the dynamic-dataset and vectorized-env layer used by FinRL and ElegantRL.
  • TensorTrade repo — the composable Gymnasium env library, with ActionScheme / RewardScheme / Exchange plug-ins.
  • ABIDES-Gym / abides-jpmc-public — the public ABIDES distribution with the ABIDES-Markets and ABIDES-Gym extensions.
  • ElegantRL repo — the massively parallel deep-RL library and Podracer cloud-native trainer.