FinRL vs TensorTrade vs ABIDES-Gym vs ElegantRL: Who Controls the Simulation Contract


Read the README of any of these four projects and the feature lists look almost identical: a Gymnasium env, OHLCV ingest, PPO/SAC/A2C/DQN out of the box, a backtest curve. The thing that decides which one survives a serious research-or-prod loop is invisible there: who controls the simulation contract — the action shape, the cost model, the slippage assumption, the reward, where an episode ends. FinRL bakes opinionated finance assumptions into the env so you trade off control for convenience. TensorTrade keeps the env data-source-agnostic and asks you to assemble the contract from plug-ins. ABIDES-Gym derives the contract from a discrete-event LOB simulator that other agents trade inside. ElegantRL is an RL-engineering library that treats trading as one application — the env is your problem, the trainer's job is to be fast. Pick on that axis first; everything else follows.

At a glance

Four RL-for-trading projects, four answers to the same question — what does the env decide for you, and what do you have to bring yourself. The table lists the basics; the matrix below it shows where each one leans hardest across the axes that actually differ.

Project	Released / maintainer	Primary niche	Where it runs
FinRL	2020, AI4Finance Foundation	End-to-end financial RL framework with baked-in finance assumptions	Local Python / notebook; demos for cloud GPUs
TensorTrade	2019, tensortrade-org (community)	Composable Gymnasium env assembled from action / reward / exchange plug-ins	Local Python; Ray Tune for scale-out
ABIDES-Gym	2021, J.P. Morgan AI Research	Gym wrapper around the ABIDES multi-agent LOB simulator	Local Python; single-threaded simulator process
ElegantRL	2021, AI4Finance Foundation	Cloud-native, massively parallel deep-RL library (finance is one app)	Single GPU; multi-pod cluster via Podracer

Snapshot: 2026-06-02. The "MarketGym" name circulating in survey papers is not a single canonical project — we substitute ABIDES-Gym as the LOB-microstructure peer because it is the actively maintained reference implementation. Frameworks move quickly; verify against current docs.

Figure: Where each project leans hardest. The axes converge on the surface and pull apart where the simulation contract is decided.

FinRL — deep dive

FinRL is a vertically integrated stack: data → finance-opinionated env → DRLAgent facade → backtest, with the simulation contract baked into the env layer.

The simulation contract

FinRL ships StockTradingEnv, PortfolioOptimizationEnv, and a small zoo of crypto / paper-trading envs whose defaults are not neutral: they are finance opinions written into code. The action space is a Box over [-1, 1] with one entry per stock, interpreted as a trade size scaled by hmax; reward is the change in portfolio value with a bps-scale transaction cost subtracted; fills happen at the close price of the bar; an episode ends when cash runs out or the price series does; and there is an optional turbulence index that flattens positions during volatile regimes. The magnitudes can be tuned, but the structure is not negotiable through configuration knobs alone: to change it you subclass the env and override the relevant methods. That is the FinRL deal: the contract is the framework's answer to "how a portfolio agent trades a basket of US stocks at daily frequency," and most of the work is already done.
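
To make that contract concrete, here is a minimal single-stock sketch of the same shape: an action in [-1, 1] scaled by hmax, a fill at the bar's close, a proportional cost, and a reward equal to the change in portfolio value. ToyDailyBarEnv and all of its fields are hypothetical names for illustration; this is not FinRL's StockTradingEnv, and the default cost rate is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class ToyDailyBarEnv:
    """Hypothetical single-stock env with a close-price-fill contract (not FinRL code)."""
    closes: list            # one close price per daily bar
    hmax: int = 100         # maximum shares traded in one step
    cost_rate: float = 0.001  # proportional cost per trade (10 bps, assumed)
    cash: float = 10_000.0
    shares: int = 0
    t: int = 0

    def _value(self) -> float:
        # Mark the portfolio to the current bar's close.
        return self.cash + self.shares * self.closes[self.t]

    def step(self, action: float):
        """action in [-1, 1]: negative sells, positive buys, scaled by hmax."""
        action = max(-1.0, min(1.0, action))
        price = self.closes[self.t]  # fill at this bar's close, no slippage
        before = self._value()
        qty = int(action * self.hmax)
        if qty > 0:
            # Cannot spend more cash than we hold, costs included.
            affordable = int(self.cash // (price * (1 + self.cost_rate)))
            qty = min(qty, affordable)
        else:
            # Cannot sell more shares than we hold (no shorting in this toy).
            qty = max(qty, -self.shares)
        self.cash -= qty * price + abs(qty) * price * self.cost_rate
        self.shares += qty
        self.t += 1
        reward = self._value() - before      # change in portfolio value
        done = self.t >= len(self.closes) - 1  # price series exhausted
        return (self.cash, self.shares, self.closes[self.t]), reward, done
```

Every line in step() is a modelling decision. Changing the fill price, the cost model, or the action meaning means editing the env itself, which is the subclass-and-override situation described above.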

Agent and trainer

The agent layer is a thin DRLAgent facade that delegates to Stable-Baselines3 (PPO, A2C, SAC, TD3, DDPG), RLlib, or ElegantRL, depending on which import path you load. A single agent.train(...) / agent.predict(...) pair hides the trainer API, which is the whole reason FinRL is a popular on-ramp: copy the notebook, plug your tickers in, and a working PPO portfolio agent runs end-to-end in under an hour. The cost of that on-ramp is that the trainer is somebody else's: when SB3 deprecates an arg or RLlib changes its config shape, you inherit the churn. FinRL-Meta and the newer FinRL-X repo modernize the data layer and the deployment story, but the core "agent over baked env" pattern is the same.

What the runtime makes hard

Two things. First, escaping the contract once you outgrow it. If you want intra-day fills with realistic slippage, market impact, or a non-Box action space (place a limit order at a price level, cancel a working order), you are either subclassing the env or rewriting it — the framework helps you less the further you stray. Second, validating that the trained policy survives a real exchange's microstructure. The default close-price fill is a generous assumption; PnL curves in the FinRL backtest can be optimistic relative to anything with an order book, and FinRL does not warn you about that — the burden is on you to know it. The pragmatic move when these bite is to use FinRL for data prep and baselines and run final evaluation on an LOB simulator like ABIDES-Gym.
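
A quick way to see how generous the close-price fill is: re-price the same round trip with adverse slippage on both legs. The sketch below is plain arithmetic, not part of any of the four libraries; round_trip_pnl and the bps figures are illustrative assumptions.

```python
def round_trip_pnl(entry_close: float, exit_close: float, shares: int,
                   slippage_bps: float = 0.0) -> float:
    """PnL of buying at entry and selling at exit, with adverse slippage on both legs."""
    slip = slippage_bps / 10_000.0
    buy_price = entry_close * (1 + slip)   # pay up when buying
    sell_price = exit_close * (1 - slip)   # receive less when selling
    return shares * (sell_price - buy_price)


# Same trade (buy 1000 shares at 100.00, sell at 100.50) under three fill assumptions.
for bps in (0, 10, 30):
    print(bps, round(round_trip_pnl(100.0, 100.5, 1000, slippage_bps=bps), 2))
```

With these numbers the trade earns 500.00 at a pure close-price fill, 299.50 at 10 bps of slippage per leg, and loses 101.50 at 30 bps. A short-horizon policy whose edge per trade is this small can flip sign purely because of the fill assumption.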

TensorTrade — deep dive

TensorTrade is composable: the env is a thin shell that wires together the plug-ins you pass in, so the simulation contract is whatever you assembled.

The simulation contract

TensorTrade's TradingEnv is intentionally hollow. The action contract comes from an ActionScheme object you supply (the bundled ones are BSH, a binary buy-sell-hold, and ManagedRiskOrders for stop-loss / take-profit; everything else is custom). The reward contract comes from a RewardScheme (SimpleProfit, RiskAdjustedReturns for Sharpe-ish, or your own). The exchange contract comes from an Exchange object that defines the slippage and commission models. Asset universe and currency are typed objects you instantiate before passing them in. This means there is no "default trading env" in the FinRL sense — the env is the composition of choices you made at construction time, and two TensorTrade users with the same library are usually running two materially different simulations.
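
The composition idea can be sketched without the library. The classes below are hypothetical stand-ins that mirror the roles of an action scheme, a reward scheme, and an exchange; they are not TensorTrade's classes or signatures, and the slippage and fee values are assumed.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ActionScheme(Protocol):
    def to_order(self, action: int, position: int) -> int: ...  # signed share delta

class RewardScheme(Protocol):
    def reward(self, prev_value: float, value: float) -> float: ...

class Exchange(Protocol):
    def fill_price(self, mid: float, qty: int) -> float: ...
    def commission(self, notional: float) -> float: ...


@dataclass
class LongFlat:
    size: int = 10
    def to_order(self, action: int, position: int) -> int:
        target = self.size if action == 1 else 0  # 1 = be long, 0 = be flat
        return target - position

class ProfitReward:
    def reward(self, prev_value: float, value: float) -> float:
        return value - prev_value

@dataclass
class SimpleExchange:
    slippage: float = 0.0  # fractional adverse price move per fill
    fee: float = 0.0       # fractional commission on notional
    def fill_price(self, mid: float, qty: int) -> float:
        return mid * (1 + self.slippage) if qty > 0 else mid * (1 - self.slippage)
    def commission(self, notional: float) -> float:
        return abs(notional) * self.fee


@dataclass
class ComposedEnv:
    """The env is only the wiring; the contract lives in the three plug-ins."""
    actions: ActionScheme
    rewards: RewardScheme
    exchange: Exchange
    cash: float = 10_000.0
    position: int = 0
    last_value: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.last_value = self.cash

    def step(self, action: int, mid: float) -> float:
        qty = self.actions.to_order(action, self.position)
        if qty != 0:
            price = self.exchange.fill_price(mid, qty)
            self.cash -= qty * price + self.exchange.commission(qty * price)
            self.position += qty
        value = self.cash + self.position * mid
        reward = self.rewards.reward(self.last_value, value)
        self.last_value = value
        return reward


def total_reward(env: ComposedEnv, actions, mids) -> float:
    return sum(env.step(a, m) for a, m in zip(actions, mids))


mids, acts = [100, 102, 101], [1, 1, 0]
frictionless = ComposedEnv(LongFlat(), ProfitReward(), SimpleExchange())
frictional = ComposedEnv(LongFlat(), ProfitReward(), SimpleExchange(slippage=0.001, fee=0.001))
print(total_reward(frictionless, acts, mids), total_reward(frictional, acts, mids))
```

The same action sequence on the same prices earns about 10 in the frictionless composition and about 6 in the frictional one. Nothing changed except the exchange plug-in, which is the sense in which two users of the same library can be running different simulations.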

Agent and trainer

TensorTrade does not bundle an RL implementation. The env is a Gymnasium-compatible object, so any external trainer plugs in directly: the project's tutorials lean on Ray RLlib through Ray Tune for hyper-parameter search, but Stable-Baselines3 and CleanRL work without modification because the contract is just step / reset / observation_space / action_space. This is a deliberate division of labor — TensorTrade is the env library, not the agent library, which means when SB3 ships a new PPO variant or CleanRL adds a recurrent-policy fix you inherit it for free, and when you want to swap PPO for SAC you change one import line, not the framework.
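
The reason any trainer plugs in is that the loop only touches reset and step. The sketch below uses a toy env that follows the Gymnasium return shapes (reset returns an observation and an info dict; step returns observation, reward, terminated, truncated, info) without importing gymnasium. CoinFlipEnv and rollout are hypothetical names, not part of TensorTrade or any trainer.

```python
import random


class CoinFlipEnv:
    """Toy env with Gymnasium-style reset/step return shapes (no gymnasium import)."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0
        self.rng = random.Random()

    def reset(self, *, seed=None, options=None):
        self.rng = random.Random(seed)
        self.t = 0
        return 0, {}  # (observation, info)

    def step(self, action: int):
        self.t += 1
        outcome = self.rng.randint(0, 1)
        reward = 1.0 if action == outcome else -1.0
        truncated = self.t >= self.horizon  # time limit, not a terminal state
        return outcome, reward, False, truncated, {}


def rollout(env, policy, seed: int = 0):
    """Trainer-side loop: works with any object exposing the same reset/step shapes."""
    obs, _ = env.reset(seed=seed)
    total, steps, done = 0.0, 0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        steps += 1
        done = terminated or truncated
    return total, steps


print(rollout(CoinFlipEnv(), policy=lambda obs: 1))
```

Swapping the env for a trading env, or the lambda for a learned policy, does not change rollout at all, which is why changing the agent library is an import-level decision here.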

What the runtime makes hard

Composability is a tax. Spinning up a first TensorTrade env is meaningfully more work than spinning up a first FinRL env because there is no opinionated default — you choose the action scheme, the reward shape, the slippage model, and the data feed before anything runs. For teams that know what shape they want this is a feature; for newcomers without a strong opinion it is a blank canvas that delays the first learning curve. The other sharp edge is that fidelity is bounded by the plug-in. The default Exchange uses commission percentages and an OHLCV feed; if you want queue position and realistic latency you have to bring your own simulator behind the Exchange interface, which is doable but not what the library hands you. The honest framing: TensorTrade gives you the contract you asked for, not the contract the market gives you.

ABIDES-Gym — deep dive

ABIDES-Gym wraps a discrete-event multi-agent LOB simulator: the agent trades inside a synthetic exchange populated by other agents that generate realistic order flow.

The simulation contract

ABIDES-Gym is not an RL framework in the FinRL or TensorTrade sense — it is a Gym wrapper bolted onto the ABIDES discrete-event simulator. The simulation contract is whatever the simulator says happens: the exchange agent maintains a real limit order book with price-time priority and discrete tick sizes; background trader agents (noise, value, momentum, reference market makers) act on their own arrival schedules and place real LIMIT / MARKET / CANCEL messages that move the book; latency is modeled at every hop; partial fills and queue position fall out of the kernel rather than being approximated. Your RL agent is one more agent inside the kernel — the "experimental" agent — driven by a Gymnasium step() that resumes the simulator until the agent's next decision moment. Actions are real order types (place, cancel, modify); observations are LOB snapshots; rewards are realized PnL slices or an execution-quality benchmark such as Almgren-Chriss.
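
Price-time priority and partial fills are easier to appreciate in code. The following is a toy sell-side book written for illustration only; AskBook and RestingOrder are hypothetical names and this is not ABIDES code, which also models latency, cancels, and both sides of the book.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RestingOrder:
    order_id: str
    price: float
    qty: int


class AskBook:
    """Sell side of a toy book: lowest price first, FIFO within a price level."""

    def __init__(self) -> None:
        self.levels = {}  # price -> deque of RestingOrder, oldest at the left

    def add_limit(self, order: RestingOrder) -> None:
        self.levels.setdefault(order.price, deque()).append(order)

    def match_market_buy(self, qty: int) -> list:
        """Consume resting asks; returns (order_id, price, filled_qty) per fill."""
        fills = []
        for price in sorted(self.levels):        # price priority
            queue = self.levels[price]
            while queue and qty > 0:             # time priority within the level
                head = queue[0]
                take = min(qty, head.qty)
                fills.append((head.order_id, price, take))
                head.qty -= take                 # a partial fill leaves the rest queued
                qty -= take
                if head.qty == 0:
                    queue.popleft()
            if not queue:
                del self.levels[price]
            if qty == 0:
                break
        return fills


book = AskBook()
book.add_limit(RestingOrder("A", 10.00, 50))
book.add_limit(RestingOrder("B", 10.00, 30))   # same price as A, but later in the queue
book.add_limit(RestingOrder("C", 10.01, 100))
print(book.match_market_buy(100))
```

A market buy for 100 shares fills A for 50 and B for 30 at 10.00, then takes 20 from C at 10.01, leaving 80 of C resting. Whether B fills at all depends on how much volume arrives after A is served, which is what "queue position" means and what a close-price fill cannot represent.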

Agent and trainer

The repo ships two benchmark envs — a daily-investor env and an execution / liquidation env — plus the simulator and example training scripts. The agent and trainer themselves are your problem: ABIDES-Gym exposes a Gym interface, so SB3 PPO, RLlib, or custom PyTorch all plug in directly. The reference ABIDES-Gym paper uses Deep Dueling Double Q-learning with the APEX architecture; recent extensions like ABIDES-MARL adapt the kernel for multi-agent RL where several adaptive agents learn simultaneously inside the same book. The trainer side is intentionally thin: the value here is the simulator, not the policy code.

What the runtime makes hard

Throughput. ABIDES is single-threaded discrete-event Python: the message bus processes one event at a time, with priority-queue ordering by arrival timestamp. A day of simulated NASDAQ trading takes meaningful wall-clock time, and you cannot trivially vectorize the simulator the way you would a stateless OHLCV env. The standard workaround is process-level parallelism (multiple simulator processes through SubprocVecEnv), which scales roughly with the number of CPU cores but gains nothing from GPUs. The second sharp edge is calibration: the background-trader populations have parameters (noise-trader arrival rate, value-trader signal noise, market-maker spread) that you have to tune to get order flow that resembles a real venue, and "resembles" is a judgement call. The honest position: ABIDES-Gym gives you the most realistic simulation contract in this set, and the price is single-process throughput plus a calibration project.
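
The throughput limit follows from the event loop itself. Below is a minimal discrete-event kernel built on a priority queue; Kernel, the handler names, and the 50 ns latency are illustrative assumptions, not ABIDES internals.

```python
import heapq
import itertools


class Kernel:
    """Toy discrete-event loop: always pops the earliest-timestamped message."""

    def __init__(self) -> None:
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps same-timestamp events FIFO
        self.now = 0
        self.log = []

    def schedule(self, at: int, recipient: str, message: str) -> None:
        heapq.heappush(self._queue, (at, next(self._seq), recipient, message))

    def run(self, handlers) -> None:
        while self._queue:
            self.now, _, recipient, message = heapq.heappop(self._queue)
            self.log.append((self.now, recipient, message))
            handlers[recipient](self, message)  # a handler may schedule more events


LATENCY_NS = 50  # assumed one-way latency between agent and exchange


def trader(kernel: Kernel, message: str) -> None:
    if message == "wake":
        kernel.schedule(kernel.now + LATENCY_NS, "exchange", "limit_order")


def exchange(kernel: Kernel, message: str) -> None:
    if message == "limit_order":
        kernel.schedule(kernel.now + LATENCY_NS, "trader", "order_accepted")


kernel = Kernel()
kernel.schedule(100, "trader", "wake")
kernel.schedule(0, "exchange", "market_open")  # scheduled second, processed first
kernel.run({"trader": trader, "exchange": exchange})
print(kernel.log)
```

Because each handler can enqueue events that depend on the state it just changed, the loop has to process messages one at a time in timestamp order. That dependency chain is why this style of simulator resists batching onto a GPU and is parallelized by running whole simulations in separate processes instead.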

ElegantRL — deep dive

ElegantRL is RL-engineering-first: massively parallel env rollouts on one GPU, clean Actor/Critic PyTorch primitives, and an env plug-in slot where finance is one task among many.

The simulation contract

ElegantRL is the odd one out: it is an RL library, not a trading framework. The simulation contract is whatever the env you bolt on says it is. The shipped finance demos use a FinRL-Meta StockTradingEnv running as a vectorized, tensor-based env on the GPU, with share prices rather than robot state laid out along the batch dimensions, so the contract there is an OHLCV-shaped one inherited from the FinRL-Meta repo. The library's center of gravity is somewhere else: an Isaac-Gym-style vectorized env loop that runs 4k to 16k parallel envs on a single GPU with no CPU-side copy, a clean Actor/Critic separation, and a Podracer cloud-native layer that scales the same code to hundreds of GPUs via a tournament ensemble of pods.

Agent and trainer

This is the half ElegantRL takes most seriously. The repo ships its own clean-room PyTorch implementations of DQN, Double DQN, D3QN, REDQ, A2C, PPO, DDPG, TD3, SAC, all conforming to a shared Agent.update_net(buffer) contract, with twin-Q targets, GAE, KL leashes, and the rest of the engineering primitives. The point is not "we re-implemented PPO"; it is "we re-implemented PPO so the rollouts can stay on-device and the buffer can be a GPU tensor, not a CPU queue." Compared with SB3 (which prioritizes algorithmic clarity over raw throughput) and RLlib (which prioritizes cluster scale-out over per-GPU efficiency), ElegantRL chose throughput per GPU as its design point.
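
As a reference point for one of those primitives, here is the textbook generalized advantage estimation (GAE) recursion in plain Python. It is a sketch for a single trajectory segment with no episode boundary inside it, and it is not ElegantRL's implementation, which would operate on batched device tensors; the function name and defaults are assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over one unbroken trajectory segment.

    rewards[t] and values[t] are per-step; last_value bootstraps the state
    that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and a zero value function, GAE reduces to reward-to-go.
print(gae_advantages([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
```

The backward loop is sequential in time but independent across environments, so a vectorized trainer can run it once over a whole batch of parallel rollouts held on the device.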

What the runtime makes hard

Two things. First, the finance env you get is not the library's center of gravity — it is a port of FinRL-Meta running as a vec-env, which means the simulation contract is still OHLCV with the same close-price-fill limitations FinRL has. ElegantRL does not improve the realism of trading simulation; it improves how fast you can train against whatever realism the env provides. Second, the parallel-env story is glorious on a workstation GPU and noticeably more complex when the env is not vectorizable on the device — an ABIDES-style LOB simulator does not vectorize on the GPU at all, so ElegantRL's throughput advantage collapses for high-fidelity simulators. Pick ElegantRL when the trainer is your bottleneck and the env is cheap to vectorize; pick something else when the env is the expensive part.

Cross-cutting comparison

Who owns the simulation contract

The headline axis the feature lists hide. Four answers to "where do the action shape, fill model, slippage, reward, and episode boundary actually come from?"

Strip everything else away and this is the axis that decides whether the right project is the one you started with. FinRL writes the contract for you — action space is a position vector, fills happen at close, costs are bps, episodes are days — which is exactly right when "trade a US-equity basket on daily bars" is the actual task, and exactly wrong when it is not. TensorTrade hands you the contract as a checklist of plug-ins: pick the ActionScheme, pick the RewardScheme, pick the Exchange, and the env is the composition of those choices — which is liberating once you know what you want and exhausting before then. ABIDES-Gym does not let you choose the contract; the simulator chooses it for you, and the simulator chose "what an order book actually does," so your action space is real order types and your fills come from queue position. ElegantRL is the meta-position: the contract is whatever env plug-in you pass to its trainer, so the question for ElegantRL users is whose env you adopted, not what ElegantRL itself believes. If your research question is about the algorithm (better credit assignment, better exploration), ElegantRL's neutrality is the point; if your research question is about the market, the contract has to come from somewhere with an opinion.

RL machinery — which algorithms are first-class

Who implements the agent loop — and what that implementation is optimized for.

All four projects nominally support the same canonical algorithms (PPO, SAC, A2C, DQN, often DDPG and TD3); what differs is who implements them and what those implementations are tuned for. FinRL does not implement any of them itself; DRLAgent is a facade that delegates to whichever of SB3, RLlib, or ElegantRL you import, which means your algorithm is really their algorithm and you inherit their churn. TensorTrade also implements nothing: the env is a Gymnasium contract and you bring SB3 or Ray RLlib or CleanRL, which keeps the project small but means upgrades happen out-of-band. ABIDES-Gym is again a Gym wrapper, so SB3 or RLlib is the typical pairing; the reference paper used a custom Deep Dueling Double Q-learning agent with APEX prioritized replay because that combination matched the execution-quality task. ElegantRL is the one that wrote its own implementations end-to-end, optimized for keeping rollouts on the GPU as device tensors rather than CPU queues; the throughput delta against SB3 is large when the env vectorizes on-device, and irrelevant when it does not. If the trainer is your bottleneck, ElegantRL is the answer; if integration with the rest of your team's stack matters more, an SB3-backed setup (which FinRL, TensorTrade, and ABIDES-Gym all support) is the safer choice.

Realism vs throughput

The trade is real: every project pays for fidelity in throughput or vice versa.

Every choice here is a position on the same trade-off curve. FinRL accepts that fills at close-price are unrealistic in exchange for daily-bar simulations that run in seconds — fine for portfolio allocation research, dangerous for short-horizon trading where slippage is the whole story. TensorTrade puts the choice in your hands: fidelity is whatever the Exchange plug-in you wrote provides, with realistic options requiring real engineering. ABIDES-Gym accepts that the simulator is single-threaded discrete-event Python in exchange for queue-position-accurate fills, modeled latency, and endogenous market impact from background traders — for execution-style research where the only honest answer is "trade against an order book," nothing else in this set is in the same league. ElegantRL inverts the question entirely: assume the env is cheap and ask how many parallel rollouts a single GPU can sustain, then scale that across pods. The result is glorious training throughput when the env is vectorizable on-device (Isaac Gym, OHLCV stock env) and a hard wall when it is not (ABIDES, anything message-driven). The pragmatic two-phase pattern is to use a fast, lower-fidelity env for hyper-parameter search and policy class selection, then validate the chosen policy on a high-fidelity env before believing the result — the four projects in this set neatly cover the two halves of that pattern.

When to pick which

Use case	Pick FinRL if…	Pick TensorTrade if…	Pick ABIDES-Gym if…	Pick ElegantRL if…
Daily-bar portfolio research	Yes — the default env is exactly this; one notebook gets you running.	Workable, but you assemble the contract yourself.	Overkill — the LOB simulator is wasted at daily frequency.	Use it as the trainer behind FinRL-Meta's vec-env stock task.
Custom action / reward / cost model	You subclass the env — possible but fights the framework.	Yes — plug in your own ActionScheme / RewardScheme / Exchange.	Action shape is the simulator's; reward is yours to define.	Env is yours to design; the trainer does not care.
Execution / market-making research	Wrong abstraction — fills at close hide the problem.	Possible with a custom LOB Exchange plug-in, but real work.	Yes — designed for this; queue position and latency are modeled.	Not on its own; ElegantRL trainer is fine but it needs an LOB env.
Trainer throughput is the bottleneck	Switch the FinRL backend to ElegantRL — it is supported.	SB3 with SubprocVecEnv; CPU-bound.	Single-thread simulator is the bottleneck, not the trainer.	Yes — vectorized GPU rollouts are the design point.
Newcomer wanting the quickest first agent	Yes — the on-ramp is the shortest of the four.	Longer ramp; you assemble before you train.	Steepest learning curve — also learning ABIDES.	RL-engineering knowledge assumed; not the gentlest entry.

FAQ

If FinRL already wraps ElegantRL, why pick ElegantRL directly?

FinRL's DRLAgent facade is convenient but lossy — to keep one shared interface across SB3 / RLlib / ElegantRL it has to expose the lowest common denominator of their APIs. Going to ElegantRL directly buys you access to the vectorized-env trainer loop, the on-device replay buffer, and the Podracer multi-pod scale-out — none of which is reachable through the facade. The right rule is: start with FinRL when the question is "does this policy class learn anything?" and drop to ElegantRL when the question is "how fast can I sweep hyper-parameters?"

Is "MarketGym" a real project? Why is this post about ABIDES-Gym instead?

"MarketGym" appears in some survey papers as a generic label for Gym-style market environments, but there is no single canonical project under that name that is actively maintained in 2026 — closest matches like Yvictor/TradingGym, thedimlebowski/Trading-Gym, and hackthemarket/gym-trading have been quiet for years. The acceptable substitute within the same paradigm is the LOB / microstructure-realism slot, and there ABIDES-Gym (J.P. Morgan AI Research, on top of the ABIDES simulator) is the live, well-cited reference implementation. We swap it in explicitly rather than silently to keep the comparison honest. If you saw "MarketGym" in a 2022-era paper and were chasing the same idea, ABIDES-Gym is what you actually want.

Can I just use Stable-Baselines3 directly with a custom trading env and skip all four?

Yes, and many teams do — once you understand what the simulation contract should be, "SB3 + your env" is the smallest dependency footprint of the lot. The reason these four projects exist is that writing a good trading env is hard, and each one front-loads a different chunk of that work: FinRL writes the env for you, TensorTrade writes the env's plumbing for you, ABIDES-Gym writes the simulator for you, and ElegantRL writes the trainer for you. You can absolutely skip them; you will just write more of it yourself.

How does RL-for-trading relate to RL-for-tool-use, which the rest of the wiki talks about?

They share the deep machinery (policy gradients, value functions, exploration) and split sharply on the credit-assignment story. RL-for-trading has a relatively dense reward, since every step produces a PnL delta, and the hard part is whether the env's slippage and cost assumptions match the real venue. RL for agentic tool use has a sparse, often-terminal reward and the hard part is whether the verifier you reward against is trustworthy. The intuitions covered in the wiki's "RL for tool use" and "reward design and hacking" pages transfer over directly, and the trading-specific failure mode (a policy that "wins" because the slippage model is too kind) is exactly the reward-hacking pattern in a financial costume.

Which of these is closest to a "real" trading agent in production?

None of the four is a production trading system on its own — they are research and prototyping platforms. The honest pipeline is: prototype with FinRL or TensorTrade to validate the policy class, train at scale with ElegantRL once the env is vectorizable, validate execution behavior on ABIDES-Gym against background-trader populations before believing any backtest, then port the chosen policy out to whatever execution gateway your venue exposes. Treating any single one as the whole stack is exactly the failure mode this post is written against.

Does it matter that FinRL and ElegantRL are both AI4Finance projects?

It matters in a good way: they are designed to compose. FinRL imports ElegantRL as one of its backend options, FinRL-Meta supplies vectorized envs that ElegantRL trains efficiently, and the FinRL-Podracer paper shows the cloud-native scale-out story end-to-end. The downside is the obvious one — if you are betting on an organization, you are betting on one organization across two libraries. Diversifying upstream is exactly what TensorTrade (community-maintained) and ABIDES-Gym (J.P. Morgan) buy you for half of the stack.

