AI Blog

Llama 4 vs DeepSeek V3 vs Qwen3 vs Mistral Large 3: Four Open-Weights Flagships, Four Different Bets

Every few months, four labs ship a similar-sounding open-weights flagship — MoE, long context, reasoning mode, multimodal. The benchmarks keep getting passed back and forth. The thing that actually decides which one you run in production is the axis each lab is betting on next: multimodal ecosystem, inference economics, agentic reasoning, or permissive-license frontier intelligence.

By Agentic AI Wiki 27 min read

Read the spec sheets and these four open-weights flagships sound like the same model with different stickers: MoE backbone, long context, reasoning mode, tool use, some flavour of multimodal. The thing that actually decides which one you self-host in production is invisible there: the axis each lab is betting on next. Llama 4 doubles down on native multimodality and an open ecosystem. DeepSeek V3.2 attacks inference economics with MLA, sparse attention, and FP8 training. Qwen3 maximizes language coverage and agent-shaped reasoning. Mistral Large 3 puts a frontier-scale MoE under Apache 2.0 for teams that need a permissive license a regulator will sign off on. Pick on the axis; the benchmarks will keep moving.

At a glance

Four labs, four answers to the same question — what do you optimize when everyone has MoE, long context, and a reasoning mode? The table lists the basics; the matrix below it shows where each model leans hardest across the axes that genuinely differ in production.

Model Released / maintainer Architecture License
Llama 4 (Scout / Maverick) April 2025, Meta MoE, native multimodal Llama 4 Community License
DeepSeek V3.2 December 2025, DeepSeek (V3 family since Dec 2024) MoE, MLA + DeepSeek Sparse Attention, FP8 MIT
Qwen3 (235B-A22B + dense siblings) April 2025, Alibaba (Instruct/Thinking-2507 update) MoE flagship + dense 0.6B → 32B, hybrid /think Apache 2.0
Mistral Large 3 December 2025, Mistral AI Sparse MoE, image inputs Apache 2.0

Snapshot: 2026-06-02. Open-weights flagships move fast — verify against current model cards before committing.

Open-weights flagship feature matrix Heatmap comparing Llama 4, DeepSeek V3.2, Qwen3, and Mistral Large 3 across six axes: native multimodality, inference economics, agentic reasoning depth, language coverage, permissive license, and ecosystem reach. Strength indicated by fill color from light (weak) to dark (strong). Open-weights flagship feature matrix Native multimodal Inference economics Agentic reasoning Language coverage Permissive license Ecosystem reach Llama 4 Native early fusion 10M ctx, INT4 No native /think 12 official, English-leaning Custom license, 700M MAU cap Biggest; every cloud + stack DeepSeek V3.2 Text only MLA + DSA + FP8 Thinking + tool use EN + ZH strong MIT Strong serving, growing fine-tunes Qwen3 Qwen3-VL sibling Hybrid /think, dense siblings Hybrid + Qwen-Agent 100 + langs, CJK depth Apache 2.0 HF + ModelScope, Qwen-Agent SDK Mistral Large 3 Image inputs 128K ctx, priciest active Tool-use, reasoning sibling soon 40 + langs, EU depth Apache 2.0, frontier scale Big clouds, EU-hosted API Weak Medium Strong
Where each flagship leans hardest. Notice how the matrix is jagged — nobody is strong everywhere, and the strengths don't overlap much.

Llama 4 — deep dive

Llama 4 architecture Llama 4 is a natively multimodal MoE family. Text and image tokens enter a shared early-fusion transformer; a router activates 17B of 109B (Scout) or 400B (Maverick) parameters per token. The license, weights, and a large ecosystem of fine-tunes and inference engines define the production story. Text tokens prompt + chat history Image patches ViT encoder, patch tokens Early-fusion text + image into ONE stream of tokens native multimodality MoE router picks top-k experts per token Scout: 1 of 16 Maverick: 1 of 128 + shared ~17B active per token 109B / 400B total Expert 1 · FFN Expert 2 · FFN Expert k · FFN (active) … 16 / 128 total Long-context attention Scout: up to 10M tokens · Maverick: 1M iRoPE-style scaling for ultra-long context Text + code output (12 + langs) multimodal in, text out tool calls, structured outputs, agents Open ecosystem you actually deploy into Llama 4 Community License free under 700M MAU acceptable-use policy Weights on Hugging Face BF16; int4 fits one H100 llama.com / hf.co/meta-llama Serving stacks vLLM, TGI, llama.cpp Bedrock, Groq, Together Downstream fine-tunes LoRA, full SFT, RLHF thousands of community ckpts
Llama 4 fuses text and image patches into one stream before routing — multimodality is the architecture, not a bolt-on adapter — and the ecosystem strip is the part the spec sheet hides.

Architecture and training bet

Llama 4 is Meta's first MoE Llama and the first with native multimodality: text tokens and image patches enter the same transformer through an early-fusion path, so the model attends across modalities rather than calling a separate vision encoder as a tool. Scout uses 16 experts with about 17B active of 109B total parameters and pushes context to 10M tokens via an iRoPE-style scaling scheme — the longest window of the four. Maverick trades context (1M) for capacity, with 128 experts and 400B total. A larger Behemoth teacher model was previewed but has not been released. The training run uses ~40T tokens for Scout and a vision-language curriculum that includes interleaved image-text documents from the start.

Shipping it in production

The license is the Llama 4 Community License, not Apache or MIT: free for anyone under 700M monthly active users, an acceptable-use policy a lawyer should read, and a clause that says derivative model names must start with "Llama". Practically, every cloud and inference vendor — Bedrock, Vertex AI, Azure AI Foundry, Groq, Together, Fireworks — has Llama 4 day-one, vLLM and TGI ship optimized kernels, and llama.cpp/Ollama serve the int4 quants on a single H100 or a beefy Mac. The community fine-tune scene is by far the deepest of the four; thousands of LoRAs and full SFTs land on Hugging Face weekly. Weights live at llama.com and hf.co/meta-llama.

The axis Llama 4 is betting on

Open multimodal ecosystem at scale. Meta is not racing the others on raw MMLU or AIME — it is betting that the open-weights model that wins is the one with the most native multimodal capability, the most ubiquitous serving footprint, and the most downstream fine-tunes. If your product reads images, your team prefers the lowest-friction serving path, or you are building on a stack that already speaks Llama, this is the easy default. The cost is the license — if you are a hyperscaler-scale consumer product or a regulated buyer who needs Apache or MIT for procurement, Llama 4 is not the one you get past legal without a conversation.

DeepSeek V3.2 — deep dive

DeepSeek V3 / V3.2 architecture DeepSeek V3.2 is a 685B-parameter MoE that activates ~37B per token. Multi-head Latent Attention compresses the KV cache; DeepSeek Sparse Attention prunes long-context attention; FP8 mixed-precision training and an auxiliary-loss-free routing strategy cut per-token cost. The bet is inference economics at frontier quality. Token stream — 128K context, V3.2 trains for tool-use and long reasoning chains thinking / non-thinking modes share one set of weights Multi-head Latent Attention compresses KV into low-rank latents cuts KV cache by ~7× keeps attention quality long-context throughput goes up DeepSeek Sparse Attention learned sparsity pattern attends to a subset of past tokens reduces long-context compute added in V3.2 Auxiliary-loss-free MoE 256 routed + 1 shared expert ~37B active of 685B total bias-only load balancing no router penalty term FP8 mixed-precision training forward and backward pass in FP8 master weights / optimizer in BF16 ~half the GPU hours of dense peers at this quality Serving footprint FP8 / INT4 inference supported out of the box a single 8×H200 / 8×H100 node serves V3 well SGLang, vLLM, TGI, llama.cpp adapters License + ecosystem MIT license — fully open Weights on Hugging Face Tech reports + training code
DeepSeek's distinguishing pattern is how every layer is shaped around per-token cost: KV compression, sparse attention, fine-grained MoE, FP8 mixed precision — quality at a fraction of the per-token GPU spend.

Architecture and training bet

DeepSeek V3.2 is a 685B-parameter MoE with roughly 37B active per token, descended from V3 (December 2024) through V3.1 (August 2025, hybrid reasoning/non-reasoning) to V3.2 (December 2025, which introduced DeepSeek Sparse Attention). Three architectural moves carry the cost story: Multi-head Latent Attention (MLA) compresses the KV cache into low-rank latents and cuts memory by roughly 7× without quality loss; DeepSeek Sparse Attention learns a sparsity pattern over the context so long sequences do not pay full quadratic attention; and an auxiliary-loss-free routing strategy balances 256 routed experts plus 1 shared expert using a bias-update trick instead of a router penalty term. The whole stack trains in FP8 mixed precision, so a frontier-quality model lands in roughly half the GPU hours its dense peers consume.

Shipping it in production

License is MIT — the most permissive of the four, no acceptable-use policy and no name restriction. Weights and tech reports are on Hugging Face; the V3.2 model card lists BF16, FP8 (E4M3), and FP32 tensor types, so FP8 inference is supported out of the box on H100/H200 hardware, and INT4 community quants run on smaller boxes. A single 8×H200 node serves V3.2 comfortably under SGLang, vLLM, or TGI. The geopolitical context is real: some US regulated buyers will not deploy a Chinese-lab model without a security review, and inference providers in the US have a thinner V3 fleet than they do for Llama. Inside CN, the ecosystem is the strongest of the four.

The axis DeepSeek is betting on

Inference economics. DeepSeek's whole research arc is "frontier quality at the cheapest plausible per-token cost," and the engineering — MLA, DSA, FP8, sparse routing — exists because each piece shrinks either the KV cache, the long-context compute, or the dollar cost of producing a token. If you measure your retrieval and agent stacks in dollars per million tokens served and the agent does long-context tool-use over and over, V3.2 is the model whose price/quality curve genuinely moves your unit economics.

Qwen3 — deep dive

Qwen3 architecture Qwen3-235B-A22B is an MoE flagship with a hybrid reasoning mode: one set of weights can switch between thinking and non-thinking responses at inference time. The family scales from 0.6B dense to 235B MoE under Apache 2.0, supports 100+ languages, and ships with strong tool-use and agent-shaped post-training. Prompt 100 + languages supported Mode flag /think or /no_think Hybrid reasoning weights ONE checkpoint, TWO behaviors "thinking" — long chains, hidden scratch, then answer "non-thinking" — fast, no scratch user picks per request no need to host two models MoE backbone 235B total · ~22B active 128 experts, top-8 routing also dense variants: 0.6B → 32B 256K context (extendable to 1M) Apache 2.0 across the family Agent-shaped post-training tool-use SFT + RL on the same weights strong on Qwen-Agent / function-calling format MCP and OpenAI tool schemas both supported code (Qwen3-Coder) + math (Qwen3-Math) siblings Language coverage 119 languages / dialects in training mix strong on CJK + South Asian + low-resource non-English agent benchmarks competitive leading open-weights option outside English Shipping it in production Apache 2.0 no MAU clause Hugging Face + ModelScope official quantized GGUFs Qwen-Agent SDK tools + memory DashScope API first-party hosted
Qwen3's headline trick is hybrid reasoning on a single set of weights: the same checkpoint serves the easy questions fast and the hard ones thoughtfully, picked per-request by the caller.

Architecture and training bet

Qwen3 is a family: dense models from 0.6B through 32B, and an MoE flagship Qwen3-235B-A22B with about 22B active parameters and 128 experts (top-8 routing), all under Apache 2.0. The signature move is hybrid reasoning — one checkpoint with two behaviors, switchable by a /think or /no_think flag in the prompt. The thinking mode produces an internal scratch (hidden from the final answer) then commits to the response; the non-thinking mode skips the scratch entirely and replies at chat speed. Updated Instruct-2507 and Thinking-2507 variants landed in July 2025 with sharper agent skills. Context is 256K natively, extendable to ~1M with YaRN, and the training mix covers 119 languages — the broadest of the four flagships.

Shipping it in production

License is Apache 2.0 across the entire family — no MAU clause, no name restriction, commercial-use ready. Weights ship on Hugging Face and ModelScope with official GGUF quantizations; the Qwen-Agent SDK gives you a first-party agent runtime with MCP support and built-in tool patterns; DashScope is Alibaba's first-party hosted API. The dense siblings matter more than they look: when the 235B flagship is too heavy, the 14B or 32B dense Qwen3 runs on a single GPU with strong tool-use, which closes the deployment gap for teams that cannot afford a 4-node serving cluster. The Coder and Math siblings (Qwen3-Coder, Qwen3-Math) inherit the same post-training shape for specialty tasks.

The axis Qwen3 is betting on

Language coverage and agent-shaped reasoning. Qwen3 is the open-weights pick if your users are global — CJK, South Asian, Arabic, low-resource languages all have meaningful depth — and the hybrid reasoning + agent-tuned post-training mean a tool-using agent works out of the box without you grafting on a CoT prompt or a separate reasoning model. The dense siblings keep the family deployable from a single GPU up to a serving cluster, which is the part of "open-weights flagship" that the headline 235B model alone hides.

Mistral Large 3 — deep dive

Mistral Large 3 architecture Mistral Large 3 is a 675B-total / 41B-active MoE released under Apache 2.0 in December 2025. Its bet is permissive-license frontier intelligence: a clean license that a regulated team can ship without legal review, EU-hosted weights, and a model that competes with closed flagships on general benchmarks. Apache 2.0 — the part that decides everything else no MAU cap, no field-of-use restrictions, no acceptable-use policy a lawyer has to re-read every quarter EU-hosted weights · BYOC deployment story baked in from day one Sparse MoE backbone 675B total · 41B active per token first MoE since Mixtral trained on NVIDIA H200 cluster base + instruct + (upcoming) reasoning variants Image understanding + multilingual image inputs (vision encoder) 40 + languages, EU-language depth 128K context · tool-use first class debuted top-2 OSS non-reasoning on LMArena Where you actually run it self-host weights on your infrastructure Mistral AI Studio (EU-hosted API) Azure AI Foundry, AWS Bedrock, GCP Vertex vLLM, TGI, Ollama, llama.cpp The bet frontier-class quality + Apache 2.0 regulated jurisdictions can ship it as-is EU data-residency story is the default, not an add-on "permissive open-weights flagship" niche Released December 2, 2025 — currently Mistral's most capable open-weights model
Mistral Large 3 puts the license at the top of the diagram on purpose — that one decision is the part of the architecture a regulated buyer will read first.

Architecture and training bet

Mistral Large 3 is a sparse MoE with 41B active and 675B total parameters, released December 2, 2025 — Mistral's first MoE flagship since the Mixtral series and currently the largest fully open-weights MoE under a permissive license at this scale. It is multimodal in (image inputs through a vision encoder), text-out, covers 40+ languages with deliberate EU-language depth, supports a 128K context, and was trained on NVIDIA's H200 cluster. Base and instruction-tuned variants ship together; a reasoning variant was announced as upcoming at release. On LMArena it debuted at #2 in the OSS non-reasoning category, which puts it in striking distance of closed flagships on general tasks.

Shipping it in production

License is Apache 2.0 — and this is the part Mistral's pitch keeps front and center. There is no MAU cap, no field-of-use restriction, and no acceptable-use policy a lawyer has to re-litigate every quarter. EU-hosted weights and an EU-hosted Mistral AI Studio API are the default deployment path, with Azure AI Foundry, AWS Bedrock, and GCP Vertex as multi-cloud partners. Self-hosting works on vLLM, TGI, Ollama, and llama.cpp; community quantizations cover FP8 and INT4. Of all four flagships, Large 3 is the easiest to clear through procurement and the GDPR/EU AI Act review at the same time.

The axis Mistral Large 3 is betting on

Permissive-license frontier intelligence for regulated jurisdictions. The bet is that the open-weights model that wins inside European banks, telcos, defense suppliers, and healthcare systems is the one whose license, training compute, and hosting story all read clean to a compliance team. The active parameter count (41B, the largest of the four) costs more per token than DeepSeek's stack, but the model is positioned as a frontier-class tier, not the cheapest one. If your buyer cares about where the weights were trained, where they run, and what the license says before they care about MMLU, Large 3 is the model with the simplest answer.

Cross-cutting comparison

Architecture shape — dense vs MoE, active vs total

Architecture shape — dense vs MoE, active vs total Four-column comparison of model shape. Llama 4 Maverick: 17B active / 400B total, 128 experts. DeepSeek V3.2: ~37B active / 685B total, 256 experts plus shared. Qwen3-235B-A22B: 22B active / 235B total, 128 experts. Mistral Large 3: 41B active / 675B total, MoE. Architecture shape — active vs total parameters Llama 4 (Maverick) MoE · 128 experts 17B active 400B total Scout sibling: 17B / 109B, 16 experts native multimodal 10M context (Scout) smallest active count DeepSeek V3.2 MoE · 256 routed + 1 shared expert ~37B active 685B total MLA + DSA cut KV cache FP8 mixed precision 128K context finest-grained sparsity Qwen3-235B-A22B MoE · 128 experts 22B active 235B total also dense 0.6B → 32B hybrid /think modes 256K context (1M extended) 100+ languages tightest active band Mistral Large 3 MoE · sparse 41B active 675B total first MoE since Mixtral image inputs 128K context 40+ languages largest active count
All four are MoE, but the active-to-total ratios and expert counts say different things about where each lab spends its compute.

All four ship MoE, so the right question is not "MoE vs dense" but how much capacity sits behind each active parameter. Llama 4 Maverick has the tightest active count at 17B with 128 experts, betting that the router can find the right specialist among many at low per-token cost — and Scout pairs that with the longest context window of any open model. DeepSeek V3.2 stretches that strategy further with 256 routed experts plus a shared expert, the finest-grained sparsity in this set, and pairs it with MLA + sparse attention so the per-token win compounds across long contexts. Qwen3 takes the middle ground (22B active, 128 experts) and adds two compensations that nobody else has: a dense ladder from 0.6B to 32B for teams who cannot run the flagship at all, and a hybrid reasoning mode that lets one checkpoint serve both quick and deliberate queries. Mistral Large 3 has the largest active parameter count at 41B — explicitly a frontier-quality bet rather than a cheapest-per-token one — and the smallest total parameter share of the four (675B is just under DeepSeek's 685B, but with a coarser expert split). The shape map is clean: DeepSeek minimizes cost per active parameter, Llama 4 stretches context furthest, Qwen3 covers the deployment ladder, Mistral pays for quality.

Inference economics — context, KV cache, FP8 / quantization

Inference economics — context, KV cache, FP8 quantization Comparison of the per-token cost story. Llama 4 Scout: 10M context via iRoPE, fits one H100 at INT4. DeepSeek V3.2: MLA + DeepSeek Sparse Attention cut KV cache and long-context compute, FP8-native. Qwen3: 256K context, hybrid mode lets short queries skip thinking. Mistral Large 3: 128K context, EU-hosted serving. Inference economics — context, KV cache, FP8 Llama 4 Scout: 10M context Maverick: 1M context iRoPE long-context scaling int4 on a single H100 large KV at full context vLLM / TGI / Bedrock longest context window DeepSeek V3.2 128K context MLA compresses KV ~7× DeepSeek Sparse Attention prunes long-context compute FP8 native, INT4 supported ~half GPU hours per quality cheapest per useful token Qwen3 256K context extendable to 1M hybrid /think mode non-thinking ≈ chat speed official GGUFs at q4/q8 dense siblings for edge spend matches the question Mistral Large 3 128K context 41B active is the biggest of the four — strongest single-pass quality FP8 / INT4 community quants EU-hosted API tier priced as a premium tier
The dimension where shipping at scale gets expensive — KV cache and long-context compute — is where the four pull furthest apart.

Inference economics is the dimension agentic workloads punish hardest: every tool call replays the prefix, every long retrieval inflates the KV cache, and an agent's loop multiplies whatever the per-token cost is by ten or fifty (see the cost ladder in cost, quality, and latency). DeepSeek V3.2 is the one that engineered for this directly — MLA compresses the KV cache ~7×, DeepSeek Sparse Attention prunes long-context attention, FP8 is the training and inference default — so a long-running tool-use agent costs noticeably less per useful turn than the others at comparable quality. Qwen3 gets there by a different route: the hybrid /think mode means easy turns of the agent skip the scratch entirely, so spend matches the difficulty of each step rather than always paying for full reasoning. Llama 4 Scout has the longest context window of the four (10M tokens via iRoPE), which is a feature for some agent shapes — multi-document summarization, codebase-wide reasoning — but a cost trap for others, because the KV cache for a fully populated 10M-token context is enormous; int4 quantization on a single H100 mitigates this only at modest fill. Mistral Large 3 has the highest active parameter count (41B) of the four and is priced like a premium tier — it is not the model you choose because the per-token cost is the lowest; you choose it because the license or jurisdictions story dominates.

License and ecosystem — what you can actually ship and under what terms

License and ecosystem — what you can actually ship and under what terms Llama 4 ships under the Llama Community License, restrictive above 700M MAU and with an acceptable-use policy. DeepSeek V3.2 is MIT — fully permissive. Qwen3 is Apache 2.0 across the family. Mistral Large 3 is Apache 2.0, the largest open-weights MoE under a fully permissive license. License + ecosystem — what you can actually ship Llama 4 Llama Community License free under 700M MAU acceptable-use policy name-with-derivatives rule biggest ecosystem thousands of fine-tunes every cloud + serving stack deepest ecosystem DeepSeek V3.2 MIT — fully permissive no MAU cap, no AUP no name restrictions geopolitical questions for US regulated buyers SGLang, vLLM, TGI strong CN cloud presence most permissive on paper Qwen3 Apache 2.0 across family no MAU cap no name restrictions commercial-use ready huge non-English coverage Qwen-Agent SDK, MCP ModelScope + Hugging Face cleanest permissive flagship Mistral Large 3 Apache 2.0 largest permissive MoE at frontier scale EU-hosted weights GDPR / EU AI Act ready Azure, AWS, GCP partners Mistral AI Studio regulated-jurisdiction pick
The license clause is the part the model card hides but that procurement reads first.

License is the cleanest axis to compare, and it is the one most spec-sheet comparisons skip. Llama 4 ships under the Llama 4 Community License: free below 700M monthly active users, with an acceptable-use policy and a clause requiring derivative model names to begin with "Llama". That is permissive enough for almost every team to start with and restrictive enough that legal will want a look before a hyperscaler product or regulated buyer commits. DeepSeek V3.2 is MIT — the most permissive license of the four on paper — and the practical caveat is geopolitical: many US-regulated buyers will not deploy a Chinese-lab model without an extra security review, and US inference vendors carry V3 thinner than they carry Llama. Qwen3 is Apache 2.0 across the whole family with no MAU clause and no name restriction; it is the cleanest "open-weights flagship" of the four for commercial use in non-restricted jurisdictions, and the Qwen-Agent + ModelScope + Hugging Face combination gives it a deeper ecosystem than its newcomer status suggests. Mistral Large 3 is Apache 2.0 at frontier scale — the largest permissive MoE released by a major lab as of mid-2026 — with EU-hosted weights and a Mistral AI Studio API that ship with GDPR/EU AI Act answers ready. For procurement, the order from easiest to hardest to clear is roughly: Qwen3 ≈ Mistral Large 3 < Llama 4 < DeepSeek V3.2 (where DeepSeek's MIT is more permissive on paper, but the practical clearance bar is higher for US-regulated buyers).

When to pick which

Use case Pick Llama 4 if… Pick DeepSeek V3.2 if… Pick Qwen3 if… Pick Mistral Large 3 if…
Per-token cost dominates Only via Scout int4 on a single H100 at modest context. Yes — MLA, DSA, FP8 are designed for this exact answer. Hybrid /think saves spend on the easy half of agent turns. No — priced as a frontier-quality tier, not a cheapest one.
Native multimodal product Yes — early fusion is the architecture, not an adapter. Not yet — V3.2 is text-only; vision sibling separate. Qwen3-VL sibling, post-trained for vision-language. Image inputs supported; text-out only.
Multilingual agent (non-English users) 12 official languages; English-leaning beyond that. EN + CJK depth; less coverage outside. Yes — 119 languages, depth in CJK / South Asian / low-resource. 40+ languages, deliberate EU-language depth.
Regulated jurisdiction / strict procurement Acceptable below 700M MAU; legal must read the AUP. MIT on paper; CN-origin review may slow US deployments. Apache 2.0, clean for commercial use globally. Yes — Apache 2.0 + EU-hosted weights + GDPR-ready story.
Smallest viable serving footprint Scout int4 on a single H100 is the headline; Maverick needs more. Single 8×H200 node serves V3.2 well, smaller via quants. Dense siblings (0.6B → 32B) cover the small end natively. 675B total needs serious capacity; community quants help.

FAQ

Which of these is actually the "best" open-weights model in mid-2026?

There is no single winner — and treating the question this way is how teams get stuck on benchmark leaderboards instead of shipping. The honest answer is that each model wins on its own axis: DeepSeek V3.2 on per-token cost at frontier quality; Llama 4 on multimodal-in plus the deepest ecosystem; Qwen3 on language coverage plus a deployable dense ladder; Mistral Large 3 on permissive-license frontier quality for regulated buyers. If you have to pick one without context, Qwen3-235B-A22B is the safest default because its weaknesses are the most evenly distributed and its license is the cleanest globally — but "safe default" is not "best for your stack." For the underlying selection mindset see choosing a model and reading benchmarks.

How do these compare with closed-weights frontier models (GPT-5, Claude 4.x, Gemini 2.x)?

On general benchmarks the gap has narrowed to the point where the open-weights flagships are within a small percentage of the closed frontier on most non-reasoning tasks, and the gap on long-horizon agent tasks is wider but closing each release. The real differences are not in MMLU but in (a) the closed frontier's better-trained tool use, multi-step reasoning, and safety post-training; (b) the closed frontier's hosted infrastructure (caching, batching, multi-region) you do not have to operate; (c) the open-weights frontier's hard cost and data-boundary advantages. The clean framing in open vs closed models still applies: the question is not "which is better" but "where do I want the trade-offs to land."

Do I actually need MoE, or is a dense model fine?

Every flagship in this comparison is MoE because at the frontier MoE buys quality per active FLOP — but for many production teams the right answer is still a smaller dense model. Qwen3 makes this explicit by shipping a dense ladder from 0.6B to 32B under the same family; a 14B or 32B dense Qwen3 with the Qwen-Agent SDK runs an agent loop on a single consumer GPU with strong tool-use, which the 235B MoE flagship cannot. Mistral ships a separate Ministral family (3B-14B) for the same reason. Use the MoE flagship when you actually need the quality; use a dense sibling when serving cost or latency dominates. See cost, quality, and latency.

Why is "the axis they're betting on" the thing that matters, not the benchmarks?

Because the benchmarks keep getting passed back and forth — every release flips the leaderboard for a quarter — but the architectural and licensing bets stay stable across versions. A team that picks DeepSeek for inference economics is still going to want DeepSeek-style sparse attention in V4 and V5; a team that picks Mistral for the Apache 2.0 story will still want that license in Large 4. The axis is the durable choice; the benchmark winner is a snapshot. The reading benchmarks primer expands on this trap.

Does open-weights mean I can fine-tune freely?

Mostly yes, with one license-shaped caveat per model. Apache 2.0 (Qwen3, Mistral Large 3) and MIT (DeepSeek V3.2) let you fine-tune, ship derivatives, and even rename freely. Llama 4's Community License lets you fine-tune and ship derivatives, but the derivative model name must begin with "Llama" and the acceptable-use policy travels with the weights. For LoRA-shaped customization those constraints rarely bite; for a full SFT that becomes your product surface, legal should look at it. The SFT, rejection sampling, and distillation deep-dive covers the technical side; the license clause is the part to read line-by-line.

What about Llama 4 Behemoth, DeepSeek V4, or whatever drops next?

Llama 4 Behemoth was previewed in April 2025 but has not been released as open weights as of mid-2026; treat it as roadmap, not product. DeepSeek's release cadence has been one major update every ~four months (V3 → V3.1 → V3.2), so a V4 or V3.3 inside this article's lifetime is a safe bet. Qwen has shown the same cadence (Qwen3 → Qwen3.5 in early 2026 was reported, scaling to higher parameter counts). The point of comparing on axes — multimodality, inference economics, language coverage, license — is that the axis is stable even as the version number ticks. The reasoning carries forward; only the numbers change.

Further reading

On this wiki:

  • Open- vs closed-weights models — the trade-off framing that makes "which flagship?" tractable: what you give up and what you get when the weights are in your hands.
  • Choosing a model — the constraint-first selection guide that turns "best model" into "best for this constraint."
  • Model families — why a "model" is really a family of checkpoints (base, instruct, thinking, multimodal sibling) and how that family shape matters as much as the flagship.
  • Reading benchmarks — how to read MMLU / GPQA / SWE-bench numbers without getting played by leaderboard churn.
  • Cost, quality, and latency — the triangle that decides which flagship's per-token economics matters for your workload.
  • Reasoning models — what "thinking mode" and hybrid /think actually do at training time and why the hybrid checkpoint is a real win.
  • Inference providers — where these open-weights flagships actually run when you don't want to host them yourself.
  • Agentic AI for trading research — a worked applied case: these flagships are exactly the "prompted general LLM" corner of that post's domain-vs-general decision triangle, where the cost-of-call and reasoning-mode trade-offs decide which agent role each flagship can play.

Project sources: