Llama 4 vs DeepSeek V3 vs Qwen3 vs Mistral Large 3: Four Open-Weights Flagships, Four Different Bets

Q: Which of these is actually the "best" open-weights model in mid-2026?

There is no single winner — and treating the question this way is how teams get stuck on benchmark leaderboards instead of shipping. The honest answer is that each model wins on its own axis: DeepSeek V3.2 on per-token cost at frontier quality; Llama 4 on multimodal-in plus the deepest ecosystem; Qwen3 on language coverage plus a deployable dense ladder; Mistral Large 3 on permissive-license frontier quality for regulated buyers. If you have to pick one without context, Qwen3-235B-A22B is the safest default because its weaknesses are the most evenly distributed and its license is the cleanest globally — but "safe default" is not "best for your stack." For the underlying selection mindset see choosing a model and reading benchmarks.

Q: How do these compare with closed-weights frontier models (GPT-5, Claude 4.x, Gemini 2.x)?

On general benchmarks the gap has narrowed to the point where the open-weights flagships are within a small percentage of the closed frontier on most non-reasoning tasks, and the gap on long-horizon agent tasks is wider but closing each release. The real differences are not in MMLU but in (a) the closed frontier's better-trained tool use, multi-step reasoning, and safety post-training; (b) the closed frontier's hosted infrastructure (caching, batching, multi-region) you do not have to operate; (c) the open-weights frontier's hard cost and data-boundary advantages. The clean framing in open vs closed models still applies: the question is not "which is better" but "where do I want the trade-offs to land."

Q: Do I actually need MoE, or is a dense model fine?

Every flagship in this comparison is MoE because at the frontier MoE buys quality per active FLOP — but for many production teams the right answer is still a smaller dense model. Qwen3 makes this explicit by shipping a dense ladder from 0.6B to 32B under the same family; a 14B or 32B dense Qwen3 with the Qwen-Agent SDK runs an agent loop on a single consumer GPU with strong tool-use, which the 235B MoE flagship cannot. Mistral ships a separate Ministral family (3B-14B) for the same reason. Use the MoE flagship when you actually need the quality; use a dense sibling when serving cost or latency dominates. See cost, quality, and latency.

Q: Why is "the axis they're betting on" the thing that matters, not the benchmarks?

Because the benchmarks keep getting passed back and forth — every release flips the leaderboard for a quarter — but the architectural and licensing bets stay stable across versions. A team that picks DeepSeek for inference economics is still going to want DeepSeek-style sparse attention in V4 and V5; a team that picks Mistral for the Apache 2.0 story will still want that license in Large 4. The axis is the durable choice; the benchmark winner is a snapshot. The reading benchmarks primer expands on this trap.

Q: Does open-weights mean I can fine-tune freely?

Mostly yes, with one license-shaped caveat per model. Apache 2.0 (Qwen3, Mistral Large 3) and MIT (DeepSeek V3.2) let you fine-tune, ship derivatives, and even rename freely. Llama 4's Community License lets you fine-tune and ship derivatives, but the derivative model name must begin with "Llama" and the acceptable-use policy travels with the weights. For LoRA-shaped customization those constraints rarely bite; for a full SFT that becomes your product surface, legal should look at it. The SFT, rejection sampling, and distillation deep-dive covers the technical side; the license clause is the part to read line-by-line.

Q: What about Llama 4 Behemoth, DeepSeek V4, or whatever drops next?

Llama 4 Behemoth was previewed in April 2025 but has not been released as open weights as of mid-2026; treat it as roadmap, not product. DeepSeek's release cadence has been one major update every ~four months (V3 → V3.1 → V3.2), so a V4 or V3.3 inside this article's lifetime is a safe bet. Qwen has shown the same cadence (Qwen3 → Qwen3.5 in early 2026 was reported, scaling to higher parameter counts). The point of comparing on axes — multimodality, inference economics, language coverage, license — is that the axis is stable even as the version number ticks. The reasoning carries forward; only the numbers change.

Read the spec sheets and these four open-weights flagships sound like the same model with different stickers: MoE backbone, long context, reasoning mode, tool use, some flavour of multimodal. The thing that actually decides which one you self-host in production is invisible there: the axis each lab is betting on next. Llama 4 doubles down on native multimodality and an open ecosystem. DeepSeek V3.2 attacks inference economics with MLA, sparse attention, and FP8 training. Qwen3 maximizes language coverage and agent-shaped reasoning. Mistral Large 3 puts a frontier-scale MoE under Apache 2.0 for teams that need a permissive license a regulator will sign off on. Pick on the axis; the benchmarks will keep moving.

At a glance

Four labs, four answers to the same question — what do you optimize when everyone has MoE, long context, and a reasoning mode? The table lists the basics; the matrix below it shows where each model leans hardest across the axes that genuinely differ in production.

Model	Released / maintainer	Architecture	License
Llama 4 (Scout / Maverick)	April 2025, Meta	MoE, native multimodal	Llama 4 Community License
DeepSeek V3.2	December 2025, DeepSeek (V3 family since Dec 2024)	MoE, MLA + DeepSeek Sparse Attention, FP8	MIT
Qwen3 (235B-A22B + dense siblings)	April 2025, Alibaba (Instruct/Thinking-2507 update)	MoE flagship + dense 0.6B → 32B, hybrid /think	Apache 2.0
Mistral Large 3	December 2025, Mistral AI	Sparse MoE, image inputs	Apache 2.0

Snapshot: 2026-06-02. Open-weights flagships move fast — verify against current model cards before committing.

Where each flagship leans hardest. Notice how the matrix is jagged — nobody is strong everywhere, and the strengths don't overlap much.

Llama 4 — deep dive

Llama 4 fuses text and image patches into one stream before routing — multimodality is the architecture, not a bolt-on adapter — and the ecosystem strip is the part the spec sheet hides.

Architecture and training bet

Llama 4 is Meta's first MoE Llama and the first with native multimodality: text tokens and image patches enter the same transformer through an early-fusion path, so the model attends across modalities rather than calling a separate vision encoder as a tool. Scout uses 16 experts with about 17B active of 109B total parameters and pushes context to 10M tokens via an iRoPE-style scaling scheme — the longest window of the four. Maverick trades context (1M) for capacity, with 128 experts and 400B total. A larger Behemoth teacher model was previewed but has not been released. The training run uses ~40T tokens for Scout and a vision-language curriculum that includes interleaved image-text documents from the start.

Shipping it in production

The license is the Llama 4 Community License, not Apache or MIT: free for anyone under 700M monthly active users, an acceptable-use policy a lawyer should read, and a clause that says derivative model names must start with "Llama". Practically, every cloud and inference vendor — Bedrock, Vertex AI, Azure AI Foundry, Groq, Together, Fireworks — has Llama 4 day-one, vLLM and TGI ship optimized kernels, and llama.cpp/Ollama serve the int4 quants on a single H100 or a beefy Mac. The community fine-tune scene is by far the deepest of the four; thousands of LoRAs and full SFTs land on Hugging Face weekly. Weights live at llama.com and hf.co/meta-llama.

The axis Llama 4 is betting on

Open multimodal ecosystem at scale. Meta is not racing the others on raw MMLU or AIME — it is betting that the open-weights model that wins is the one with the most native multimodal capability, the most ubiquitous serving footprint, and the most downstream fine-tunes. If your product reads images, your team prefers the lowest-friction serving path, or you are building on a stack that already speaks Llama, this is the easy default. The cost is the license — if you are a hyperscaler-scale consumer product or a regulated buyer who needs Apache or MIT for procurement, Llama 4 is not the one you get past legal without a conversation.

DeepSeek V3.2 — deep dive

DeepSeek's distinguishing pattern is how every layer is shaped around per-token cost: KV compression, sparse attention, fine-grained MoE, FP8 mixed precision — quality at a fraction of the per-token GPU spend.

Architecture and training bet

DeepSeek V3.2 is a 685B-parameter MoE with roughly 37B active per token, descended from V3 (December 2024) through V3.1 (August 2025, hybrid reasoning/non-reasoning) to V3.2 (December 2025, which introduced DeepSeek Sparse Attention). Three architectural moves carry the cost story: Multi-head Latent Attention (MLA) compresses the KV cache into low-rank latents and cuts memory by roughly 7× without quality loss; DeepSeek Sparse Attention learns a sparsity pattern over the context so long sequences do not pay full quadratic attention; and an auxiliary-loss-free routing strategy balances 256 routed experts plus 1 shared expert using a bias-update trick instead of a router penalty term. The whole stack trains in FP8 mixed precision, so a frontier-quality model lands in roughly half the GPU hours its dense peers consume.

Shipping it in production

License is MIT — the most permissive of the four, no acceptable-use policy and no name restriction. Weights and tech reports are on Hugging Face; the V3.2 model card lists BF16, FP8 (E4M3), and FP32 tensor types, so FP8 inference is supported out of the box on H100/H200 hardware, and INT4 community quants run on smaller boxes. A single 8×H200 node serves V3.2 comfortably under SGLang, vLLM, or TGI. The geopolitical context is real: some US regulated buyers will not deploy a Chinese-lab model without a security review, and inference providers in the US have a thinner V3 fleet than they do for Llama. Inside CN, the ecosystem is the strongest of the four.

The axis DeepSeek is betting on

Inference economics. DeepSeek's whole research arc is "frontier quality at the cheapest plausible per-token cost," and the engineering — MLA, DSA, FP8, sparse routing — exists because each piece shrinks either the KV cache, the long-context compute, or the dollar cost of producing a token. If you measure your retrieval and agent stacks in dollars per million tokens served and the agent does long-context tool-use over and over, V3.2 is the model whose price/quality curve genuinely moves your unit economics.

Qwen3 — deep dive

Qwen3's headline trick is hybrid reasoning on a single set of weights: the same checkpoint serves the easy questions fast and the hard ones thoughtfully, picked per-request by the caller.

Architecture and training bet

Qwen3 is a family: dense models from 0.6B through 32B, and an MoE flagship Qwen3-235B-A22B with about 22B active parameters and 128 experts (top-8 routing), all under Apache 2.0. The signature move is hybrid reasoning — one checkpoint with two behaviors, switchable by a /think or /no_think flag in the prompt. The thinking mode produces an internal scratch (hidden from the final answer) then commits to the response; the non-thinking mode skips the scratch entirely and replies at chat speed. Updated Instruct-2507 and Thinking-2507 variants landed in July 2025 with sharper agent skills. Context is 256K natively, extendable to ~1M with YaRN, and the training mix covers 119 languages — the broadest of the four flagships.

Shipping it in production

License is Apache 2.0 across the entire family — no MAU clause, no name restriction, commercial-use ready. Weights ship on Hugging Face and ModelScope with official GGUF quantizations; the Qwen-Agent SDK gives you a first-party agent runtime with MCP support and built-in tool patterns; DashScope is Alibaba's first-party hosted API. The dense siblings matter more than they look: when the 235B flagship is too heavy, the 14B or 32B dense Qwen3 runs on a single GPU with strong tool-use, which closes the deployment gap for teams that cannot afford a 4-node serving cluster. The Coder and Math siblings (Qwen3-Coder, Qwen3-Math) inherit the same post-training shape for specialty tasks.

The axis Qwen3 is betting on

Language coverage and agent-shaped reasoning. Qwen3 is the open-weights pick if your users are global — CJK, South Asian, Arabic, low-resource languages all have meaningful depth — and the hybrid reasoning + agent-tuned post-training mean a tool-using agent works out of the box without you grafting on a CoT prompt or a separate reasoning model. The dense siblings keep the family deployable from a single GPU up to a serving cluster, which is the part of "open-weights flagship" that the headline 235B model alone hides.

Mistral Large 3 — deep dive

Mistral Large 3 puts the license at the top of the diagram on purpose — that one decision is the part of the architecture a regulated buyer will read first.

Architecture and training bet

Mistral Large 3 is a sparse MoE with 41B active and 675B total parameters, released December 2, 2025 — Mistral's first MoE flagship since the Mixtral series and currently the largest fully open-weights MoE under a permissive license at this scale. It is multimodal in (image inputs through a vision encoder), text-out, covers 40+ languages with deliberate EU-language depth, supports a 128K context, and was trained on NVIDIA's H200 cluster. Base and instruction-tuned variants ship together; a reasoning variant was announced as upcoming at release. On LMArena it debuted at #2 in the OSS non-reasoning category, which puts it in striking distance of closed flagships on general tasks.

Shipping it in production

License is Apache 2.0 — and this is the part Mistral's pitch keeps front and center. There is no MAU cap, no field-of-use restriction, and no acceptable-use policy a lawyer has to re-litigate every quarter. EU-hosted weights and an EU-hosted Mistral AI Studio API are the default deployment path, with Azure AI Foundry, AWS Bedrock, and GCP Vertex as multi-cloud partners. Self-hosting works on vLLM, TGI, Ollama, and llama.cpp; community quantizations cover FP8 and INT4. Of all four flagships, Large 3 is the easiest to clear through procurement and the GDPR/EU AI Act review at the same time.

The axis Mistral Large 3 is betting on

Permissive-license frontier intelligence for regulated jurisdictions. The bet is that the open-weights model that wins inside European banks, telcos, defense suppliers, and healthcare systems is the one whose license, training compute, and hosting story all read clean to a compliance team. The active parameter count (41B, the largest of the four) costs more per token than DeepSeek's stack, but the model is positioned as a frontier-class tier, not the cheapest one. If your buyer cares about where the weights were trained, where they run, and what the license says before they care about MMLU, Large 3 is the model with the simplest answer.

Cross-cutting comparison

Architecture shape — dense vs MoE, active vs total

All four are MoE, but the active-to-total ratios and expert counts say different things about where each lab spends its compute.

All four ship MoE, so the right question is not "MoE vs dense" but how much capacity sits behind each active parameter. Llama 4 Maverick has the tightest active count at 17B with 128 experts, betting that the router can find the right specialist among many at low per-token cost — and Scout pairs that with the longest context window of any open model. DeepSeek V3.2 stretches that strategy further with 256 routed experts plus a shared expert, the finest-grained sparsity in this set, and pairs it with MLA + sparse attention so the per-token win compounds across long contexts. Qwen3 takes the middle ground (22B active, 128 experts) and adds two compensations that nobody else has: a dense ladder from 0.6B to 32B for teams who cannot run the flagship at all, and a hybrid reasoning mode that lets one checkpoint serve both quick and deliberate queries. Mistral Large 3 has the largest active parameter count at 41B — explicitly a frontier-quality bet rather than a cheapest-per-token one — and the smallest total parameter share of the four (675B is just under DeepSeek's 685B, but with a coarser expert split). The shape map is clean: DeepSeek minimizes cost per active parameter, Llama 4 stretches context furthest, Qwen3 covers the deployment ladder, Mistral pays for quality.

Inference economics — context, KV cache, FP8 / quantization

The dimension where shipping at scale gets expensive — KV cache and long-context compute — is where the four pull furthest apart.

Inference economics is the dimension agentic workloads punish hardest: every tool call replays the prefix, every long retrieval inflates the KV cache, and an agent's loop multiplies whatever the per-token cost is by ten or fifty (see the cost ladder in cost, quality, and latency). DeepSeek V3.2 is the one that engineered for this directly — MLA compresses the KV cache ~7×, DeepSeek Sparse Attention prunes long-context attention, FP8 is the training and inference default — so a long-running tool-use agent costs noticeably less per useful turn than the others at comparable quality. Qwen3 gets there by a different route: the hybrid /think mode means easy turns of the agent skip the scratch entirely, so spend matches the difficulty of each step rather than always paying for full reasoning. Llama 4 Scout has the longest context window of the four (10M tokens via iRoPE), which is a feature for some agent shapes — multi-document summarization, codebase-wide reasoning — but a cost trap for others, because the KV cache for a fully populated 10M-token context is enormous; int4 quantization on a single H100 mitigates this only at modest fill. Mistral Large 3 has the highest active parameter count (41B) of the four and is priced like a premium tier — it is not the model you choose because the per-token cost is the lowest; you choose it because the license or jurisdictions story dominates.

License and ecosystem — what you can actually ship and under what terms

The license clause is the part the model card hides but that procurement reads first.

License is the cleanest axis to compare, and it is the one most spec-sheet comparisons skip. Llama 4 ships under the Llama 4 Community License: free below 700M monthly active users, with an acceptable-use policy and a clause requiring derivative model names to begin with "Llama". That is permissive enough for almost every team to start with and restrictive enough that legal will want a look before a hyperscaler product or regulated buyer commits. DeepSeek V3.2 is MIT — the most permissive license of the four on paper — and the practical caveat is geopolitical: many US-regulated buyers will not deploy a Chinese-lab model without an extra security review, and US inference vendors carry V3 thinner than they carry Llama. Qwen3 is Apache 2.0 across the whole family with no MAU clause and no name restriction; it is the cleanest "open-weights flagship" of the four for commercial use in non-restricted jurisdictions, and the Qwen-Agent + ModelScope + Hugging Face combination gives it a deeper ecosystem than its newcomer status suggests. Mistral Large 3 is Apache 2.0 at frontier scale — the largest permissive MoE released by a major lab as of mid-2026 — with EU-hosted weights and a Mistral AI Studio API that ship with GDPR/EU AI Act answers ready. For procurement, the order from easiest to hardest to clear is roughly: Qwen3 ≈ Mistral Large 3 < Llama 4 < DeepSeek V3.2 (where DeepSeek's MIT is more permissive on paper, but the practical clearance bar is higher for US-regulated buyers).

When to pick which

Use case	Pick Llama 4 if…	Pick DeepSeek V3.2 if…	Pick Qwen3 if…	Pick Mistral Large 3 if…
Per-token cost dominates	Only via Scout int4 on a single H100 at modest context.	Yes — MLA, DSA, FP8 are designed for this exact answer.	Hybrid /think saves spend on the easy half of agent turns.	No — priced as a frontier-quality tier, not a cheapest one.
Native multimodal product	Yes — early fusion is the architecture, not an adapter.	Not yet — V3.2 is text-only; vision sibling separate.	Qwen3-VL sibling, post-trained for vision-language.	Image inputs supported; text-out only.
Multilingual agent (non-English users)	12 official languages; English-leaning beyond that.	EN + CJK depth; less coverage outside.	Yes — 119 languages, depth in CJK / South Asian / low-resource.	40+ languages, deliberate EU-language depth.
Regulated jurisdiction / strict procurement	Acceptable below 700M MAU; legal must read the AUP.	MIT on paper; CN-origin review may slow US deployments.	Apache 2.0, clean for commercial use globally.	Yes — Apache 2.0 + EU-hosted weights + GDPR-ready story.
Smallest viable serving footprint	Scout int4 on a single H100 is the headline; Maverick needs more.	Single 8×H200 node serves V3.2 well, smaller via quants.	Dense siblings (0.6B → 32B) cover the small end natively.	675B total needs serious capacity; community quants help.

FAQ

Which of these is actually the "best" open-weights model in mid-2026?

There is no single winner — and treating the question this way is how teams get stuck on benchmark leaderboards instead of shipping. The honest answer is that each model wins on its own axis: DeepSeek V3.2 on per-token cost at frontier quality; Llama 4 on multimodal-in plus the deepest ecosystem; Qwen3 on language coverage plus a deployable dense ladder; Mistral Large 3 on permissive-license frontier quality for regulated buyers. If you have to pick one without context, Qwen3-235B-A22B is the safest default because its weaknesses are the most evenly distributed and its license is the cleanest globally — but "safe default" is not "best for your stack." For the underlying selection mindset see choosing a model and reading benchmarks.

How do these compare with closed-weights frontier models (GPT-5, Claude 4.x, Gemini 2.x)?

On general benchmarks the gap has narrowed to the point where the open-weights flagships are within a small percentage of the closed frontier on most non-reasoning tasks, and the gap on long-horizon agent tasks is wider but closing each release. The real differences are not in MMLU but in (a) the closed frontier's better-trained tool use, multi-step reasoning, and safety post-training; (b) the closed frontier's hosted infrastructure (caching, batching, multi-region) you do not have to operate; (c) the open-weights frontier's hard cost and data-boundary advantages. The clean framing in open vs closed models still applies: the question is not "which is better" but "where do I want the trade-offs to land."

Do I actually need MoE, or is a dense model fine?

Every flagship in this comparison is MoE because at the frontier MoE buys quality per active FLOP — but for many production teams the right answer is still a smaller dense model. Qwen3 makes this explicit by shipping a dense ladder from 0.6B to 32B under the same family; a 14B or 32B dense Qwen3 with the Qwen-Agent SDK runs an agent loop on a single consumer GPU with strong tool-use, which the 235B MoE flagship cannot. Mistral ships a separate Ministral family (3B-14B) for the same reason. Use the MoE flagship when you actually need the quality; use a dense sibling when serving cost or latency dominates. See cost, quality, and latency.

Why is "the axis they're betting on" the thing that matters, not the benchmarks?

Because the benchmarks keep getting passed back and forth — every release flips the leaderboard for a quarter — but the architectural and licensing bets stay stable across versions. A team that picks DeepSeek for inference economics is still going to want DeepSeek-style sparse attention in V4 and V5; a team that picks Mistral for the Apache 2.0 story will still want that license in Large 4. The axis is the durable choice; the benchmark winner is a snapshot. The reading benchmarks primer expands on this trap.

Does open-weights mean I can fine-tune freely?

Mostly yes, with one license-shaped caveat per model. Apache 2.0 (Qwen3, Mistral Large 3) and MIT (DeepSeek V3.2) let you fine-tune, ship derivatives, and even rename freely. Llama 4's Community License lets you fine-tune and ship derivatives, but the derivative model name must begin with "Llama" and the acceptable-use policy travels with the weights. For LoRA-shaped customization those constraints rarely bite; for a full SFT that becomes your product surface, legal should look at it. The SFT, rejection sampling, and distillation deep-dive covers the technical side; the license clause is the part to read line-by-line.

What about Llama 4 Behemoth, DeepSeek V4, or whatever drops next?

Llama 4 Behemoth was previewed in April 2025 but has not been released as open weights as of mid-2026; treat it as roadmap, not product. DeepSeek's release cadence has been one major update every ~four months (V3 → V3.1 → V3.2), so a V4 or V3.3 inside this article's lifetime is a safe bet. Qwen has shown the same cadence (Qwen3 → Qwen3.5 in early 2026 was reported, scaling to higher parameter counts). The point of comparing on axes — multimodality, inference economics, language coverage, license — is that the axis is stable even as the version number ticks. The reasoning carries forward; only the numbers change.

Llama 4 vs DeepSeek V3 vs Qwen3 vs Mistral Large 3: Four Open-Weights Flagships, Four Different Bets

At a glance

Llama 4 — deep dive

Architecture and training bet

Shipping it in production

The axis Llama 4 is betting on

DeepSeek V3.2 — deep dive

Architecture and training bet

Shipping it in production

The axis DeepSeek is betting on

Qwen3 — deep dive

Architecture and training bet

Shipping it in production

The axis Qwen3 is betting on

Mistral Large 3 — deep dive

Architecture and training bet

Shipping it in production

The axis Mistral Large 3 is betting on

Cross-cutting comparison

Architecture shape — dense vs MoE, active vs total

Inference economics — context, KV cache, FP8 / quantization

License and ecosystem — what you can actually ship and under what terms

When to pick which

FAQ

Which of these is actually the "best" open-weights model in mid-2026?

How do these compare with closed-weights frontier models (GPT-5, Claude 4.x, Gemini 2.x)?

Do I actually need MoE, or is a dense model fine?

Why is "the axis they're betting on" the thing that matters, not the benchmarks?

Does open-weights mean I can fine-tune freely?

What about Llama 4 Behemoth, DeepSeek V4, or whatever drops next?

Further reading

On this wiki:

Project sources: