Modalities & multimodal models

E3

Modalities: text, vision, audio, code, and "multimodal."

This entry defines what a model's modalities are, why "multimodal" is a spectrum rather than a checkbox, and how to reason about which inputs and outputs a model can actually handle — so you pick a model that can attempt your problem at all, before you worry about how well it does it.

STEP 1

Modality = the kind of data, on input and on output.

A modality is a type of data: text, images, audio, video, and so on. The crucial subtlety is that input modalities and output modalities are separate. A model can accept images but only produce text (very common), or accept text and produce audio, or accept and produce several. When a spec says "multimodal," always ask: multimodal in, out, or both?

  • Text. The baseline. Nearly every large language model is text-in / text-out at its core.
  • Vision (image input). The model accepts images alongside text — charts, screenshots, photos, documents, diagrams — and reasons about them in text. Now common in frontier families.
  • Audio. Speech (and sometimes general sound) as input, output, or both. When the model handles audio natively, this enables low-latency voice interfaces without a separate speech-to-text stage.
  • Video. Sequences of frames, sometimes with audio. The most demanding input modality; support is more uneven.
  • Code. Not a separate sensory modality but worth treating as one in practice — code has its own evaluation regime, and "good at code" is a distinct capability axis from "good at prose."
  • Image / audio generation. Producing pixels or waveforms, often handled by specialized generative models rather than a general LLM, though the lines are blurring.
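The "in, out, or both?" question can be made mechanical. The following is a minimal sketch, assuming you record each model's modalities yourself from its model card; the ModalityProfile class and the model names are hypothetical and do not come from any vendor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityProfile:
    """Hypothetical record of what a model accepts and what it produces."""
    name: str
    inputs: frozenset
    outputs: frozenset

    def multimodal_direction(self) -> str:
        # Treat text as the baseline; any non-text modality counts as "multimodal".
        non_text_in = bool(self.inputs - {"text"})
        non_text_out = bool(self.outputs - {"text"})
        if non_text_in and non_text_out:
            return "both"
        if non_text_in:
            return "in"
        if non_text_out:
            return "out"
        return "neither"


# Illustrative, made-up profiles (not real products).
vision_chat = ModalityProfile("example-vision-chat", frozenset({"text", "image"}), frozenset({"text"}))
text_to_speech = ModalityProfile("example-tts", frozenset({"text"}), frozenset({"audio"}))
voice_agent = ModalityProfile("example-voice", frozenset({"text", "audio"}), frozenset({"text", "audio"}))

for profile in (vision_chat, text_to_speech, voice_agent):
    print(profile.name, "->", profile.multimodal_direction())
# example-vision-chat -> in
# example-tts -> out
# example-voice -> both
```

Keeping inputs and outputs as two separate sets is the point of the sketch: a single "multimodal: yes/no" flag cannot express the common image-in, text-out case.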

STEP 2

"Multimodal" is a spectrum, not a binary.

Two systems can both be called "multimodal" and work very differently:

Pipelined (adapter) multimodality

Separate components are chained: a speech-to-text model transcribes audio, a text LLM reasons over the transcript, a text-to-speech model speaks the answer. Each stage is swappable and debuggable, but information is lost at each boundary — tone, hesitation, and overlapping speech do not survive transcription, and latency stacks up.

Natively (jointly) multimodal

A single model is trained so that different modalities share a representation space. Such a model can reason across modalities — e.g. relate what is said to what is shown, or preserve tone of voice — and typically responds with lower latency because there are no inter-stage hops. The trade-off is less inspectability of the intermediate steps.

This distinction matters for design. If your application needs to reason about how something was said, or to ground an answer in a specific region of an image, native multimodality is doing real work that a pipeline cannot replicate, because the transcript or caption passed between stages has already discarded that information. If you mostly need transcription plus text reasoning, a pipeline is cheaper, more debuggable, and lets you swap each stage independently.
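The pipelined design can be sketched as three swappable callables. This is an illustration under stated assumptions, not a real speech stack: the VoicePipeline class and the three stub stages are hypothetical stand-ins for whatever speech-to-text, LLM, and text-to-speech services you would actually call.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class VoicePipeline:
    """Hypothetical three-stage pipeline; every stage is an independent callable."""
    transcribe: Callable[[bytes], str]   # speech-to-text
    reason: Callable[[str], str]         # text LLM
    synthesize: Callable[[str], bytes]   # text-to-speech

    def run(self, audio: bytes) -> bytes:
        # Only plain text crosses this boundary: tone, hesitation, and
        # overlapping speech are gone once the transcript exists.
        transcript = self.transcribe(audio)
        reply = self.reason(transcript)
        return self.synthesize(reply)


# Stub stages so the sketch runs without any external service.
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_reason(text: str) -> str:
    return f"You said: {text}"


def stub_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stub_transcribe, stub_reason, stub_synthesize)
print(pipeline.run(b"hello"))  # b'You said: hello'

# Swapping one stage leaves the other two untouched.
shouting = replace(pipeline, reason=lambda text: text.upper())
print(shouting.run(b"hello"))  # b'HELLO'
```

Each stage can be tested and replaced in isolation, which is the debuggability benefit described above. The cost shows in the type signatures: the reasoning stage only ever sees a str.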

STEP 3

Practical implications you will actually hit.

  • Tokens still apply to non-text. Images and audio are converted into model tokens too. A high-resolution image can cost hundreds to thousands of tokens; long audio adds up fast. Multimodal context is not free — budget it like any other context.
  • Capability is uneven across modalities. A model strong on text-and-vision may be weaker on audio, and benchmarks are usually per-modality. "Multimodal" on the spec sheet does not promise uniform quality across all of them.
  • Output modality constrains architecture. Most "multimodal" LLMs are multimodal in and text out. If you need generated images or speech, you typically reach for a different, specialized model and orchestrate the two.
  • Vision is not OCR. Vision models reason about images holistically; they are excellent at "what is happening in this screenshot" but can still misread fine print or dense tables. For high-stakes exact text extraction, verify or pair with a dedicated extraction step.
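The token-budget point above can be checked with simple arithmetic. The sketch below assumes you already have a per-item token estimate from your provider's documentation or tokenizer; every number here is a made-up placeholder, and the ContextItem class and budget function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str
    tokens: int  # placeholder estimate; real costs depend on the model and resolution


def budget(items, context_window: int, reserved_for_output: int):
    """Return (tokens used by inputs, tokens still free after reserving output space)."""
    used = sum(item.tokens for item in items)
    remaining = context_window - reserved_for_output - used
    return used, remaining


# Illustrative request: all figures are assumptions, not measured values.
request = [
    ContextItem("text prompt", 800),
    ContextItem("screenshot", 1500),
    ContextItem("audio clip", 3000),
]

used, remaining = budget(request, context_window=8000, reserved_for_output=1000)
print(used, remaining)  # 5300 1700

if remaining < 0:
    print("Over budget: downscale the image, trim the audio, or split the request.")
```

With these placeholder figures the two non-text items account for 4500 of the 5300 input tokens, which is why they belong in the same budget as the text.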

Modality support is one of the fastest-moving parts of the landscape. Capabilities that were research demos a year ago — real-time speech, video understanding, long-document vision — ship in mainstream APIs on a months-long cadence. Re-check the current model card rather than relying on what was true last cycle.

STEP 4

The decision question.

Before comparing quality or price, answer a gating question: what goes in, and what must come out? List your real input modalities (do users send screenshots? voice? PDFs?) and your required output (text? structured data? speech?). That eliminates most of the candidate list immediately and prevents the classic mistake of benchmarking models that cannot ingest your data at all. Only among the survivors do cost, quality, and latency become the deciding factors, which the next entries address.
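The gating question amounts to a subset check on two sets. A minimal sketch, assuming a hand-maintained table of candidates; the model names and their modality sets are invented for illustration and should be replaced with what the current model cards actually state.

```python
def can_attempt(model_inputs: set, model_outputs: set,
                required_inputs: set, required_outputs: set) -> bool:
    """True only if the model accepts every required input and can emit every required output."""
    return required_inputs <= model_inputs and required_outputs <= model_outputs


# Hypothetical candidates: name -> (accepted inputs, produced outputs).
candidates = {
    "model-a": ({"text"}, {"text"}),
    "model-b": ({"text", "image"}, {"text"}),
    "model-c": ({"text", "image", "audio"}, {"text", "audio"}),
}

# Example requirement: users send screenshots with a question and need a text answer.
required_inputs = {"text", "image"}
required_outputs = {"text"}

survivors = [
    name
    for name, (inputs, outputs) in candidates.items()
    if can_attempt(inputs, outputs, required_inputs, required_outputs)
]
print(survivors)  # ['model-b', 'model-c']
```

Only the survivors go on to be compared on cost, quality, and latency; model-a is dropped before any benchmark is run.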