AGENCYSCRIPT

How Models See, Hear, and Speak: The Modality Map

Agency Script Editorial

Editorial Team

June 10, 2024 · 7 min read

For most of the last decade, the working picture of an AI model was simple: you typed words, the model typed words back. That picture is now wrong often enough to be a liability. A modern system can accept a photograph, a PDF, a thirty-second voice memo, and a spreadsheet in a single request, then respond with structured data, a generated image, or synthesized speech. The boundary between "what you put in" and "what you get out" has become a design surface, not a fixed constraint.

Understanding AI model input and output modalities is no longer an academic exercise. It determines what products you can build, how much you pay per request, how fast responses arrive, and where your system will quietly fail. The agencies and teams that ship reliable AI features treat modality as a first-class architectural decision rather than something they discover by accident when a customer uploads a screenshot.

This guide maps the territory end to end. We cover what a modality actually is, the major input and output types you will encounter, how multimodal models fuse them internally, and the practical trade-offs that govern real systems. The goal is fluency: by the end you should be able to look at a product idea and reason clearly about which modalities it needs and what that choice costs you.

What a Modality Actually Means

A modality is a distinct form of data that a model can process or produce: text, images, audio, video, and increasingly structured formats like tables or code. The word matters because each modality has its own representation, its own tokenization or encoding scheme, and its own failure characteristics. Text is discrete and ordered. Images are dense grids of pixels. Audio is a continuous waveform sampled thousands of times per second.

Input versus output is not symmetric

A common beginner assumption is that any modality a model accepts, it can also produce. That is rarely true. Plenty of models accept images as input for analysis but cannot generate them. Many speech-to-text systems consume audio and emit only text. When you scope a feature, separate the two questions: what can this model take in, and what can it give back? Treating them as one question is the fastest way to design something the model cannot do.

If this distinction is new to you, our beginner's walkthrough of input and output types builds the vocabulary from zero.
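
One way to keep the two questions separate is to record inputs and outputs as independent sets and check a feature against each. The following is a minimal sketch, not any vendor's API: `ModelCapabilities`, `missing_capabilities`, and the `vision-reader` model are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    """Input and output modalities tracked as two independent sets."""
    name: str
    inputs: frozenset
    outputs: frozenset


def missing_capabilities(model, needs_in, needs_out):
    """Return (missing_inputs, missing_outputs); both empty means the feature fits."""
    return set(needs_in) - model.inputs, set(needs_out) - model.outputs


# Hypothetical model: reads text and images, writes only text.
vision_reader = ModelCapabilities(
    name="vision-reader",
    inputs=frozenset({"text", "image"}),
    outputs=frozenset({"text"}),
)

# A feature that takes a photo in and must hand an edited photo back.
missing_in, missing_out = missing_capabilities(
    vision_reader, needs_in={"image"}, needs_out={"image"}
)
print(missing_in, missing_out)  # set() {'image'}
```

The model can read the photo, so nothing is missing on the input side, but the output check fails. That is the asymmetry caught at design time rather than at launch.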

The Major Input Modalities

Text

Text remains the backbone. It is cheap to process, easy to validate, and supported by virtually every model. Many "multimodal" pipelines still convert other modalities into text, or into token sequences that sit alongside text, before reasoning over them.

Images

Vision input lets a model read screenshots, diagrams, handwriting, charts, and product photos. The model encodes the image into a sequence of visual tokens that live in the same representation space as text tokens, which is what allows a single prompt to reference both.

Audio

Audio input covers transcription, speaker analysis, tone detection, and direct audio reasoning. Some models transcribe first and reason over the text; newer ones reason over the audio representation directly, which preserves information like emphasis and emotion that transcription throws away.

Video and documents

Video is effectively images plus audio plus time, and it is the most expensive input by a wide margin. Documents (PDFs, slides) sit in between: structured but visually rich, often requiring both layout understanding and text extraction.

The Major Output Modalities

  • Text: still the default, and the easiest output to parse, validate, and store reliably.
  • Structured data: JSON, tables, and schema-constrained output. Technically text, but worth treating separately because it unlocks automation.
  • Images: generated illustrations, edits, and variations.
  • Audio: synthesized speech and, increasingly, music or sound effects.
  • Code: a specialized text output with its own tooling and verification needs.

For output, structured data deserves special attention. When a model returns clean JSON instead of prose, you can pipe it directly into downstream systems. Our framework for reasoning about modality choices treats structured output as the highest-leverage decision most teams overlook.
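
Before piping model output downstream, it helps to treat it as untrusted until it parses and matches the shape you expect. Here is a minimal sketch using only the standard library. `INVOICE_SCHEMA` and `parse_structured` are hypothetical names, and a real system might use a full JSON Schema validator instead.

```python
import json

# Hypothetical schema for illustration: field name -> accepted Python type(s).
INVOICE_SCHEMA = {"vendor": str, "total": (int, float), "currency": str}


def parse_structured(raw, schema):
    """Parse model output as JSON and check required fields and types.

    Returns (data, errors). An empty errors list means the payload passed.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for field: {field}")
    return data, errors


good = '{"vendor": "Acme", "total": 42.5, "currency": "USD"}'
prose = "Sure! The total is 42.50 USD."

print(parse_structured(good, INVOICE_SCHEMA))
print(parse_structured(prose, INVOICE_SCHEMA))
```

The prose reply fails at the parse step, so it never reaches the downstream system. That rejection path is what makes structured output safe to automate against.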

How Multimodal Fusion Works

The reason a single model can juggle a photo and a paragraph is fusion: every input modality gets encoded into a shared internal representation, often called an embedding space. Once an image and a sentence both live as sequences of vectors in the same space, the model's attention mechanism can relate them as if they were one continuous stream.
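
A toy version of that idea: features of different widths are projected to one common width, then placed in a single sequence. This is only a sketch under heavy assumptions. The random matrices stand in for learned encoders, and `SHARED_DIM`, `make_projection`, and `project` are illustrative names, not part of any real model.

```python
import random

SHARED_DIM = 4  # toy width of the shared embedding space


def make_projection(in_dim, out_dim, seed):
    """Random matrix standing in for a learned encoder's projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(vectors, matrix):
    """Map each input vector into the shared space (matrix-vector product)."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix] for vec in vectors]


# Toy inputs: 3 image patches with 6 features each, 2 text tokens with 3 features each.
image_patches = [[0.1 * (i + j) for j in range(6)] for i in range(3)]
text_tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

image_vecs = project(image_patches, make_projection(6, SHARED_DIM, seed=0))
text_vecs = project(text_tokens, make_projection(3, SHARED_DIM, seed=1))

# One sequence of same-width vectors; attention would operate over all of it.
fused_sequence = image_vecs + text_vecs
print(len(fused_sequence), {len(v) for v in fused_sequence})  # 5 {4}
```

Once every element has the same width, nothing downstream needs to know which vectors came from pixels and which came from words.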

Early versus late fusion

Early fusion combines modalities before the main reasoning layers, letting the model build cross-modal understanding from the ground up. Late fusion processes each modality separately and merges results near the end. Early fusion is more powerful and more expensive; late fusion is cheaper and easier to debug. Most frontier systems lean early, but plenty of production pipelines use late fusion because it is predictable.
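
The structural difference is simply where the merge happens. The sketch below uses a plain average as a stand-in for every model, so it shows the shape of each pipeline rather than any real difference in quality. All function names here are hypothetical.

```python
def mean(values):
    """Toy stand-in for a model: the average of its input features."""
    return sum(values) / len(values)


def early_fusion(image_features, text_features, joint_model):
    """Merge first: one model sees the combined features and can relate them."""
    return joint_model(image_features + text_features)


def late_fusion(image_features, text_features, image_model, text_model, merge):
    """Merge last: each modality is processed alone, then results are combined."""
    image_result = image_model(image_features)
    text_result = text_model(text_features)
    return merge(image_result, text_result), {"image": image_result, "text": text_result}


image_features = [0.2, 0.8]
text_features = [0.6, 0.4]

early = early_fusion(image_features, text_features, joint_model=mean)
late, per_modality = late_fusion(
    image_features, text_features, mean, mean, merge=lambda a, b: (a + b) / 2
)
print(round(early, 3), round(late, 3), per_modality)
```

Late fusion hands back a per-modality result alongside the merged one, which is a large part of why it is easier to debug: when the answer is wrong, you can see which branch produced the bad signal.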

The Trade-offs That Govern Real Systems

Every modality decision is a negotiation among three forces: cost, latency, and reliability.

Cost scales with density

Text is cheap. Images cost more because a single image can consume hundreds or thousands of tokens. Video is the most expensive because its cost is roughly the per-image cost multiplied by the number of frames processed. A feature that "just adds video support" can quietly multiply your bill by a factor of fifty.
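
The arithmetic is worth running before you commit to a feature. The numbers below are assumptions chosen for illustration (500 tokens per frame, one sampled frame per second, a 300-token text prompt), not any provider's actual pricing. `estimate_video_tokens` is a hypothetical helper.

```python
def estimate_video_tokens(duration_s, sampled_fps, tokens_per_frame):
    """Rough token estimate: number of sampled frames times per-frame token cost."""
    frames = int(duration_s * sampled_fps)
    return frames * tokens_per_frame


# Assumed values for illustration only; check your provider's real figures.
TOKENS_PER_FRAME = 500
TEXT_PROMPT_TOKENS = 300

clip_tokens = estimate_video_tokens(
    duration_s=30, sampled_fps=1, tokens_per_frame=TOKENS_PER_FRAME
)
print(clip_tokens)                       # 15000
print(clip_tokens / TEXT_PROMPT_TOKENS)  # 50.0
```

Under these assumptions a thirty-second clip costs fifty times the text prompt, and raising the sampling rate raises the bill in direct proportion.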

Latency compounds

Non-text outputs are slow to generate. Image generation and speech synthesis typically add seconds, not milliseconds. If your interface needs to feel instant, push non-text modalities to the background or generate them lazily.
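
A minimal sketch of that pattern with `asyncio`: return the text right away and let the audio finish in the background. `generate_text` and `synthesize_speech` are hypothetical stand-ins that only sleep; a real system would call its model provider here.

```python
import asyncio


async def generate_text(prompt):
    await asyncio.sleep(0.01)  # stand-in for a fast text call
    return f"answer to: {prompt}"


async def synthesize_speech(text):
    await asyncio.sleep(0.05)  # stand-in for slow audio generation
    return b"fake-audio-bytes"


async def respond(prompt):
    text = await generate_text(prompt)
    # Start audio in the background; the caller gets the text without waiting for it.
    audio_task = asyncio.create_task(synthesize_speech(text))
    return text, audio_task


async def main():
    text, audio_task = await respond("summarize this memo")
    text_ready_before_audio = not audio_task.done()
    audio = await audio_task  # awaited later, e.g. when the user presses play
    return text, text_ready_before_audio, audio


text, text_ready_before_audio, audio = asyncio.run(main())
print(text, text_ready_before_audio, len(audio))
```

The interface can render the text as soon as `respond` returns, and the slower modality only blocks at the point where someone actually needs it.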

Reliability drops at the edges

Models are most reliable on text and degrade as inputs get noisier: a clean screenshot reads well, a blurry photo of a receipt does not. Plan validation around the worst input you will actually receive, not the best. The most common modality mistakes almost all trace back to assuming clean inputs.
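
Planning for the worst input usually means checking what the model extracted before trusting it. Below is a sketch for the receipt case, assuming the model returns a dict with `total`, `line_items`, and `date`. The field names and the `check_receipt` and `route` helpers are hypothetical.

```python
def check_receipt(extracted):
    """Return a list of problems found in fields extracted from a receipt photo."""
    problems = []
    total = extracted.get("total")
    items = extracted.get("line_items") or []
    if not isinstance(total, (int, float)) or total <= 0:
        problems.append("total missing or not positive")
    elif items and abs(sum(items) - total) > 0.01:
        problems.append("line items do not add up to total")
    if not extracted.get("date"):
        problems.append("date missing")
    return problems


def route(extracted):
    """Send anything with problems to a human instead of accepting it silently."""
    return "needs_review" if check_receipt(extracted) else "auto_accept"


clean = {"total": 12.50, "line_items": [10.00, 2.50], "date": "2024-06-10"}
blurry = {"total": 72.50, "line_items": [10.00, 2.50], "date": ""}
print(route(clean), route(blurry))  # auto_accept needs_review
```

The blurry photo produced a misread total and no date, so it is routed to review rather than flowing into the books as if it were clean.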

Choosing Modalities for a Real Feature

Start from the user's job, not the model's capability. Ask what the user already has (a photo? a voice note? a form?) and what they actually need back (an answer? a file? an action?). Then pick the smallest set of modalities that connects those two points. Adding a modality you do not need is the most common form of over-engineering in AI products.
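
The "smallest set" rule can be made mechanical: keep only the options that cover the job, then prefer the one with the fewest unused modalities. This is a sketch only. The candidate pipelines and `pick_smallest_fit` are hypothetical, and a real decision would also weigh price and quality.

```python
def pick_smallest_fit(candidates, user_has, user_needs):
    """Pick the candidate that covers the job with the fewest unused modalities."""
    user_has, user_needs = frozenset(user_has), frozenset(user_needs)
    fits = [
        c for c in candidates
        if user_has <= c["inputs"] and user_needs <= c["outputs"]
    ]
    if not fits:
        return None
    return min(
        fits,
        key=lambda c: len(c["inputs"] - user_has) + len(c["outputs"] - user_needs),
    )


# Hypothetical pipeline options for illustration.
candidates = [
    {"name": "text_only", "inputs": frozenset({"text"}),
     "outputs": frozenset({"text", "structured"})},
    {"name": "vision_to_json", "inputs": frozenset({"text", "image"}),
     "outputs": frozenset({"text", "structured"})},
    {"name": "everything", "inputs": frozenset({"text", "image", "audio", "video"}),
     "outputs": frozenset({"text", "structured", "image", "audio"})},
]

# The user has a photo and needs structured data back.
choice = pick_smallest_fit(candidates, user_has={"image"}, user_needs={"structured"})
print(choice["name"])  # vision_to_json
```

The do-everything option also fits, but it carries six modalities the job never uses, each one a source of cost, latency, and failure modes.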

When you do need richer modalities, prototype the hardest path first. If your product depends on reading messy handwritten input, test that on day one. Do not build the happy path and discover the real requirement at launch. Our collection of real-world modality examples shows how this plays out across support, sales, and operations use cases.

Frequently Asked Questions

Is every large model multimodal now?

No. Many capable models are still text-only, and some accept images but not audio or video. Always check the specific model's supported input and output modalities before designing around them, because marketing language often overstates coverage.

Does accepting a modality mean the model can generate it?

Not automatically. Input and output capabilities are separate. A model might read images perfectly while being unable to produce a single one, so scope the two directions independently.

Why is video so much more expensive than text?

Video is a stack of images sampled over time, and each frame the model processes carries the full token cost of an image. A short clip can contain hundreds of frames, so video requests routinely cost dozens of times more than an equivalent text prompt.

What is the safest output modality to start with?

Structured text, specifically schema-constrained JSON. It is parseable, testable, and storable, which means you can build reliable automation on top of it before you ever touch images or audio.

How do I decide which modalities my product needs?

Map what the user already has against what they need back, then choose the smallest set of modalities that bridges the gap. Resist adding modalities for novelty; each one adds cost, latency, and failure modes.

Key Takeaways

  • A modality is a distinct data form (text, image, audio, video, structured data), and each has its own cost, latency, and reliability profile.
  • Input and output modalities are independent; never assume a model can produce what it can consume.
  • Fusion via a shared embedding space is what lets one model reason across a photo and a sentence at once.
  • Cost scales with data density, latency compounds for non-text outputs, and reliability drops as inputs get noisier.
  • Choose modalities by mapping what the user has to what they need, and prototype the hardest path first.
