Abstract advice about AI model input and output modalities only goes so far. The decisions that matter become obvious the moment you watch them play out in a real feature: which input the team chose, what output shape they demanded, and the single detail that separated a feature that worked from one that quietly fell apart.
This article walks through six concrete scenarios drawn from the kinds of features agencies and product teams build every week. Each one pairs an input modality with an output modality for a specific job, and each one carries a lesson you can transfer to your own work. None of them are exotic; they are the ordinary, useful applications where modality choices have real consequences.
Read them less as recipes and more as case sketches. The point is to internalize the reasoning, so that when you face a similar choice, the right modality mix feels obvious instead of arbitrary.
Scenario 1: Voice Memo to Action Items
A sales team records quick voice memos after every call. The feature accepts the audio and returns a structured list of action items with owners and due dates.
Input: audio. Output: structured data.
The detail that made it work was reasoning over the audio directly rather than transcribing first and reasoning over the transcript later. Direct audio handling preserved emphasis and tone, which improved how the model prioritized action items. The structured output meant the items dropped straight into the team's task system. Had the feature returned free-form text instead, the team would have spent more effort parsing the result than the feature saved. The best-practices guide makes structured output the default for exactly this reason.
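A minimal sketch of why structured output removes the parsing burden. The JSON shape, the field names (`task`, `owner`, `due`), and the sample reply are all assumptions made for illustration; the team's real schema is not described here.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    task: str
    owner: str
    due: date


def parse_action_items(raw_json: str) -> list[ActionItem]:
    """Validate a model's JSON reply and convert it to typed records."""
    payload = json.loads(raw_json)  # raises ValueError on malformed JSON
    items = []
    for entry in payload["action_items"]:  # raises KeyError if the key is missing
        items.append(
            ActionItem(
                task=entry["task"],
                owner=entry["owner"],
                due=date.fromisoformat(entry["due"]),  # expects YYYY-MM-DD
            )
        )
    return items


# Hypothetical model reply, hard-coded so the example runs on its own.
sample_reply = (
    '{"action_items": [{"task": "Send pricing sheet", '
    '"owner": "Dana", "due": "2025-03-14"}]}'
)
items = parse_action_items(sample_reply)
print(items[0].owner, items[0].due.isoformat())
```

Because every field is either present and typed or the parse fails loudly, the task system never has to guess at what a sentence of prose meant.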
Scenario 2: Photographed Receipt to Expense Entry
An operations team photographs paper receipts and needs them turned into expense records with vendor, amount, date, and category.
Input: image. Output: structured data.
This one nearly failed. The early demo used clean scans and worked perfectly; real receipts arrived crumpled, dim, and angled. The fix was to build a corpus of deliberately bad photos and validate against it, then add a fallback that flagged low-confidence extractions for human review. The lesson is the one from our common-mistakes article: test the worst input first, because the easy input hides the failure.
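The fallback can be sketched as a simple routing rule. Everything specific here is an assumption: the 0 to 1 `confidence` score, the 0.85 threshold, and the sample corpus are invented for illustration, and how a confidence value is obtained depends on your extraction step.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cutoff; tune it against your bad-photo corpus


@dataclass
class Extraction:
    vendor: str
    amount: float
    confidence: float  # assumed 0-1 score supplied by the extraction step


def route(extraction: Extraction) -> str:
    """Return 'auto' to accept the record or 'review' to queue it for a person."""
    if extraction.confidence < REVIEW_THRESHOLD:
        return "review"
    return "auto"


def review_rate(corpus: list[Extraction]) -> float:
    """Share of a test corpus that would fall back to human review."""
    flagged = sum(1 for e in corpus if route(e) == "review")
    return flagged / len(corpus)


# Invented results standing in for a run over deliberately bad photos.
bad_photo_corpus = [
    Extraction("Cafe Luna", 12.50, 0.97),
    Extraction("Unknown", 4.00, 0.41),       # crumpled receipt
    Extraction("Metro Cabs", 23.10, 0.62),   # dim, angled photo
    Extraction("Paper Supply Co", 88.00, 0.91),
]
print(review_rate(bad_photo_corpus))
```

Tracking the review rate on the bad-photo corpus gives a number to watch: if it climbs after a prompt or model change, the feature is leaning harder on its human safety net.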
Scenario 3: Screenshot to Bug Report
A support team pastes screenshots of error states along with a short typed note, and the feature reads both and produces a structured bug report with the observed error, likely cause, and reproduction steps.
Input: image and text. Output: structured data.
The strength here was multimodal fusion. The screenshot supplied the visual error and the support agent's typed note supplied context, and the model reasoned over both together. Neither modality alone would have produced a useful report. This is the practical payoff of the shared embedding space described in the definitive guide: a single prompt that references both a picture and a sentence.
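As a rough illustration of "a single prompt that references both a picture and a sentence", the sketch below bundles the two inputs into one request. The payload shape and field names (`parts`, `media_type`, `expected_fields`) are hypothetical and do not match any particular provider's API; map them onto whatever schema your model expects.

```python
import base64


def build_bug_report_request(screenshot_png: bytes, agent_note: str) -> dict:
    """Bundle one screenshot and one typed note into a single request payload.

    The field names are hypothetical, not a real provider schema.
    """
    return {
        "parts": [
            {
                "type": "image",
                "media_type": "image/png",
                "data": base64.b64encode(screenshot_png).decode("ascii"),
            },
            {"type": "text", "text": agent_note},
        ],
        # Fields the structured bug report should contain.
        "expected_fields": ["observed_error", "likely_cause", "reproduction_steps"],
    }


# Placeholder bytes stand in for a real screenshot file.
request = build_bug_report_request(
    b"\x89PNG placeholder bytes",
    "Crash after clicking Export on the invoices page.",
)
print([part["type"] for part in request["parts"]])
```

The point of the shape is that both inputs travel in one request, so the model can reason over them together instead of in two disconnected calls.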
Scenario 4: Document to Plain-Language Summary
A client-services team feeds long PDFs (contracts, reports) and needs concise plain-language summaries for non-expert readers.
Input: document. Output: text.
The interesting decision was keeping the output as text rather than structured data, because a human read it directly and no software consumed it. This is the exception to the structured-output rule: when the consumer is a person, prose is the right shape. The team also capped document length to control cost, since long documents consume large numbers of tokens. Knowing the per-modality cost let them set the cap deliberately.
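A length cap like this can be sketched as a pre-flight check on the extracted text. The figures are assumptions: four characters per token is only a rough heuristic for English prose, not a real tokenizer, and the 50,000-token budget is a made-up number standing in for whatever limit a team chooses.

```python
CHARS_PER_TOKEN = 4        # rough heuristic for English text, not an exact tokenizer
MAX_INPUT_TOKENS = 50_000  # hypothetical budget; set it from your own cost targets


def estimate_tokens(text: str) -> int:
    """Estimate token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def check_document(text: str) -> tuple[bool, int]:
    """Return (within_budget, estimated_tokens) for already-extracted text."""
    tokens = estimate_tokens(text)
    return tokens <= MAX_INPUT_TOKENS, tokens


short_ok, short_tokens = check_document("x" * 1_000)
long_ok, long_tokens = check_document("x" * 400_000)
print(short_ok, short_tokens)
print(long_ok, long_tokens)
```

Rejecting or trimming an over-budget document before the request is sent keeps the cost decision in the team's hands rather than discovering it on the invoice.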
Scenario 5: Text Brief to Generated Image
A marketing team writes short briefs and wants draft illustrations to spark concepts, not final assets.
Input: text. Output: image.
This worked precisely because the team scoped it as "draft, not final." Image generation is slow and imperfect, so they ran it in the background and presented results as starting points. Had they promised polished deliverables, the latency and inconsistency would have sunk the feature. They also confirmed up front that their model could generate images, not merely read them, which is the distinction our beginner's guide hammers on.
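Running generation in the background might look something like the sketch below, which uses Python's standard `concurrent.futures` thread pool. The `generate_draft` function is a stand-in that only sleeps and returns a fake file name; a real image-generation call would take its place.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)


def generate_draft(brief: str) -> str:
    """Stand-in for a slow image-generation call; returns a fake file name."""
    time.sleep(0.1)  # simulate generation latency
    return f"draft-{len(brief)}.png"


def submit_brief(brief: str) -> Future:
    """Queue the brief and return immediately so the caller is not blocked."""
    return executor.submit(generate_draft, brief)


future = submit_brief("Autumn sale banner, warm colors")
# The caller is free to do other work here while the draft is generated.
result = future.result(timeout=5)  # collect the draft when it is actually needed
print(result)
executor.shutdown()
```

Returning a `Future` right away is what lets the interface say "your draft is on its way" instead of freezing while the image renders.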
Scenario 6: Long Call Recording to Structured Brief
A research team records hour-long customer interviews and needs each one distilled into a structured brief: key themes, notable quotes, and open questions.
Input: audio. Output: structured data.
The challenge was length. A full hour of audio is an expensive input, so the team faced a real cost trade-off. They resolved it by reasoning over the audio in segments and merging the structured results, which kept each request affordable while preserving the whole conversation. The lesson is that long inputs are not just a quality problem; they are a cost problem, and the per-modality budgeting from our step-by-step process is what made the segmentation decision deliberate rather than reactive.
There was also a quality subtlety. By reasoning over the audio directly rather than a flat transcript, the model could attribute quotes to the right speaker and pick up hesitation that signaled an unresolved concern. A transcript would have flattened those cues into plain text and lost them.
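The segment-and-merge approach can be sketched in two small functions. The 15-minute segment length, the brief's key names, and the sample partial results are assumptions for illustration; the per-segment model call itself is omitted.

```python
def segment_bounds(total_seconds: int, segment_seconds: int) -> list[tuple[int, int]]:
    """Split a recording into consecutive (start, end) windows in seconds."""
    return [
        (start, min(start + segment_seconds, total_seconds))
        for start in range(0, total_seconds, segment_seconds)
    ]


def merge_briefs(partials: list[dict]) -> dict:
    """Combine per-segment briefs, keeping order and dropping exact duplicates."""
    merged = {"themes": [], "quotes": [], "open_questions": []}
    for partial in partials:
        for key in merged:
            for item in partial.get(key, []):
                if item not in merged[key]:
                    merged[key].append(item)
    return merged


bounds = segment_bounds(total_seconds=3600, segment_seconds=900)

# Invented per-segment results standing in for real model output.
partials = [
    {"themes": ["pricing"], "quotes": ["It felt expensive."], "open_questions": []},
    {"themes": ["pricing", "onboarding"], "quotes": [], "open_questions": ["Who approves budget?"]},
]
brief = merge_briefs(partials)
print(bounds)
print(brief["themes"])
```

Exact-match deduplication is deliberately naive: two segments may phrase the same theme differently, so a real merge step would likely need a final pass to consolidate near-duplicates.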
What the Six Scenarios Share
Across all six, the successes came from the same handful of moves. Each team picked the smallest modality mix that solved the job. Each demanded structured output wherever software consumed the result and allowed prose only where a human did. Each one that touched images tested messy inputs and built fallbacks. And each respected the cost and latency of rich modalities by capping inputs or backgrounding slow outputs.
The failures, where they appeared, came from the opposite: clean-input optimism, missing fallbacks, and unscoped promises. Notice that none of these lessons are about the model's raw intelligence. They are about how the modalities were wired together. That is almost always where real features are won or lost.
There is one more pattern worth naming. In every successful scenario, the team matched their promise to the modality's real limits rather than its best-case behavior. The receipt feature promised accuracy with a human safety net, not flawless extraction. The image-generation feature promised drafts, not final art. The document summarizer promised a readable digest, not a substitute for reading the contract. Overpromising on a modality is how features that work technically still fail in users' eyes, because the gap between what was promised and what the modality can reliably deliver becomes the user's disappointment. Scoping the promise honestly is as much a modality decision as choosing the inputs and outputs themselves, and it is the one most often skipped.
Frequently Asked Questions
Why use structured output in some scenarios but not others?
The deciding factor is the consumer. When software processes the result, structured output prevents fragile parsing. When a human reads it directly, as in the document summary, prose is the appropriate shape. Match the output form to whoever consumes it.
Was reasoning over audio directly really better than transcribing first?
In the voice-memo scenario, yes. Direct audio reasoning preserved tone and emphasis that transcription discards, which helped the model prioritize action items. Transcribe-first is simpler and cheaper, so the right choice depends on whether those audio cues matter for your task.
How did the receipt feature avoid failing in production?
By testing deliberately bad photos before launch and adding a human-review fallback for low-confidence extractions. The clean demo would have shipped a feature that broke on real receipts; the messy-input corpus surfaced the problem while it was still cheap to fix.
What made the image-generation feature succeed?
Scoping it as drafts rather than final assets. Generated images are slow and imperfect, so the team backgrounded the work and framed outputs as starting points. Matching the promise to the modality's real limits is what kept users satisfied.
Can I combine more than one input modality in one feature?
Yes. The bug-report scenario combines image and text input. The caution is to add each modality only when it earns its place, since every one adds cost, latency, and a new failure mode. Combine deliberately, not reflexively.
Key Takeaways
- Match the output shape to its consumer: structured data for software, prose for humans.
- Image-based features live or die on messy-input testing and sensible fallbacks.
- Multimodal fusion lets a screenshot and a typed note combine into something neither could produce alone.
- Respect the cost and latency of rich modalities by capping inputs and backgrounding slow outputs.
- The best features use the smallest modality mix that solves the actual job.