Multilingual output fails in predictable ways. The same handful of mistakes show up across teams, products, and languages, and most of them are invisible to the person who wrote the prompt because they cannot read the result. That invisibility is exactly what makes these errors expensive: they reach customers before anyone notices.
This article names seven failure modes directly. For each, we explain why it happens, what it costs, and the corrective practice that prevents it. None of these fixes are exotic. They are habits you adopt once and then never think about again, which is the whole point.
If you are building a multilingual feature for the first time, read this alongside our step-by-step process so you can design the mistakes out from the start rather than debugging them later.
Mistake 1: Leaving the Output Language to Chance
The most common error is assuming the model will infer the right language from context. It often does not. When the instruction is vague, the model defaults to the input language or to English.
Why it happens
Authors test with input that happens to be in the target language, so the model mirrors it, and they conclude the prompt works. The moment input arrives in a different language, the output language follows the input rather than the requirement.
The fix
Name the output language explicitly and independently of the input language. "Respond in Italian" should not depend on the user having written in Italian.
Mistake 2: Ignoring Regional Variants
Treating "Spanish" or "Portuguese" or "Chinese" as a single thing produces text that feels foreign to a large share of your audience.
The cost
Vocabulary, idiom, and tone differ enough across regions that readers notice immediately. A Mexican customer reading Castilian phrasing perceives a brand that did not bother to localize, which erodes trust.
The fix
Specify the variant and market in the prompt: Brazilian Portuguese, Latin American Spanish, Simplified Chinese. Our Getting Models to Speak Every Language Your Users Do explains why the market, not just the language, drives quality.
Mistake 3: Tolerating Language Drift
Output starts correct, then English words, headings, or whole sentences creep back in.
Why it happens
English is the dominant attractor in most models. In long prompts and multi-turn chats, the language instruction loses influence as generation continues.
The fix
Repeat the language instruction at the end of the prompt, reinforce it in the system message for multi-turn use, and keep any internal reasoning separate from the user-facing answer so reasoning in English does not bleed into output.
Mistake 4: Getting the Formality Wrong
Producing the right language in the wrong register is its own failure. A casual tone where formality is expected reads as disrespectful; excessive formality reads as cold or robotic.
The cost
In languages that grammaticalize politeness, this is not a stylistic nuance, it is a social error that real readers react to. It can make a brand seem careless or even rude.
The fix
Specify the address form and tone explicitly, tied to the relationship: "Address the reader formally, as a business addresses a new customer." Our 7 Common Mistakes with Prompting for Multilingual Output is a starting point, but Prompting for Multilingual Output: Best Practices That Actually Work goes deeper on register.
Mistake 5: Shipping Output No One Can Read
Deploying multilingual content with no evaluation path is the riskiest mistake because errors stay invisible until a customer reports them.
Why it happens
Teams reasonably assume that fluent-sounding output is correct. Models produce confident, grammatical text even when it contains real errors, especially in lower-resource languages.
The fix
Build evaluation in before launch: automated language detection to confirm the language, back-translation to check meaning, and native speaker spot checks for important content. Treat "we cannot read it" as a blocker, not a footnote.
Mistake 6: Forgetting Localization of Formats
Even perfect language fails if dates, currencies, units, and number formats stay in your home market's conventions.
The cost
A price in the wrong currency format or a date in the wrong order causes practical confusion and signals a sloppy localization. In transactional contexts it can cause real errors.
The fix
Instruct the model to localize dates, currency, units, and numbers to the target market, and name the market explicitly so it has the information to do so.
Mistake 7: Breaking Structured Output
When responses must follow a schema, such as JSON, naive prompting causes the model to translate field names along with values, breaking downstream parsing.
Why it happens
The model applies the "translate everything" instruction uniformly because the prompt did not distinguish fixed keys from translatable values.
The fix
State explicitly which parts are translated and which stay fixed: keys in English, values localized. Our A Framework for Prompting for Multilingual Output includes patterns for mixed structured and natural-language output.
How These Mistakes Compound
The mistakes above rarely appear alone. They reinforce each other in ways that make the combined damage worse than any single error.
Invisible errors hide the others
When you have no evaluation path (Mistake 5), every other mistake becomes undetectable. A formality slip, a wrong regional variant, and a broken localization can all coexist in shipped output, and you would never know because no one is reading it in a structured way. This is why teams that fix only one mistake often see no improvement: the underlying problem was that they could not see any of them.
Drift masquerades as a model limitation
Teams frequently conclude that a model simply cannot hold a language, when in fact weak instructions, missing end-of-prompt placement, and reasoning leaking into output are causing the drift. Blaming the model leads to expensive workarounds, like switching providers, when the real fix is a stronger prompt. Diagnosing the cause correctly saves both time and money.
The corrective order that works
Fix the evaluation path first so you can see what is happening. Then address language control and drift, because those affect every output. Localization and structured-output fixes come next, since they tend to be more contained. Working in this order means each fix is verifiable by the evaluation path you established first.
Frequently Asked Questions
Which mistake causes the most damage?
Shipping output no one can read (Mistake 5) tends to be the most costly because it hides every other error. A formality slip or a localization gap is bad, but without an evaluation path you will not even know it happened until customers complain. Build the review loop first.
Is language drift the model's fault or the prompt's?
Both, but you control the prompt. Drift reflects the model's bias toward English, yet explicit instructions, end-of-prompt placement, and system-level reinforcement reliably suppress it. Treat drift as something you manage rather than something you tolerate.
How do I catch these mistakes if I do not speak the language?
Layer automated language detection, back-translation for meaning, and periodic native speaker review. The first two scale and run without a human reader; the third catches the subtle tone and fluency issues the automated checks miss.
Do these mistakes apply equally to all languages?
No. High-resource languages are more forgiving, while lower-resource languages amplify every one of these failure modes, especially fluency errors and drift. Concentrate your review effort where the model is weakest.
Key Takeaways
- Never leave the output language to inference; name it explicitly and independently of the input language.
- Specify regional variants and localize dates, currency, units, and numbers to the target market.
- Suppress English drift with end-of-prompt instructions and system-level reinforcement, and keep reasoning separate from output.
- Get formality right; in many languages it is a social requirement, not a style choice.
- Build an evaluation path before launch, and protect structured output by separating fixed keys from translatable values.