Ask any team that has shipped a customer-facing assistant in more than one language, and you will hear the same handful of complaints. The model slips back into English halfway through. It translates a brand name that should never be touched. It produces grammatically correct Spanish that no native speaker would ever say. These are not exotic edge cases — they are the default failure modes when you treat multilingual generation as an afterthought.
This article collects the questions practitioners actually ask when they start prompting for non-English output, and answers each one with patterns you can use immediately. The goal is not theory. It is to give you the specific phrasing, structure, and guardrails that move you from "mostly works" to "ships in production."
Throughout, we assume you are working with a modern instruction-tuned model that has meaningful multilingual coverage. The techniques apply whether you are generating support replies, marketing copy, or structured data with localized fields.
Why Does the Model Keep Reverting to English?
This is the single most reported problem, and it almost always traces back to one of three causes.
The Instruction Is Buried
If your language instruction sits in the middle of a 600-word system prompt, the model weights it less than the immediate content of the user message. When the user writes in English, the model mirrors that language. The fix is to make the target language the most recent and most explicit instruction before generation begins.
The Examples Are in English
Few-shot examples are powerful, and that power cuts both ways. If every example in your prompt shows English input and English output, you have effectively trained the model — within that single request — to produce English. Localize your examples to the target language, or at minimum show the input-output pair in the language you want back.
No Explicit Constraint on Reverting
Models hedge. Given ambiguity, they default to the highest-probability language in their training distribution, which is usually English. State the constraint as a hard rule: "Respond only in Brazilian Portuguese. Do not include any English words except proper nouns and untranslatable technical terms." Naming the exceptions you will tolerate prevents over-correction, like a model refusing to keep a product name intact.
For a deeper treatment of structuring these constraints, see Building a Repeatable Workflow for Prompting for Multilingual Output.
How Do I Specify the Language Precisely?
"Write in Spanish" is ambiguous. Spanish for Spain differs from Spanish for Mexico in vocabulary, formality, and even punctuation conventions.
Use Locale Codes and Region Names
Specify the variant explicitly: "Mexican Spanish (es-MX)" or "European French (fr-FR)." The locale code anchors the model, and the human-readable region name reinforces it. This matters most for languages with large regional spreads — Spanish, Portuguese, Arabic, and Chinese among them.
State the Register Separately
Language and formality are independent axes. Decide whether you want the formal or informal second person, then say so: "Use the formal register (usted)." German, Japanese, Korean, and French all encode social distance grammatically, and getting it wrong reads as either cold or presumptuous.
Should I Translate or Generate Natively?
A frequent strategic question. There are two approaches, and they produce different results.
Translation Pipeline
You generate in English, then translate. This is predictable and easy to review, but it carries the structure and idioms of the source language. The output often reads as translated — technically fine, subtly foreign.
Native Generation
You prompt the model to think and write directly in the target language. This produces more idiomatic results because the model is not anchored to an English scaffold. The trade-off is harder review if your team does not read the language. The middle path many teams choose is covered in The Prompting for Multilingual Output Playbook.
How Do I Keep Certain Terms Untranslated?
Brand names, product names, legal terms, and code identifiers should usually stay fixed across languages.
Provide a Do-Not-Translate List
Inline a short glossary: "Keep these terms exactly as written in any language: Acme Cloud, OAuth, webhook." Models respect explicit lists far more reliably than vague instructions to "preserve technical terms."
Wrap Protected Spans
For structured pipelines, wrap fixed terms in a sentinel like [[Acme Cloud]] and strip the brackets in post-processing. This gives you a deterministic guarantee rather than relying on model compliance.
How Do I Verify Quality Without Speaking the Language?
You cannot fully solve this without native review, but you can catch a large share of problems automatically.
Round-Trip Checks
Translate the output back to English with a separate call and compare meaning against the source. Large semantic drift signals a problem worth human review. This will not catch tone or register issues, but it reliably catches dropped content and mistranslations.
Language Detection Gates
Run a language-detection library on the output. If the detected language is not the target, reject and retry. This single guardrail eliminates the embarrassing case of English leaking into a Japanese response.
Sample Human Review
Budget for a native speaker to review a rotating sample. Automated checks find structural failures; only a human catches the subtle unnaturalness that erodes trust. Common pitfalls here are catalogued in Prompting for Multilingual Output: Best Practices That Actually Work.
Does Output Quality Vary by Language?
Yes, substantially, and pretending otherwise leads to uneven user experiences.
High-Resource Versus Low-Resource
Languages with abundant training data — Spanish, French, German, Chinese, Japanese — produce strong output. Lower-resource languages show more grammatical errors, awkward phrasing, and occasional fabrication. Calibrate your review intensity to the resource level of each target language.
Script and Direction Considerations
Right-to-left scripts like Arabic and Hebrew introduce rendering and formatting complications that have nothing to do with the model. Test your full pipeline, including the display layer, not just the raw text. For a tour of concrete scenarios, see Prompting for Multilingual Output: Real-World Examples and Use Cases.
Frequently Asked Questions
Can one prompt handle many languages at once?
It can, but reliability drops as you add languages. A safer pattern is a single template with the target language injected as a variable, run once per language. This keeps each generation focused and makes per-language review tractable. Reserve true multi-language single prompts for low-stakes content.
Will giving the instruction in the target language help?
Often, yes. Writing your system instruction in the target language nudges the model into that language's distribution before generation even starts. A hybrid works well: state the rules in English for clarity, then add a short directive in the target language as the final line.
How do I handle mixed-language input from users?
Decide on a policy and encode it. Common choices are to respond in the language of the majority of the input, respond in a fixed default, or detect and mirror the user's language. Whichever you pick, state it explicitly in the prompt so the model does not guess.
Do emojis and formatting transfer across languages?
Formatting transfers, but conventions differ. Date formats, number separators, and quotation marks vary by locale. If you need locale-correct formatting, instruct the model explicitly or handle it in post-processing rather than assuming the model localizes these details.
Is fine-tuning worth it for one target language?
Usually not as a first step. Prompt engineering and a good glossary solve most problems. Consider fine-tuning only when you have high volume in a single language, a consistent house style, and measurable quality gaps that prompting cannot close.
Key Takeaways
- The model reverts to English when the instruction is buried, the examples are English, or no hard constraint forbids reverting — fix all three.
- Specify language with locale codes and region names, and state register separately from language.
- Decide deliberately between translation pipelines and native generation; each has distinct review trade-offs.
- Protect brand and technical terms with explicit do-not-translate lists or sentinel wrapping.
- Verify with round-trip checks and language-detection gates, but budget for native human review on a sample.
- Expect quality to vary by language resource level, and calibrate review effort accordingly.