Multilingual Prompts in the Wild: What Worked, What Broke

Abstract advice about multilingual prompting only goes so far. To make the patterns stick, it helps to see them play out in specific situations: a real task, a real prompt decision, and a clear account of why the output landed well or poorly. This article walks through several concrete scenarios drawn from common product and content workflows.

Each example pairs a setup with the prompt choice that mattered and the outcome it produced. Where a scenario went wrong, we trace the failure to its cause and show the change that fixed it. The aim is to give you a mental library of cases you can pattern-match against when you hit something similar.

The scenarios span customer support, marketing content, structured data, and multi-turn conversation, because each surfaces a different facet of the multilingual problem.

Scenario 1: Customer Support Replies Across Markets

A support tool drafts replies to tickets that arrive in many languages.

What worked

The team's prompt named the output language explicitly based on a detected language field rather than relying on the model to mirror the ticket. It also instructed the model to address the customer formally and to localize any dates or order numbers to the customer's market. Replies in Spanish, German, and Japanese all read as native and appropriately polite.

What broke at first

Early versions inferred the language from the ticket text. When customers wrote partly in English, the model replied in English even though the customer's account was set to French. Pinning the output language to the account setting, not the message content, fixed it. Our Seven Ways Multilingual Prompts Quietly Go Wrong calls this exact trap out as Mistake 1.

Scenario 2: Marketing Copy With Regional Nuance

A campaign needed product descriptions in Latin American Spanish and Brazilian Portuguese.

What worked

The prompt specified the market, not just the language, and asked the model to adapt idioms by meaning and avoid culture-specific references that would not travel. The Brazilian copy used phrasing distinct from European Portuguese, which is exactly what the audience expected.

The lesson

When the first draft used generic "Spanish," a reviewer in Mexico flagged several word choices as feeling European. Adding the market to the prompt resolved it without any other change, reinforcing the point that the market is the real unit of localization.

Scenario 3: Structured Output With Translated Values

A product catalog needed JSON where field values were translated but keys stayed fixed for the database.

What worked

The prompt stated explicitly that keys must remain in English and only the values should be localized. The output parsed cleanly and the localized values displayed correctly in the storefront.

What broke at first

An earlier prompt said simply "translate this into German," and the model dutifully translated the JSON keys too, breaking the parser downstream. Separating fixed keys from translatable values is a small instruction with a large payoff, covered further in our A Framework for Prompting for Multilingual Output.

Scenario 4: A Low-Resource Language

A team needed output in a language with limited training coverage.

What worked partially

The output sounded fluent but a native reviewer caught invented words and an inconsistent register. Providing a short glossary of correct terms and two high-quality example sentences in the target language improved accuracy noticeably.

The honest outcome

Even with scaffolding, quality remained below the team's bar for customer-facing use, so they routed that language to a professional translation service while keeping direct generation for their high-resource languages. Knowing when to stop pushing the model is itself a skill.

Scenario 5: Multi-Turn Conversation Drift

A chat assistant needed to hold one language across a long conversation.

What worked

Moving the language and formality requirements into the system instruction kept Korean consistent across a dozen turns. Without that, the assistant had drifted into English by the third or fourth exchange.

Why placement mattered

End-of-prompt instructions handle a single generation well but fade across turns. System-level placement is what made the constraint persist, a distinction our Hard-Won Habits for Multilingual AI That Holds Up treats as a core practice.

Scenario 6: Reasoning in One Language, Answering in Another

A task required careful analysis before a Japanese-language answer.

What worked

The prompt let the model reason internally in English, where its analysis was sharper, then output only the final answer in Japanese. The reasoning quality stayed high and the user never saw the English working.

The caveat

The team had to verify the reasoning truly stayed hidden; an early version leaked a stray English sentence into the answer. Explicitly instructing the model to output only the final answer in the target language closed the gap.

Scenario 7: Email Subject Lines With Length Limits

A marketing team needed localized email subject lines that stayed within a character budget.

What worked

The prompt named the target language and market, asked for an idiomatic subject line rather than a literal translation, and stated a hard character limit. The model adapted the message to fit, sometimes rephrasing entirely rather than truncating.

What broke at first

An early prompt translated the English subject line directly, which blew past the limit in German, where compound words run long, and read awkwardly in Japanese. Asking for an idiomatic line within the limit, rather than a translation, let the model compose something that both fit and sounded natural. The lesson is that constraints like length interact with language, and the model handles that interaction far better when you let it compose fresh.

Scenario 8: A Mixed-Language Source Document

A team needed to summarize documents that contained two languages and produce a single-language summary.

What worked

The prompt instructed the model to read the source in whatever languages it appeared and produce the summary entirely in one named target language. Stating the output language explicitly, independent of the messy multilingual input, kept the summary clean and consistent.

Why it mattered

Without the explicit output-language instruction, the summary inherited the mix of the source, switching languages partway through. Decoupling output language from input language, a theme across these scenarios, was again the fix.

Frequently Asked Questions

What is the most common thread across these examples?

Explicitness. In nearly every case, the failure came from leaving something to inference, the language, the market, the boundary between keys and values, and the fix came from stating it directly. Multilingual prompting rewards saying exactly what you want.

When did direct generation lose to translation?

In the low-resource language scenario, where the model produced fluent but inaccurate text the team could not bring up to standard. There, a professional translation service was the right call. Direct generation is a strong default, not a universal answer.

How did teams catch errors in languages they could not read?

Through layered review: automated language detection, back-translation for meaning, and native reviewers for tone and fluency. The Spanish market-fit issue and the low-resource invented words were both caught by native reviewers, which is why that step is non-negotiable for customer-facing content.

Do these patterns transfer across different models?

The principles do. The specific behaviors, how strongly a model drifts, how well it handles a given language, vary by model, so retest your prompts when you switch. The habits of explicitness, market targeting, and layered evaluation hold regardless of which model you use.

Key Takeaways

Pin the output language to a reliable signal like an account setting, not the language of the incoming message.
Specify the market, not just the language, so regional vocabulary and tone come out right.
For structured output, state which fields are translated and which stay fixed to protect downstream parsing.
Low-resource languages may need glossaries, examples, or a professional service when generation quality falls short.
Persist language and formality in the system message for multi-turn chats, and keep internal reasoning separate from the answer.

The scenarios span customer support, marketing content, structured data, and multi-turn conversation, because each surfaces a different facet of the multilingual problem.

Scenario 1: Customer Support Replies Across Markets

A support tool drafts replies to tickets that arrive in many languages.

What worked

What broke at first

Scenario 2: Marketing Copy With Regional Nuance

A campaign needed product descriptions in Latin American Spanish and Brazilian Portuguese.

What worked

The lesson

Scenario 3: Structured Output With Translated Values

A product catalog needed JSON where field values were translated but keys stayed fixed for the database.

What worked

The prompt stated explicitly that keys must remain in English and only the values should be localized. The output parsed cleanly and the localized values displayed correctly in the storefront.

What broke at first

Scenario 4: A Low-Resource Language

A team needed output in a language with limited training coverage.

What worked partially

The honest outcome

Scenario 5: Multi-Turn Conversation Drift

A chat assistant needed to hold one language across a long conversation.

What worked

Why placement mattered

Scenario 6: Reasoning in One Language, Answering in Another

A task required careful analysis before a Japanese-language answer.

What worked

The caveat

Scenario 7: Email Subject Lines With Length Limits

A marketing team needed localized email subject lines that stayed within a character budget.

What worked

What broke at first

Scenario 8: A Mixed-Language Source Document

A team needed to summarize documents that contained two languages and produce a single-language summary.

What worked

Why it mattered

Frequently Asked Questions

What is the most common thread across these examples?

When did direct generation lose to translation?

How did teams catch errors in languages they could not read?

Do these patterns transfer across different models?

Key Takeaways

Pin the output language to a reliable signal like an account setting, not the language of the incoming message.
Specify the market, not just the language, so regional vocabulary and tone come out right.
For structured output, state which fields are translated and which stay fixed to protect downstream parsing.
Low-resource languages may need glossaries, examples, or a professional service when generation quality falls short.
Persist language and formality in the system message for multi-turn chats, and keep internal reasoning separate from the answer.

Multilingual Prompts in the Wild: What Worked, What Broke

Scenario 1: Customer Support Replies Across Markets

What worked

What broke at first

Scenario 2: Marketing Copy With Regional Nuance

What worked

The lesson

Scenario 3: Structured Output With Translated Values

What worked

What broke at first

Scenario 4: A Low-Resource Language

What worked partially

The honest outcome

Scenario 5: Multi-Turn Conversation Drift

What worked

Why placement mattered

Scenario 6: Reasoning in One Language, Answering in Another

What worked

The caveat

Scenario 7: Email Subject Lines With Length Limits

What worked

What broke at first

Scenario 8: A Mixed-Language Source Document

What worked

Why it mattered

Frequently Asked Questions

What is the most common thread across these examples?

When did direct generation lose to translation?

How did teams catch errors in languages they could not read?

Do these patterns transfer across different models?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?

Multilingual Prompts in the Wild: What Worked, What Broke

Scenario 1: Customer Support Replies Across Markets

What worked

What broke at first

Scenario 2: Marketing Copy With Regional Nuance

What worked

The lesson

Scenario 3: Structured Output With Translated Values

What worked

What broke at first

Scenario 4: A Low-Resource Language

What worked partially

The honest outcome

Scenario 5: Multi-Turn Conversation Drift

What worked

Why placement mattered

Scenario 6: Reasoning in One Language, Answering in Another

What worked

The caveat

Scenario 7: Email Subject Lines With Length Limits

What worked

What broke at first

Scenario 8: A Mixed-Language Source Document

What worked

Why it mattered

Frequently Asked Questions

What is the most common thread across these examples?

When did direct generation lose to translation?

How did teams catch errors in languages they could not read?

Do these patterns transfer across different models?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?