Multilingual Prompting: How to Fix Non-English LLM Outputs

Most of us assume that if a large language model (LLM) is smart in English, it should be equally smart in Spanish, Japanese, or Arabic. The reality is often disappointing. You ask for a summary in French, and the model gives you a hallucinated mess. You request code comments in German, and the grammar falls apart. This isn't just bad luck; it's a structural consequence of how most current models are trained. Their training data is heavily skewed toward English, so other languages get far fewer examples to learn from.

This gap creates a frustrating experience for developers and businesses trying to deploy AI globally. But there is good news. You don't need to retrain the entire model to fix this. By changing your multilingual prompting strategy (specifically, how you structure instructions, context, and reasoning steps), you can improve accuracy, reduce hallucinations, and get more fluent results in dozens of languages. Here is how to do it right, based on research published between 2023 and 2025.

The Problem: Why Non-English Outputs Fail

To fix the output, you first have to understand why it breaks. Benchmarks like MultiChallenge, analyzed by language technology experts at LILT, show that models suffer substantial drops in accuracy when moving away from English. This isn't because the model is "stupid" in other languages. It’s because of three main factors:

Data Imbalance: There is far less high-quality text available for low-resource languages compared to English. The model simply hasn't seen enough examples to learn robust patterns.
Tokenization Issues: Non-Latin scripts, such as logographic systems (Chinese) or abugidas (Thai), are often split into more tokens than Latin-script text. The model then spends more tokens on the same amount of information, which dilutes its attention and fills the context window faster.
Morphology and Syntax: Complex grammatical structures in languages like Arabic or Hindi are poorly represented in standard training corpora.
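
The tokenization point can be made concrete with a rough proxy. Many current tokenizers build tokens up from UTF-8 bytes, so the sketch below simply counts bytes per character. It does not call a real tokenizer, and actual token counts depend on the specific model's vocabulary. The helper name `utf8_cost` and the sample words are illustrative assumptions.

```python
def utf8_cost(text: str) -> dict:
    """Return character count, UTF-8 byte count, and bytes per character."""
    encoded = text.encode("utf-8")
    return {
        "chars": len(text),
        "bytes": len(encoded),
        "bytes_per_char": len(encoded) / len(text),
    }


# Illustrative greetings only; this is a proxy, not a tokenizer benchmark.
samples = {"English": "hello", "Chinese": "你好", "Thai": "สวัสดี"}

for language, word in samples.items():
    print(language, utf8_cost(word))
```

Thai and common Chinese characters take three bytes each in UTF-8, versus one byte for an ASCII letter. A tokenizer whose merges were learned mostly from English text therefore has more raw units to compress on these scripts, which is one plausible reason the same content can cost more tokens.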

When you prompt directly in these languages without specific techniques, the model relies on weak signals. The result? Less accurate answers, broken fluency, and higher rates of fabrication. Multilingual prompting is the workaround that bridges this gap without touching the model weights.

Cross-Lingual Thought Prompting (XLT)

One of the most effective methods discovered so far is Cross-Lingual Thought Prompting, or XLT. Introduced in research published in EMNLP Findings 2023, XLT turns the model’s weakness into a strength by leveraging its superior English reasoning capabilities.

The core idea is simple: instruct the model to use English as its internal "brain" while communicating with you in your target language. Think of it like a translator who thinks in their native tongue but speaks yours perfectly. The XLT template asks the model to:

Assume the role of an expert in the target language.
Translate or restate the task internally in English.
Perform step-by-step logical reasoning (Chain-of-Thought) in English.
Produce the final answer in the target language.

This is a parameter-frozen method, meaning you don’t change the model at all. You just change the prompt. In experiments, this approach improved performance in arithmetic reasoning and open-domain question answering by over 10 points on average across various languages. It also significantly reduced the performance gap between high-resource and low-resource languages.

Here is a practical example of how to structure an XLT prompt for a legal assistant in Brazilian Portuguese:

"You are an expert legal assistant fluent in Brazilian Portuguese. To ensure accuracy, please follow these steps: 1. Internally translate the user's query into English. 2. Reason through the legal implications step-by-step in English. 3. Translate your final conclusion back into clear, professional Brazilian Portuguese. User Query: [Insert Query Here]"

This forces the model to access its strongest reasoning pathways before generating the final output. Just be aware that Chain-of-Thought prompts increase token length and latency, so reserve this for tasks where accuracy matters more than speed.
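
The XLT steps above can be sketched as a reusable template. This is a minimal sketch, not the exact template from the paper: the wording of `XLT_TEMPLATE`, the function name `build_xlt_prompt`, and its parameters are assumptions made for illustration.

```python
# Hypothetical template modelled on the four XLT steps described above.
XLT_TEMPLATE = """You are an expert {role} fluent in {target_language}.
To ensure accuracy, please follow these steps:
1. Restate the user's request in English.
2. Reason through the problem step by step in English.
3. Give your final answer in clear, professional {target_language}.

User query: {query}"""


def build_xlt_prompt(query: str, target_language: str, role: str = "assistant") -> str:
    """Fill the XLT-style template for one user query."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return XLT_TEMPLATE.format(
        role=role,
        target_language=target_language,
        query=query.strip(),
    )


prompt = build_xlt_prompt(
    "Posso rescindir o contrato antes do prazo?",
    target_language="Brazilian Portuguese",
    role="legal assistant",
)
print(prompt)
```

Keeping the template in one place makes it easy to A/B test the XLT version against a direct target-language prompt and measure whether the extra tokens pay off for your task.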


Selective Pre-Translation: Don't Translate Everything

Another common mistake is translating the entire prompt into English before sending it to the model, or vice versa. Research from Bar-Ilan University, published in February 2025 by Itai Mondshine, Tzuf Paz-Argaman, and Reut Tsarfaty, shows that "selective pre-translation" works better than full translation or direct inference.

The key insight is that different parts of a prompt serve different purposes. You shouldn't treat them all the same way. The optimal strategy depends heavily on whether your task is extractive (pulling facts from text) or generative (creating new content).

Optimal Prompt Component Strategy by Task Type
Task Type	Instructions	Context/Source Text	Examples	Output Language
Extractive (QA, NER)	Flexible (English or source language)	Keep in source language	Keep in source language	Source language (for NER in low-resource languages, use English entity labels)
Generative (Summarization)	Translate to English	Translate to English	Translate to English	Generate in English, then machine-translate back to the source language

For extractive tasks like Question Answering or Named Entity Recognition (NER), keeping the context in the source language is crucial. If you translate the source text, you might distort the exact spans needed for extraction. Similarly, providing few-shot examples in the source language gives the model a strong signal of how to behave in that specific linguistic context.

However, for generative tasks like summarization, the opposite is true. Translating everything to English, generating the summary in English, and then using a high-quality machine translation (MT) system to convert it back yields better results, especially for low-resource languages. This is because the model’s stylistic control and safety alignment are strongest in English.
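
One way to keep these rules out of people's heads and in the codebase is a small planning helper. The sketch below encodes the table above as data. The names `PromptPlan` and `plan_prompt_languages`, and the `low_resource` flag, are assumptions for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptPlan:
    """Which language to use for each part of the prompt."""
    instructions: str
    context: str
    examples: str
    output: str
    ner_labels: Optional[str]      # only meaningful for extractive NER tasks
    translate_output_back: bool    # True when an MT step must follow generation


def plan_prompt_languages(task_type: str, source_language: str,
                          low_resource: bool = False) -> PromptPlan:
    """Map a task type to per-component languages, following the table above."""
    if task_type == "extractive":
        return PromptPlan(
            instructions=f"English or {source_language}",
            context=source_language,       # keep spans intact for extraction
            examples=source_language,
            output=source_language,
            ner_labels="English" if low_resource else source_language,
            translate_output_back=False,
        )
    if task_type == "generative":
        return PromptPlan(
            instructions="English",
            context="English",
            examples="English",
            output="English",
            ner_labels=None,
            translate_output_back=True,    # machine-translate the result back
        )
    raise ValueError(f"unknown task type: {task_type!r}")


print(plan_prompt_languages("extractive", "Zulu", low_resource=True))
print(plan_prompt_languages("generative", "Zulu"))
```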

Boosting Diversity and Reducing Hallucinations

Beyond accuracy, multilingual prompting can help you control the quality and diversity of the output. A 2025 arXiv study on "Multilingual Prompting for Improving LLM Generation Diversity" found that the choice of prompt language is itself a lever you can use to steer what the model generates.

If you find your model repeating itself or hallucinating facts in a non-English language, try running the same query in multiple languages and aggregating the results. For instance, prompt the model in both English and the target language, compare the outputs, and select the most consistent answer. This can reduce fabricated information, because an answer has to hold up across different linguistic representations before you accept it. It's particularly useful for safety-oriented pipelines where factual integrity is paramount, and the varied outputs can also feed creative brainstorming.
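
The compare-and-select step can be sketched as a simple vote. This is one possible aggregation rule (majority voting over normalized answers), not a method taken from the study. It assumes the answers are short and language-independent once normalized, such as numbers, dates, or names; free-form text would need to be translated to a common language first. `ask_model` is a hypothetical callable standing in for whatever LLM client you use.

```python
from collections import Counter
from typing import Callable, Dict, Mapping, Tuple


def _normalize(answer: str) -> str:
    """Default normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def most_consistent_answer(
    prompts_by_language: Mapping[str, str],
    ask_model: Callable[[str], str],
    normalize: Callable[[str], str] = _normalize,
) -> Tuple[str, float, Dict[str, str]]:
    """Ask the same question in several languages and keep the majority answer.

    Returns (winning answer, share of languages that agreed, raw answers).
    """
    if not prompts_by_language:
        raise ValueError("need at least one prompt")
    answers = {lang: ask_model(prompt) for lang, prompt in prompts_by_language.items()}
    counts = Counter(normalize(answer) for answer in answers.values())
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers), answers
```

A low agreement share is a useful signal in its own right: rather than trusting the winner, you can route that query to a human reviewer or a retrieval-backed check.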


Industry Pipelines: The T-LM Approach

While individual prompts help, enterprises often need scalable solutions. Companies like ModernMT have deployed production systems such as T-LM (Translation-Layer Model). This approach embeds multilingual prompting into the workflow architecture.

In a T-LM pipeline:

Pre-processing: User inputs in any language are translated into English using a domain-adapted MT engine.
Reasoning: The English prompt is sent to a powerful English-optimized LLM (like GPT-4) to perform the task.
Post-processing: The English output is translated back into the user’s original language.

This mirrors the XLT logic but externalizes the translation layer. It allows companies to reuse their existing English-centric workflows for customer support or knowledge access in dozens of languages. However, remember the warning from the selective pre-translation research: translation quality is critical. Poor MT quality will degrade your results, regardless of how good the LLM is. Always use terminology databases or glossaries to preserve domain consistency during the back-translation phase.
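
The three stages can be sketched with the MT engine and the LLM passed in as plain callables. This is a generic illustration of the translate, reason, translate-back pattern, not ModernMT's implementation. `translate`, `run_llm`, and the `glossary` post-edit are hypothetical; the glossary is applied with naive string replacement, whereas production systems usually enforce terminology inside the MT engine.

```python
from typing import Callable, Mapping, Optional

# (text, source_language, target_language) -> translated text
Translate = Callable[[str, str, str], str]


def translate_layer_pipeline(
    user_text: str,
    user_language: str,
    translate: Translate,
    run_llm: Callable[[str], str],
    glossary: Optional[Mapping[str, str]] = None,
) -> str:
    """Translate to English, run the LLM in English, translate the answer back."""
    if user_language == "en":
        return run_llm(user_text)  # nothing to translate

    # Pre-processing: bring the input into English.
    english_input = translate(user_text, user_language, "en")

    # Reasoning: the LLM only ever sees English.
    english_output = run_llm(english_input)

    # Post-processing: back to the user's language, then enforce preferred terms.
    localized = translate(english_output, "en", user_language)
    for term, preferred in (glossary or {}).items():
        localized = localized.replace(term, preferred)
    return localized
```

Injecting the two callables keeps the pipeline testable with stubs and lets you swap MT engines per language pair without touching the orchestration code.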

Practical Checklist for Better Non-English Outputs

So, what should you do today? Here is a quick decision tree to guide your prompt engineering efforts:

Is the task Extractive (QA/NER)? Keep the source text and examples in the original language. Only translate instructions if necessary. For NER in low-resource languages, ask for entity labels in English to leverage stronger label semantics.
Is the task Generative (Summary/Creative)? Translate instructions, context, and examples to English. Generate the output in English, then translate it back using a high-quality MT tool.
Do you need complex reasoning? Use the XLT template. Explicitly ask the model to reason in English before outputting the target language.
Are you seeing hallucinations? Try multi-language prompting. Run the query in English and the target language, then compare the responses and keep the answer they agree on.

By treating language not just as a carrier of content but as a structural component of your prompt, you can get noticeably better performance. You don't need to wait for the next generation of fully multilingual models. With these techniques, you can get much closer to English-level quality in many languages right now.

What is multilingual prompting?

Multilingual prompting is the practice of strategically choosing the languages and structure of instructions, context, and examples in prompts to Large Language Models (LLMs) to improve the quality, accuracy, and fluency of non-English outputs. It leverages the model's stronger English capabilities to bridge gaps in lower-resource languages.

Why do LLMs perform worse in non-English languages?

LLMs often underperform in non-English languages due to training data imbalances (less high-quality text available), tokenization inefficiencies for certain scripts, and poor representation of complex morphology and syntax in training corpora. This leads to lower accuracy and higher hallucination rates.

How does Cross-Lingual Thought Prompting (XLT) work?

XLT is a prompt template that instructs the LLM to act as an expert in the target language, internally translate the task to English, perform step-by-step reasoning in English, and then produce the final answer in the target language. This utilizes the model's superior English reasoning skills to boost performance in other languages.

Should I translate my entire prompt to English?

Not always. For extractive tasks like Question Answering, keeping the context and examples in the source language is better. For generative tasks like summarization, translating the entire prompt to English, generating in English, and then translating back often yields higher quality results, especially for low-resource languages.

Can multilingual prompting reduce hallucinations?

Yes. Research suggests that using multiple languages in prompts (e.g., querying in both English and the target language) and aggregating the results can reduce hallucination rates. This forces the model to align its factual claims across different linguistic representations, leading to more reliable outputs.

10 Comments

Saranya M.L.
June 13, 2026 AT 04:56

As a native speaker of several low-resource languages, I find the tokenization argument particularly reductive. It is not merely about 'token count' but about the semantic density inherent in agglutinative and fusional morphologies that English simply does not possess. When you force a model to reason in English, you are effectively stripping away the cultural and linguistic nuance that defines the target language's logic. The XLT method is essentially a colonialist workaround for lazy engineering. We need models trained on high-quality local corpora, not just translated English garbage.
Bineesh Mathew
June 13, 2026 AT 08:02

The tragedy of modern AI is that it treats language as a mere container for data rather than the very fabric of thought itself. To suggest we should think in English to understand Hindi or Arabic is to admit our intellectual bankruptcy. We are building digital Babels where the tower crumbles because the foundation is built on the shifting sands of Anglo-centric bias. It is a moral failure disguised as technical optimization. One wonders if the silicon gods will ever forgive us for this linguistic imperialism.
om gman
June 13, 2026 AT 18:31

lol everyone here acting like english reasoning is some sacred temple. its just stats. if your model cant handle hindi syntax without translating to english first then your training data sucks. stop making excuses for bad engineering and start fixing the datasets. also xlt adds latency which kills real time apps so its useless for most production environments
Edward Nigma
June 14, 2026 AT 20:30

You're all missing the point entirely. The article isn't saying English is superior; it's saying the current architecture is biased. Using XLT is a pragmatic hack, not a philosophical statement. If you want pure multilingual reasoning, wait another decade. Until then, use the tools available. Stop being contrarian for the sake of it and look at the benchmarks. The accuracy gains are real regardless of your feelings about linguistic imperialism.
Jeanne Abrahams
June 16, 2026 AT 16:37

Oh, spare me the dramatics about 'silicon gods.' I'm a developer in Cape Town working with Zulu and Afrikaans, and let me tell you, when your customer support bot hallucinates legal advice because it tried to translate idioms literally, you don't have time for philosophy. You fix the prompt. The selective pre-translation strategy for extractive tasks saved my last project. Keep the source text in the original language for NER, translate instructions to English. It works. Use it or cry about it.
Robert Barakat
June 17, 2026 AT 02:08

The essence of communication lies not in the perfection of the vessel but in the intent of the message. These techniques are merely scaffolding for a structure that has yet to find its true form. We impose order on chaos through language, and now we ask machines to do the same. Perhaps the flaw is not in the model, but in our expectation of uniformity across diverse human expressions.
Francis Laquerre
June 18, 2026 AT 21:00

I have been implementing the T-LM approach in our enterprise workflow for French and Japanese markets, and the results are transformative. By externalizing the translation layer, we maintain consistency in terminology while leveraging the robust reasoning capabilities of English-optimized models. It is crucial, however, to invest heavily in domain-specific glossaries during the back-translation phase. Without this, the nuance is lost, and the output feels sterile. This is not a compromise; it is an evolution of how we interact with global audiences.
michael rome
June 18, 2026 AT 22:40

It is imperative that we consider the ethical implications of these prompting strategies. While efficiency is paramount, we must ensure that the quality of information delivered to non-English speakers is not diminished by these workarounds. The checklist provided is excellent for immediate implementation, but long-term solutions require more inclusive training data. Let us collaborate to build systems that respect linguistic diversity while maintaining high standards of accuracy and reliability for all users.
Andrea Alonzo
June 20, 2026 AT 21:57

I really appreciate how this article breaks down the technical aspects into actionable steps, especially for those of us who are still navigating the complexities of multilingual AI deployment. It can be quite overwhelming to figure out whether to translate everything or keep parts in the source language, but the distinction between extractive and generative tasks makes so much sense once you see it laid out clearly. I’ve started experimenting with the XLT template for our legal documents, and while there is a slight increase in processing time, the improvement in logical coherence is undeniable. It’s reassuring to know that we don’t have to wait for perfect models to provide better service to our global clients, and I hope more developers share their experiences with these techniques so we can all learn from each other’s successes and failures in this evolving landscape.
Oskar Falkenberg
June 22, 2026 AT 02:13

Thats a really good point about the tokenization issues with logographic scripts. Ive noticed that chinese prompts often hit context limits faster than english ones even when the content is similar. The tip about aggregating results from multiple languages to reduce hallucinations is brilliant. I tried running a query in both german and english and comparing the outputs, and it did catch a factual error that the single language prompt missed. Its a bit more expensive in terms of api calls but worth it for critical tasks. Just wish the latency wasnt so high for real time applications though