Most of us assume that if a large language model (LLM) is smart in English, it should be equally smart in Spanish, Japanese, or Arabic. The reality is often disappointing. You ask for a summary in French, and the model gives you a hallucinated mess. You request code comments in German, and the grammar falls apart. This isn't just bad luck; it's a structural flaw in how most current models are trained. Their training data is heavily skewed toward English, so other languages have far less material to learn from.
This gap creates a frustrating experience for developers and businesses trying to deploy AI globally. But there is good news. You don't need to retrain the entire model to fix this. By changing your multilingual prompting strategy (specifically, how you structure instructions, context, and reasoning steps), you can improve accuracy, reduce hallucinations, and get more fluent results across many languages. Here is how to do it right, based on research published between 2023 and 2025.
The Problem: Why Non-English Outputs Fail
To fix the output, you first have to understand why it breaks. Benchmarks like MultiChallenge, analyzed by language technology experts at LILT, show that models suffer substantial drops in accuracy when moving away from English. This isn't because the model is "stupid" in other languages. It’s because of three main factors:
- Data Imbalance: There is far less high-quality text available for low-resource languages compared to English. The model simply hasn't seen enough examples to learn robust patterns.
- Tokenization Issues: Tokenizers built mostly from English text split non-Latin scripts, such as logographic Chinese or the Thai abugida, into more tokens than equivalent Latin-script text. The model has to process more tokens to cover the same amount of information, which consumes context and spreads its attention thinner.
- Morphology and Syntax: Complex grammatical structures in languages like Arabic or Hindi are poorly represented in standard training corpora.
When you prompt directly in these languages without specific techniques, the model relies on weak signals. The result? Less accurate answers, broken fluency, and higher rates of fabrication. Multilingual prompting is the workaround that bridges this gap without touching the model weights.
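To get a rough feel for the tokenization point above, the sketch below compares how many UTF-8 bytes each character needs in a few scripts. This is only a proxy: byte-level tokenizers work on UTF-8 bytes, but real token counts depend on the specific tokenizer and its vocabulary, which this snippet does not model. The sample words are illustrative choices, not taken from any benchmark.

```python
# Rough proxy only: UTF-8 bytes per character for a few scripts.
# Real token counts depend on the tokenizer and its learned vocabulary.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Thai": "สวัสดี",
    "Chinese": "你好",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


for language, text in samples.items():
    print(f"{language}: {utf8_bytes_per_char(text):.1f} bytes per character")
```

ASCII letters take one byte each, Arabic letters two, and Thai and Chinese characters three, so a tokenizer with few learned merges for a script starts from a longer byte sequence for the same number of characters.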
Cross-Lingual Thought Prompting (XLT)
One of the most effective methods discovered so far is Cross-Lingual Thought Prompting, or XLT. Introduced in research published in EMNLP Findings 2023, XLT turns the model’s weakness into a strength by leveraging its superior English reasoning capabilities.
The core idea is simple: instruct the model to use English as its internal "brain" while communicating with you in your target language. Think of it like a translator who thinks in their native tongue but speaks yours perfectly. The XLT template asks the model to:
- Assume the role of an expert in the target language.
- Translate or restate the task internally in English.
- Perform step-by-step logical reasoning (Chain-of-Thought) in English.
- Produce the final answer in the target language.
This is a parameter-frozen method, meaning you don’t change the model at all. You just change the prompt. In experiments, this approach improved performance in arithmetic reasoning and open-domain question answering by over 10 points on average across various languages. It also significantly reduced the performance gap between high-resource and low-resource languages.
Here is a practical example of how to structure an XLT prompt for a legal assistant in Brazilian Portuguese:
"You are an expert legal assistant fluent in Brazilian Portuguese. To ensure accuracy, please follow these steps: 1. Internally translate the user's query into English. 2. Reason through the legal implications step-by-step in English. 3. Translate your final conclusion back into clear, professional Brazilian Portuguese. User Query: [Insert Query Here]"
This forces the model to access its strongest reasoning pathways before generating the final output. Just be aware that Chain-of-Thought prompts increase token length and latency, so reserve this for tasks where accuracy matters more than speed.
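If you build these prompts programmatically, a small helper keeps the structure consistent across languages. The following is a minimal sketch, not the template from the XLT paper: the function name `build_xlt_prompt`, its parameters, and the sample query are all hypothetical, and the wording simply mirrors the example prompt above.

```python
def build_xlt_prompt(role: str, target_language: str, user_query: str) -> str:
    """Assemble an XLT-style prompt: expert role, English restatement,
    English step-by-step reasoning, answer in the target language."""
    steps = [
        "Restate the user's query in English.",
        "Reason through the problem step by step in English.",
        f"Write your final answer in clear, professional {target_language}.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"You are an expert {role} fluent in {target_language}.\n"
        "To ensure accuracy, please follow these steps:\n"
        f"{numbered}\n\n"
        f"User Query: {user_query}"
    )


# Hypothetical query, used only to show the assembled prompt.
prompt = build_xlt_prompt(
    role="legal assistant",
    target_language="Brazilian Portuguese",
    user_query="Quais são os prazos para contestar uma multa de trânsito?",
)
print(prompt)
```

Keeping the role, language, and query as parameters means the same reasoning scaffold can be reused for other domains and target languages without rewriting the instructions.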
Selective Pre-Translation: Don't Translate Everything
Another common mistake is translating the entire prompt into English before sending it to the model, or vice versa. Research from Bar-Ilan University, published in February 2025 by Itai Mondshine, Tzuf Paz-Argaman, and Reut Tsarfaty, shows that "selective pre-translation" works better than full translation or direct inference.
The key insight is that different parts of a prompt serve different purposes. You shouldn't treat them all the same way. The optimal strategy depends heavily on whether your task is extractive (pulling facts from text) or generative (creating new content).
| Task Type | Instructions | Context/Source Text | Examples | Output Language |
|---|---|---|---|---|
| Extractive (QA, NER) | Flexible (English or source language) | Keep in source language | Keep in source language | Source language (for NER in low-resource languages, use English entity labels) |
| Generative (Summarization) | Translate to English | Translate to English | Translate to English | Generate in English, then machine-translate back to the source language |
For extractive tasks like Question Answering or Named Entity Recognition (NER), keeping the context in the source language is crucial. If you translate the source text, you might distort the exact spans needed for extraction. Similarly, providing few-shot examples in the source language gives the model a strong signal of how to behave in that specific linguistic context.
However, for generative tasks like summarization, the opposite is true. Translating everything to English, generating the summary in English, and then using a high-quality machine translation (MT) system to convert it back yields better results, especially for low-resource languages. This is because the model’s stylistic control and safety alignment are strongest in English.
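The table can be turned into a small routing function that decides, per prompt component, which language to use. This is a sketch under assumptions: `PromptPlan` and `plan_prompt` are hypothetical names, the "flexible" instruction language for extractive tasks is arbitrarily fixed to English here, and the English-label exception for low-resource NER is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPlan:
    instructions: str            # language of the instruction text
    context: str                 # language of the source text / context
    examples: str                # language of the few-shot examples
    output: str                  # language the model is asked to generate in
    translate_output_back: bool  # run MT on the model output afterwards?


def plan_prompt(task_type: str, source_language: str) -> PromptPlan:
    """Pick per-component languages following the selective pre-translation table."""
    if task_type == "extractive":
        # Keep context and examples untouched so extracted spans stay exact.
        # Instructions are "flexible"; English is an arbitrary choice here.
        return PromptPlan(
            "English", source_language, source_language, source_language, False
        )
    if task_type == "generative":
        # Work fully in English, then machine-translate the result back.
        return PromptPlan("English", "English", "English", "English", True)
    raise ValueError(f"Unknown task type: {task_type!r}")


print(plan_prompt("extractive", "Hebrew"))
print(plan_prompt("generative", "Hebrew"))
```

In a real pipeline the returned plan would drive which components are passed through a translation step before the prompt is assembled.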
Boosting Diversity and Reducing Hallucinations
Beyond accuracy, multilingual prompting can help you control the quality and diversity of the output. A 2025 arXiv study on "Multilingual Prompting for Improving LLM Generation Diversity" found that language choice is an active control dimension.
If you find your model repeating itself or hallucinating facts in a non-English language, try running the same query in multiple languages and aggregating the results. For instance, prompt the model in both English and the target language, compare the outputs, and select the most consistent answer. This can reduce fabricated output, since a claim that holds up across several linguistic representations is less likely to be a one-off hallucination. It's particularly useful for creative brainstorming or safety-oriented pipelines where factual integrity is paramount.
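One simple way to "select the most consistent answer" is a majority vote over the per-language responses. The sketch below assumes the answers are already comparable as strings (for example short factual answers, or outputs translated into one language beforehand); the function names and the sample answers are hypothetical, and this is not the aggregation method of the cited study.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't split votes."""
    return " ".join(answer.lower().split())


def most_consistent_answer(answers_by_language: dict[str, str]) -> tuple[str, float]:
    """Return the most frequent normalized answer and the share of prompts agreeing on it."""
    if not answers_by_language:
        raise ValueError("Need at least one answer to aggregate.")
    counts = Counter(normalize(a) for a in answers_by_language.values())
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers_by_language)


# Hypothetical model outputs for the same question asked in three languages.
answers = {"English": "Canberra", "Spanish": " canberra", "Japanese": "Sydney"}
answer, agreement = most_consistent_answer(answers)
print(f"{answer} (agreement: {agreement:.2f})")
```

A low agreement score is a useful signal in its own right: rather than picking a winner, a pipeline can route such cases to a human reviewer or a retrieval step.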
Industry Pipelines: The T-LM Approach
While individual prompts help, enterprises often need scalable solutions. Companies like ModernMT have deployed production systems such as T-LM (Translation-Layer Model). This approach embeds multilingual prompting into the workflow architecture.
In a T-LM pipeline:
- Pre-processing: User inputs in any language are translated into English using a domain-adapted MT engine.
- Reasoning: The English prompt is sent to a powerful English-optimized LLM (like GPT-4) to perform the task.
- Post-processing: The English output is translated back into the user’s original language.
This mirrors the XLT logic but externalizes the translation layer. It allows companies to reuse their existing English-centric workflows for customer support or knowledge access in dozens of languages. However, remember the warning from the selective pre-translation research: translation quality is critical. Poor MT quality will degrade your results, regardless of how good the LLM is. Always use terminology databases or glossaries to preserve domain consistency during the back-translation phase.
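The three stages can be sketched as a single function that takes the translation and generation steps as plain callables. This is an illustrative outline, not the vendor's implementation: `translate_reason_translate` and the stub functions are hypothetical, and in practice the callables would wrap whatever MT engine and LLM client you already use.

```python
from typing import Callable

# (text, source_language, target_language) -> translated text
Translate = Callable[[str, str, str], str]
# English prompt -> English output
Generate = Callable[[str], str]


def translate_reason_translate(
    user_input: str,
    user_language: str,
    translate: Translate,
    generate: Generate,
    pivot_language: str = "en",
) -> str:
    """Translate into the pivot language, run the LLM, translate the result back."""
    if user_language == pivot_language:
        return generate(user_input)  # no translation layer needed
    pivot_prompt = translate(user_input, user_language, pivot_language)
    pivot_output = generate(pivot_prompt)
    return translate(pivot_output, pivot_language, user_language)


# Stubs standing in for a real MT engine and LLM, so the flow is visible.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_generate(prompt: str) -> str:
    return f"ANSWER({prompt})"


result = translate_reason_translate("Hola", "es", fake_translate, fake_generate)
print(result)  # [en->es] ANSWER([es->en] Hola)
```

Injecting the translation step as a parameter makes it easy to swap in a domain-adapted engine or add glossary enforcement without touching the reasoning stage.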
Practical Checklist for Better Non-English Outputs
So, what should you do today? Here is a quick checklist to guide your prompt engineering efforts:
- Is the task Extractive (QA/NER)? Keep the source text and examples in the original language. Only translate instructions if necessary. For NER in low-resource languages, ask for entity labels in English to leverage stronger label semantics.
- Is the task Generative (Summary/Creative)? Translate instructions, context, and examples to English. Generate the output in English, then translate it back using a high-quality MT tool.
- Do you need complex reasoning? Use the XLT template. Explicitly ask the model to reason in English before outputting the target language.
- Are you seeing hallucinations? Try multi-language prompting. Run the query in English and the target language, then compare the outputs and keep the answer that is most consistent across them.
By treating language not just as a carrier of content but as a structural component of your prompt, you can get noticeably better performance. You don't need to wait for the next generation of fully multilingual models. With these techniques, you can move much closer to English-level quality in many languages right now.
What is multilingual prompting?
Multilingual prompting is the practice of strategically choosing the languages and structure of instructions, context, and examples in prompts to Large Language Models (LLMs) to improve the quality, accuracy, and fluency of non-English outputs. It leverages the model's stronger English capabilities to bridge gaps in lower-resource languages.
Why do LLMs perform worse in non-English languages?
LLMs often underperform in non-English languages due to training data imbalances (less high-quality text available), tokenization inefficiencies for certain scripts, and poor representation of complex morphology and syntax in training corpora. This leads to lower accuracy and higher hallucination rates.
How does Cross-Lingual Thought Prompting (XLT) work?
XLT is a prompt template that instructs the LLM to act as an expert in the target language, internally translate the task to English, perform step-by-step reasoning in English, and then produce the final answer in the target language. This utilizes the model's superior English reasoning skills to boost performance in other languages.
Should I translate my entire prompt to English?
Not always. For extractive tasks like Question Answering, keeping the context and examples in the source language is better. For generative tasks like summarization, translating the entire prompt to English, generating in English, and then translating back often yields higher quality results, especially for low-resource languages.
Can multilingual prompting reduce hallucinations?
Yes. Research suggests that using multiple languages in prompts (e.g., querying in both English and the target language) and aggregating the results can reduce hallucination rates. Agreement across languages acts as a consistency check on the model's factual claims, leading to more reliable outputs.