Abstract: Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset -- mLongRR -- to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English down to around 36% in Somali. With three target sentences, accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face when processing longer contexts, a larger number of target sentences, or languages of lower resource levels.

What problem does this paper attempt to address?

The problem this paper attempts to address is the evaluation of multilingual long-context models in retrieval and reasoning tasks, particularly in handling multiple target sentences and languages with different resource levels. Specifically, the paper focuses on the following points:

1. **Evaluation of long-context models in a multilingual environment**:
   - Current evaluations of long-context models are primarily focused on English texts, lacking comprehensive assessments for other languages.
   - The paper evaluates multiple multilingual long-context models in five languages (English, Vietnamese, Indonesian, Swahili, and Somali) by creating a new dataset (mLongRR).
2. **Evaluation of different task complexities**:
   - The paper not only evaluates the retrieval task of single target sentences but also introduces reasoning tasks with multiple target sentences to test the models' ability to handle more complex tasks.
3. **Performance differences in languages with different resource levels**:
   - By selecting languages with different resource levels (from high-resource to extremely low-resource), the paper aims to explore how the level of language resources affects the models' performance in long-context tasks.
4. **Systematic comparison of model performance**:
   - The paper systematically evaluates the performance of six different long-context models (GPT-4, Gemini-1.5, Claude-3, YaRN-7b, Llama-3, and GPT-4o) across different languages and task complexities.
5. **Challenges and findings**:
   - The paper reveals the challenges current long-context models face when dealing with longer contexts, increasing the number of target sentences, and low-resource languages.
   - The results show that even in simple "needle in a haystack" tasks, current models exhibit limitations in multilingual environments.

Through these evaluations, the paper hopes to provide valuable insights for further research and development of multilingual long-context models.
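
The "needle in a haystack" setup mentioned above, in which one or more target sentences are hidden in a long context and the model's answers are scored for accuracy, can be illustrated with a minimal Python sketch. This is not the paper's actual pipeline: how mLongRR is constructed and scored is not described here, so the function names `insert_needles` and `accuracy`, the random placement strategy, and the case-insensitive exact-match scoring are all assumptions made for illustration.

```python
import random


def insert_needles(haystack_sentences, needles, seed=0):
    """Hide each needle at a distinct random position in the haystack.

    Hypothetical helper: random placement is an assumption, not
    necessarily how the paper positions its target sentences.
    """
    if len(needles) > len(haystack_sentences) + 1:
        raise ValueError("not enough insertion points for all needles")
    rng = random.Random(seed)
    # Pick distinct insertion points, indexed against the original haystack.
    positions = rng.sample(range(len(haystack_sentences) + 1), len(needles))
    result = list(haystack_sentences)
    # Insert from the highest position down so earlier indices stay valid.
    for pos, needle in sorted(zip(positions, needles), reverse=True):
        result.insert(pos, needle)
    return result


def accuracy(predictions, expected):
    """Fraction of predictions matching the expected answer.

    Assumes case-insensitive exact match after trimming whitespace;
    the paper's scoring rule may differ.
    """
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have equal length")
    if not expected:
        return 0.0
    correct = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predictions, expected)
    )
    return correct / len(expected)
```

In such a setup, the single-target and three-target conditions would differ only in the length of the `needles` list passed to `insert_needles`, while the haystack language and length are varied separately.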

Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
