Evaluating Multilingual Long-Context Models for Retrieval and Reasoning

Ameeta Agrawal, Andy Dang, Sina Bagheri Nezhad, Rhitabrat Pokharel, Russell Scheinberg
2024-10-13
Abstract: Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset, mLongRR, to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English to around 36% in Somali. With three target sentences, this accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is the evaluation of multilingual long-context models in retrieval and reasoning tasks, particularly in handling multiple target sentences and languages with different resource levels. Specifically, the paper focuses on the following points:

1. **Evaluation of long-context models in a multilingual environment**:
   - Current evaluations of long-context models are primarily focused on English texts, lacking comprehensive assessments for other languages.
   - The paper evaluates multiple multilingual long-context models in five languages (English, Vietnamese, Indonesian, Swahili, and Somali) by creating a new dataset (mLongRR).
2. **Evaluation of different task complexities**:
   - The paper not only evaluates the retrieval task of single target sentences but also introduces reasoning tasks with multiple target sentences to test the models' ability to handle more complex tasks.
3. **Performance differences in languages with different resource levels**:
   - By selecting languages with different resource levels (from high-resource to extremely low-resource), the paper aims to explore how the level of language resources affects the models' performance in long-context tasks.
4. **Systematic comparison of model performance**:
   - The paper systematically evaluates the performance of six different long-context models (GPT-4, Gemini-1.5, Claude-3, YaRN-7b, Llama-3, and GPT-4o) across different languages and task complexities.
5. **Challenges and findings**:
   - The paper reveals the challenges current long-context models face when dealing with longer contexts, increasing the number of target sentences, and low-resource languages.
   - The results show that even in simple "needle in a haystack" tasks, current models exhibit limitations in multilingual environments.

Through these evaluations, the paper hopes to provide valuable insights for further research and development of multilingual long-context models.
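To make the "needle in a haystack" setup with multiple target sentences more concrete, here is a minimal Python sketch. It is not the authors' code or the mLongRR construction procedure: the function names `insert_needles` and `accuracy`, the depth-based placement, and the substring-match scoring are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def insert_needles(haystack_sentences: Sequence[str],
                   needles: Sequence[str],
                   depths: Sequence[float]) -> List[str]:
    """Return a copy of the haystack with each needle placed at a relative depth.

    Hypothetical helper: depths are fractions in [0, 1], where 0.0 is the
    start of the context and 1.0 is the end.
    """
    if len(needles) != len(depths):
        raise ValueError("need exactly one depth per needle")
    if any(not 0.0 <= d <= 1.0 for d in depths):
        raise ValueError("depths must lie in [0, 1]")

    n = len(haystack_sentences)
    # Map each depth to an index in the ORIGINAL haystack, then insert from the
    # deepest position first so earlier insertions do not shift later ones.
    placements = sorted(
        ((round(d * n), needle) for needle, d in zip(needles, depths)),
        key=lambda p: p[0],
        reverse=True,
    )
    context = list(haystack_sentences)
    for index, needle in placements:
        context.insert(index, needle)
    return context


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that contain the expected answer string.

    Assumed scoring rule (case-insensitive substring match), for illustration.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    if not answers:
        return 0.0
    hits = sum(a.lower() in p.lower() for p, a in zip(predictions, answers))
    return hits / len(answers)


if __name__ == "__main__":
    haystack = ["s0", "s1", "s2", "s3"]
    context = insert_needles(haystack, ["N1", "N2"], [0.0, 0.5])
    print(" ".join(context))
    print(accuracy(["The number is 42", "unknown"], ["42", "17"]))
```

In a real evaluation the haystack would be long text in the target language, the joined context would be sent to a model together with a question about the needles, and the model's replies would be passed to a scoring function such as `accuracy`. Varying the number of needles, their depths, and the context length gives the kinds of axes the summary describes.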