MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

2024-10-08

Abstract: English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
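The abstract reports a Pearson correlation between MEXA scores and downstream task results. As a minimal sketch of what such a check involves, the snippet below correlates per-language alignment scores with per-language task accuracies. The function name `pearson` and the numbers are hypothetical placeholders for illustration only; they are not values from the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y).
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical toy values, one entry per language (NOT results from the paper).
alignment_scores = [0.2, 0.4, 0.6, 0.8]
task_accuracies = [0.3, 0.4, 0.5, 0.6]

r = pearson(alignment_scores, task_accuracies)
print(f"Pearson r = {r:.3f}")
```

A high correlation on real data would indicate that the alignment score ranks languages much as the downstream task does, which is the sense in which the abstract calls the alignment an estimate of model performance.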

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to evaluate English-centric large language models (LLMs) in multilingual settings. Most existing benchmarks for LLMs focus on English tasks or cover only a few languages, so the performance of these models in many other languages is poorly understood. To close this gap, the paper proposes MEXA, a method that evaluates the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences. MEXA exploits the fact that these models use English as a kind of "pivot language" in their intermediate layers: it computes the alignment between English and each non-English language and uses that alignment to estimate the model's performance in that language. The method gives a clearer picture of the multilingual potential of these models and of their internal workings.
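The alignment step described above can be sketched as follows. This is an illustrative assumption about one way to score alignment, not a statement of the authors' exact procedure: a parallel pair counts as aligned when its cosine similarity is strictly higher than that of every non-parallel pairing in the same row and column of the similarity matrix. The function name `alignment_score` is hypothetical, and the embeddings are assumed to have been extracted already from one layer of the model.

```python
import numpy as np


def alignment_score(en_emb, xx_emb):
    """Fraction of parallel pairs whose similarity beats all non-parallel pairs.

    en_emb, xx_emb: arrays of shape (n, d) with n >= 2; row i of each holds the
    embedding of the i-th sentence in English and in the other language.
    (Assumed scoring rule for illustration.)
    """
    en = np.asarray(en_emb, dtype=float)
    xx = np.asarray(xx_emb, dtype=float)
    # Normalise rows so the dot product equals cosine similarity.
    en = en / np.linalg.norm(en, axis=1, keepdims=True)
    xx = xx / np.linalg.norm(xx, axis=1, keepdims=True)
    sim = en @ xx.T                      # sim[i, j] = cos(en_i, xx_j)
    diag = np.diag(sim).copy()           # similarities of the true parallel pairs
    off = sim.copy()
    np.fill_diagonal(off, -np.inf)       # mask the parallel pairs
    row_best = off.max(axis=1)           # best wrong match for each English sentence
    col_best = off.max(axis=0)           # best wrong match for each non-English sentence
    hits = (diag > row_best) & (diag > col_best)
    return float(hits.mean())


# Toy check with one-hot "embeddings": perfectly aligned, then two rows swapped.
en = np.eye(4)
print(alignment_score(en, np.eye(4)))                # all four pairs aligned
print(alignment_score(en, np.eye(4)[[1, 0, 2, 3]]))  # only two pairs aligned
```

In practice the score would be computed per layer and then combined across layers (for example by taking the mean or the maximum); which pooling to use is a design choice not specified in the text above.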
