MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
2024-10-23
Abstract: Large language models (LLMs) are commonly used as evaluators in tasks (e.g., reward modeling, LLM-as-a-judge), where they act as proxies for human preferences or judgments. This leads to the need for meta-evaluation: evaluating the credibility of LLMs as evaluators. However, existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts. To address this, we introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories. MM-Eval evaluates various dimensions, including language-specific challenges like linguistics and language hallucinations. Evaluation results show that both proprietary and open-source language models have considerable room for improvement. Further analysis reveals a tendency for these models to assign middle-ground scores to low-resource languages. We publicly release our benchmark and code.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to evaluate the effectiveness and reliability of large language models (LLMs) as evaluators in multilingual settings. Existing benchmarks focus mainly on English, so little is known about how well LLMs evaluate in non-English settings. To fill this gap, the authors introduce MM-Eval, a multilingual meta-evaluation benchmark covering 18 languages that assesses LLM evaluators across different linguistic contexts.

### Main Issues:

1. **Limitations of Existing Benchmarks**: Existing meta-evaluation benchmarks primarily focus on English and do not comprehensively assess LLM performance in non-English settings.
2. **Evaluation Challenges in Multilingual Environments**: Languages differ in grammar, vocabulary, and cultural characteristics, all of which may affect an LLM's ability to evaluate.
3. **Performance in Low-Resource Languages**: It is unclear how LLMs perform as evaluators in low-resource languages and whether they show systematic biases there.

### Solutions:

- **MM-Eval Benchmark**: Covers 18 languages, including low-resource languages such as Swahili, Basque, and Galician. MM-Eval includes six subsets: chat, reasoning, safety, language hallucination, linguistics, and language resources.
- **Multidimensional Evaluation**: Assesses multiple aspects, including language-specific challenges such as linguistics and language hallucination.
- **Public Release**: The benchmark and code are publicly released for the research community to use and improve.

### Key Findings:

- **Overall Performance**: 12 LLMs (proprietary and open-source) achieve an average accuracy of 68.9% on MM-Eval, indicating significant room for improvement.
- **Performance in Low-Resource Languages**: In low-resource languages, LLMs tend to give lower scores to high-quality responses and higher scores to low-quality responses. Scores drift toward the middle, so the models fail to clearly distinguish good responses from bad ones.
- **Performance Differences in Specific Tasks**: Models vary significantly across tasks. For example, Self-Taught-Evaluator-Llama3.1-70B performs well on the language hallucination task but poorly on the chat and linguistics tasks.

These findings highlight the importance of constructing multilingual meta-evaluation benchmarks and point to the current shortcomings of LLMs as evaluators in multilingual settings.
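
To make the accuracy and score-separation findings concrete, here is a minimal Python sketch of how a meta-evaluation of this kind could be computed. It assumes each benchmark item pairs a preferred ("chosen") response with a dispreferred ("rejected") one and that the evaluator returns a numeric score per response. The summary does not spell out the data format, so `PairwiseItem`, `meta_evaluate`, the field names, and the toy scorer are all hypothetical and are not taken from the MM-Eval release.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class PairwiseItem(NamedTuple):
    """One hypothetical benchmark item (field names are assumptions)."""
    subset: str    # e.g. "chat" or "reasoning"
    language: str  # language code, e.g. "en" or "sw"
    prompt: str
    chosen: str    # response the gold label prefers
    rejected: str  # response the gold label disprefers


# An evaluator maps (prompt, response) to a score; higher means better.
Scorer = Callable[[str, str], float]


def meta_evaluate(
    items: Iterable[PairwiseItem], score: Scorer
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (accuracy per subset, mean chosen-minus-rejected gap per language)."""
    hits = defaultdict(list)     # subset -> list of 0/1 outcomes
    margins = defaultdict(list)  # language -> list of score gaps
    for item in items:
        gap = score(item.prompt, item.chosen) - score(item.prompt, item.rejected)
        # The evaluator is counted as correct only if it strictly prefers "chosen".
        hits[item.subset].append(1 if gap > 0 else 0)
        margins[item.language].append(gap)
    accuracy = {k: sum(v) / len(v) for k, v in hits.items()}
    mean_margin = {k: sum(v) / len(v) for k, v in margins.items()}
    return accuracy, mean_margin


def length_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a real judge: longer responses score higher."""
    return float(len(response))


toy_items = [
    PairwiseItem("chat", "en", "Say hi", "Hello there!", "No"),
    PairwiseItem("chat", "sw", "Say hi", "Habari", "Habari yako leo"),
    PairwiseItem("reasoning", "en", "2+2?", "The answer is 4", "5"),
]

accuracy, mean_margin = meta_evaluate(toy_items, length_scorer)
print(accuracy)
print(mean_margin)
```

A mean gap near zero for a language would correspond to the "middle-ground scores" behaviour described above, where the evaluator barely separates good responses from bad ones. In practice `length_scorer` would be replaced by a call to a reward model or an LLM judge.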