Abstract: Large language models (LLMs) are commonly used as evaluators in tasks such as reward modeling and LLM-as-a-judge, where they act as proxies for human preferences or judgments. This creates a need for meta-evaluation: evaluating the credibility of LLMs as evaluators. However, existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts. To address this, we introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories. MM-Eval evaluates various dimensions, including language-specific challenges such as linguistics and language hallucinations. Evaluation results show that both proprietary and open-source language models have considerable room for improvement. Further analysis reveals a tendency for these models to assign middle-ground scores to low-resource languages. We publicly release our benchmark and code.

What problem does this paper attempt to address?

The paper examines how effective and reliable large language models (LLMs) are as evaluators in multilingual settings. Existing meta-evaluation benchmarks focus mainly on English, so little is known about how LLM evaluators perform in non-English settings. To fill this gap, the authors introduce MM-Eval, a multilingual meta-evaluation benchmark covering 18 languages.

### Main Issues:

1. **Limitations of Existing Benchmarks**: Existing meta-evaluation benchmarks primarily focus on English and do not comprehensively assess LLM evaluators in non-English settings.
2. **Evaluation Challenges in Multilingual Environments**: Languages differ in grammar, vocabulary, and cultural characteristics, and these differences may affect how well an LLM evaluates responses.
3. **Performance in Low-Resource Languages**: It is unclear how LLM evaluators perform in low-resource languages and whether they show systematic biases there.

### Solutions:

- **MM-Eval Benchmark**: Covers 18 languages, including low-resource languages such as Swahili, Basque, and Galician. MM-Eval includes six subsets: chat, reasoning, safety, language hallucination, linguistics, and language resources.
- **Multidimensional Evaluation**: Assesses multiple aspects, including language-specific challenges such as linguistics and language hallucination.
- **Public Release**: The benchmark and code are publicly released for the research community to use and improve.

### Key Findings:

- **Overall Performance**: 12 LLMs (both proprietary and open-source) reach an average accuracy of 68.9% on MM-Eval, indicating significant room for improvement.
- **Performance in Low-Resource Languages**: In low-resource languages, LLMs tend to give lower scores to high-quality responses and higher scores to low-quality responses. Scores drift toward the middle of the scale, so good and bad responses are not clearly separated.
- **Performance Differences in Specific Tasks**: Models vary significantly across tasks. For example, Self-Taught-Evaluator-Llama3.1-70B performs well on the language hallucination task but poorly on the chat and linguistics tasks.

These findings highlight the importance of constructing multilingual meta-evaluation benchmarks and point out the current shortcomings of LLMs as evaluators in multilingual environments.
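The accuracy and "middle-ground score" findings above can be made concrete with a small sketch. It assumes each benchmark item pairs a preferred ("chosen") response with a dispreferred ("rejected") one and that the judge returns a numeric score for each; the summary does not spell out this format, so the record keys (`language`, `chosen_score`, `rejected_score`), the function name `summarize_by_language`, and the toy scores are all hypothetical illustrations, not the MM-Eval data or code.

```python
from collections import defaultdict


def summarize_by_language(records):
    """Compute per-language judge accuracy and mean score gap.

    `records` is an iterable of dicts with the hypothetical keys
    'language', 'chosen_score', and 'rejected_score'.
    A judgment counts as correct when the chosen response
    receives a strictly higher score than the rejected one.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["language"]].append(record)

    summary = {}
    for language, items in grouped.items():
        correct = sum(
            1 for r in items if r["chosen_score"] > r["rejected_score"]
        )
        gaps = [r["chosen_score"] - r["rejected_score"] for r in items]
        summary[language] = {
            "accuracy": correct / len(items),
            # A gap near zero means good and bad responses get similar scores.
            "mean_gap": sum(gaps) / len(gaps),
        }
    return summary


# Toy, made-up judge scores on a 1-10 scale (not real benchmark results).
toy_records = [
    {"language": "en", "chosen_score": 9, "rejected_score": 2},
    {"language": "en", "chosen_score": 8, "rejected_score": 3},
    {"language": "sw", "chosen_score": 6, "rejected_score": 5},
    {"language": "sw", "chosen_score": 5, "rejected_score": 6},
]

summary = summarize_by_language(toy_records)
for language, stats in summary.items():
    print(language, stats)
```

In this toy data the judge separates the English pairs by a wide margin, while the Swahili scores sit close together. A shrinking mean gap of this kind is one way the tendency toward middle-ground scores could show up alongside lower accuracy.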

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
