Abstract: Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks, i.e., their ability to differentiate between models being evaluated. Leveraging this pipeline, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models, analyze dataset effectiveness, examine prompt impacts on model performances, and explore the relationship between multilingual performances and factors such as tasks, model sizes, and languages. These insights offer valuable guidance for future research. The dataset is available at https://huggingface.co/datasets/Qwen/P-MMEval.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve several key problems in evaluating the multilingual capabilities of current large language models (LLMs):

1. **Comprehensive evaluation of multilingual capabilities**:
   - Current evaluations of LLMs are often limited to basic natural language processing (NLP) tasks or to isolated capability-specific tasks, and lack a comprehensive multilingual, multitask benchmark.
   - Existing evaluation methods usually focus only on English data and cannot comprehensively evaluate model performance across multiple languages.
2. **Consistency of multilingual coverage**:
   - Existing multitask benchmarks cover different languages on different datasets, which makes cross-language evaluation unfair and inconsistent.
3. **Selection of effective datasets**:
   - Selecting datasets that can effectively distinguish the performance of different models is a challenge, and existing benchmarks often lack a systematic evaluation of dataset utility.

### Solutions

To address the above problems, the authors propose a comprehensive multilingual and multitask benchmark, P-MMEval. Specifically, this benchmark has the following characteristics:

1. **Dataset selection pipeline**:
   - Statistical analysis is used to screen a large number of existing datasets for those that can effectively distinguish the performance of different models.
   - This method improves the objectivity and rigor of dataset selection.
2. **Covering a wide range of tasks**:
   - P-MMEval includes three basic NLP datasets and five advanced-ability-specific datasets, covering two major categories of tasks: generation and understanding.
   - Basic NLP tasks include natural language inference, common-sense reasoning, synonymous sentence recognition, word-sense disambiguation, question answering, etc.
   - Advanced-ability-specific tasks include code generation, knowledge understanding, mathematical reasoning, logical reasoning, and instruction following.
3. **Consistent language coverage**:
   - Ten languages are unified (English, Chinese, Arabic, Spanish, Japanese, Korean, Thai, French, Portuguese, and Vietnamese) so that every selected dataset covers all of them.
   - Missing language versions are supplemented through translation with expert review to ensure translation quality and cultural adaptability.
4. **Extensive experimental verification**:
   - Extensive experiments are carried out on multiple representative multilingual model series to compare the performance of different models under various tasks, languages, and prompts.
   - The effectiveness of datasets, the influence of prompts on model performance, and the relationships between multilingual performance and tasks, model size, and languages are analyzed.

### Main contributions

1. **Dataset selection pipeline**:
   - A method based on statistical analysis is proposed for selecting effective datasets, improving the objectivity and rigor of evaluation.
2. **P-MMEval benchmark**:
   - A comprehensive multilingual and multitask benchmark is constructed, ensuring consistent language coverage and fair cross-language evaluation.
3. **Comprehensive experimental analysis**:
   - A comprehensive analysis of the multilingual capabilities of different models is provided, showing performance differences under different prompts, models, languages, and tasks.
   - The utility of each dataset in distinguishing model performance is analyzed, providing valuable guidance for future research.

Through these methods and contributions, P-MMEval provides a more comprehensive, consistent, and effective benchmark for the evaluation of multilingual LLMs.
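The summary above does not say which statistical test the selection pipeline uses, so the following is only a minimal sketch of the general idea of dataset utility: checking whether a dataset separates the models being evaluated. It assumes a paired sign-flip permutation test on per-sample scores, and the function names `paired_permutation_pvalue` and `separated_pair_fraction`, the threshold `alpha`, and the toy scores are all hypothetical, not taken from the paper.

```python
import random
from itertools import combinations


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=5000, seed=0):
    """Two-sided sign-flip permutation test on paired per-sample scores.

    Assumption: both models were scored on the same samples, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("paired scores must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def separated_pair_fraction(per_model_scores, alpha=0.05):
    """Fraction of model pairs whose scores differ significantly on one dataset.

    per_model_scores maps a model name to its list of per-sample scores.
    """
    pairs = list(combinations(sorted(per_model_scores), 2))
    if not pairs:
        raise ValueError("need at least two models")
    separated = sum(
        1
        for a, b in pairs
        if paired_permutation_pvalue(per_model_scores[a], per_model_scores[b]) < alpha
    )
    return separated / len(pairs)


if __name__ == "__main__":
    # Toy per-sample correctness (1 = correct, 0 = wrong); not real results.
    toy_scores = {
        "model_a": [1] * 40,
        "model_b": [0] * 40,
        "model_c": [1] * 40,
    }
    print(separated_pair_fraction(toy_scores))
```

Under this assumed criterion, a dataset on which few model pairs are separated would be a candidate for exclusion; how many separated pairs count as enough is a design choice the summary leaves open.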

P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
