P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, Fei Huang, Jingren Zhou
2024-11-14
Abstract: Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks, i.e., their ability to differentiate between models being evaluated. Leveraging this pipeline, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models, analyze dataset effectiveness, examine prompt impacts on model performances, and explore the relationship between multilingual performances and factors such as tasks, model sizes, and languages. These insights offer valuable guidance for future research. The dataset is available at https://huggingface.co/datasets/Qwen/P-MMEval.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses several key problems in evaluating the multilingual capabilities of current large language models (LLMs):

1. **Comprehensive evaluation of multilingual capabilities**:
   - Current evaluations of LLMs are often limited to basic natural language processing (NLP) tasks or to isolated capability-specific tasks, and lack a comprehensive multilingual, multitask benchmark.
   - Existing evaluation methods usually focus only on English data and cannot comprehensively evaluate model performance across multiple languages.
2. **Consistency of multilingual coverage**:
   - Existing multitask benchmarks cover different languages in different datasets, which makes cross-language evaluation unfair and inconsistent.
3. **Selection of effective datasets**:
   - Selecting datasets that can effectively distinguish the performance of different models is a challenge, and existing benchmarks often lack a systematic evaluation of dataset utility.

### Solutions

To address the above problems, the authors propose a comprehensive multilingual, multitask benchmark, P-MMEval. Specifically, this benchmark has the following characteristics:

1. **Dataset selection pipeline**:
   - Statistical analysis is used to screen a large pool of existing datasets for those that can effectively distinguish the performance of different models.
   - This method improves the objectivity and scientific rigor of dataset selection.
2. **Covering a wide range of tasks**:
   - P-MMEval includes three basic NLP datasets and five capability-specialized datasets, covering two major categories of tasks: generation and understanding.
   - Basic NLP tasks include natural language inference, commonsense reasoning, paraphrase identification, word-sense disambiguation, question answering, etc.
   - Capability-specialized tasks include code generation, knowledge understanding, mathematical reasoning, logical reasoning, and instruction following.
3. **Consistent language coverage**:
   - Ten languages (English, Chinese, Arabic, Spanish, Japanese, Korean, Thai, French, Portuguese, and Vietnamese) are covered uniformly, so that every selected dataset is available in all of them.
   - Missing language versions are supplemented by translation with expert review, to ensure translation quality and cultural adaptability.
4. **Extensive experimental verification**:
   - Extensive experiments are carried out on multiple representative multilingual model series to compare the performance of different models across tasks, languages, and prompts.
   - The effectiveness of datasets, the influence of prompts on model performance, and the relationships between multilingual performance and tasks, model size, and languages are analyzed.

### Main contributions

1. **Dataset selection pipeline**:
   - A method based on statistical analysis is proposed for selecting effective datasets, improving the objectivity and scientific rigor of evaluation.
2. **P-MMEval benchmark**:
   - A comprehensive multilingual, multitask benchmark is constructed, ensuring consistent language coverage and fair cross-language evaluation.
3. **Comprehensive experimental analysis**:
   - A comprehensive analysis of the multilingual capabilities of different models is provided, showing performance differences across prompts, models, languages, and tasks.
   - The utility of each dataset in distinguishing model performance is analyzed, providing valuable guidance for future research.

Through these methods and contributions, P-MMEval provides a more comprehensive, consistent, and effective benchmark for the evaluation of multilingual LLMs.
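
The summary says only that dataset selection relies on statistical analysis to find datasets that separate models; it does not name the test. As a rough illustration of the idea, the sketch below swaps in an exact paired sign-flip permutation test on per-language scores. The function names, the `alpha` threshold, and the `toy_scores` numbers are all hypothetical and are not taken from the paper.

```python
from itertools import combinations, product


def paired_permutation_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    # Under the null hypothesis each paired difference is equally likely to
    # have either sign, so enumerate all 2**n sign assignments.
    extreme = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # small tolerance for float noise
            extreme += 1
    return extreme / 2 ** n


def significant_pairs(model_scores, alpha=0.05):
    """Return the model pairs that a dataset separates at level alpha.

    model_scores maps a model name to its scores on one dataset,
    one score per language, in the same language order for every model.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(model_scores.items()), 2):
        if paired_permutation_pvalue(a, b) < alpha:
            pairs.append((name_a, name_b))
    return pairs


# Made-up accuracies (percent) for ten languages; illustration only.
toy_scores = {
    "small": [40.0, 42.0, 38.0, 41.0, 39.0, 37.0, 36.0, 43.0, 40.0, 38.0],
    "twin": [41.0, 41.0, 40.0, 39.0, 40.0, 36.0, 36.0, 43.0, 41.0, 37.0],
    "large": [60.0, 62.0, 58.0, 61.0, 59.0, 57.0, 56.0, 63.0, 60.0, 58.0],
}

if __name__ == "__main__":
    print(significant_pairs(toy_scores))
```

A dataset on which few or no model pairs come out as significantly different would be a candidate for exclusion under this kind of screening. With ten languages the exact enumeration covers only 1,024 sign assignments, so no sampling is needed.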
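
The consistent-coverage property described above can be phrased as a simple check: every dataset should be present in all ten languages, and each language version should contain the same samples. The sketch below assumes a hypothetical in-memory layout (dataset name → language code → list of sample ids) and ISO 639-1 codes for the ten languages. It does not reflect how the released P-MMEval files are actually organized.

```python
# ISO 639-1 codes for the ten languages listed in the summary
# (the keying scheme itself is an assumption made for this sketch).
LANGUAGES = ("en", "zh", "ar", "es", "ja", "ko", "th", "fr", "pt", "vi")


def coverage_report(benchmark, languages=LANGUAGES):
    """Report missing languages and parallelism for each dataset.

    benchmark maps dataset name -> language code -> list of sample ids.
    """
    report = {}
    for dataset, by_lang in benchmark.items():
        missing = [lang for lang in languages if lang not in by_lang]
        present = [by_lang[lang] for lang in languages if lang in by_lang]
        # Parallel here means every covered language has the same sample ids.
        id_sets = {frozenset(ids) for ids in present}
        report[dataset] = {"missing": missing, "parallel": len(id_sets) <= 1}
    return report


# Toy data: "toy_a" is complete and parallel; "toy_b" lacks Thai and its
# Vietnamese split contains a sample id the other languages do not have.
toy_benchmark = {
    "toy_a": {lang: ["q1", "q2"] for lang in LANGUAGES},
    "toy_b": {
        lang: (["q1", "q3"] if lang == "vi" else ["q1", "q2"])
        for lang in LANGUAGES
        if lang != "th"
    },
}

if __name__ == "__main__":
    for name, status in coverage_report(toy_benchmark).items():
        print(name, status)
```

Comparing id sets rather than counts catches the case where two language versions have the same number of samples but different items, which would break sample-level cross-language comparison.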