Abstract: As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the evaluation of large language models (LLMs) in a Chinese context. Specifically, it introduces CMMLU (Chinese Massive Multitask Language Understanding), a comprehensive Chinese benchmark suite designed to evaluate the performance of LLMs across a wide range of subjects.

#### Main Objectives:

1. **Filling the Evaluation Gap**: Most existing benchmarks are in English and cannot adequately assess the capabilities of LLMs in non-English contexts. CMMLU fills this gap with an evaluation tool tailored to the Chinese language and culture.
2. **Covering a Wide Range of Disciplines**: CMMLU spans natural sciences, social sciences, engineering, and humanities, so that a model's capabilities can be examined across many fields rather than in a single domain.
3. **Assessing Actual Performance**: By evaluating 18 advanced multilingual and Chinese-oriented LLMs in detail, the paper reports how these models perform in different disciplines and identifies their shortcomings.
4. **Exploring Improvement Directions**: Through experimental analysis of the factors that affect model performance, the paper proposes directions for improving LLMs.

### Specific Contributions:

1. **Dataset Design**: CMMLU includes 67 subject topics, each with at least 105 questions, covering knowledge from basic to advanced professional levels.
2. **Evaluation Method**: The benchmark uses multiple-choice questions with four options each, which makes scoring straightforward. Different evaluation strategies are used for different types of models (e.g., commercial models and open-source models).
3. **Result Analysis**: Most existing models struggle to reach an average accuracy of 50%, against a random baseline of 25%, indicating significant room for improvement. Notably, GPT-4 performs best among all models, with an average accuracy of 71%.

Through this work, the paper hopes to advance the development of Chinese language models and to provide a useful reference for future LLM research.
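The scoring behind these numbers can be illustrated with a minimal sketch of four-option multiple-choice evaluation. It is not the paper's evaluation code: the function names, the answer-letter format, and the choice to average over subjects with equal weight are all assumptions made for illustration.

```python
import random

# Assumption: each question has exactly four options labelled A-D,
# matching the four-option format described in the summary.
CHOICES = ["A", "B", "C", "D"]


def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter equals the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def subject_average(per_subject_accuracy):
    """Unweighted mean over subjects (hypothetical aggregation choice)."""
    values = list(per_subject_accuracy.values())
    return sum(values) / len(values)


def random_baseline(answers, seed=0):
    """Accuracy of uniform random guessing; its expected value is 1 / len(CHOICES)."""
    rng = random.Random(seed)
    guesses = [rng.choice(CHOICES) for _ in answers]
    return accuracy(guesses, answers)


if __name__ == "__main__":
    gold = ["A", "B", "C", "A"]
    predicted = ["A", "B", "C", "D"]
    print(accuracy(predicted, gold))   # three of four correct
    print(1 / len(CHOICES))            # expected random-guess accuracy
```

With four options, uniform guessing is correct one time in four on average, which is where the 25% random baseline comes from; a model's reported accuracy is only meaningful relative to that floor.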

CMMLU: Measuring massive multitask language understanding in Chinese
