CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin
2024-01-18
Abstract: As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of evaluating large language models (LLMs) in a Chinese context. Specifically, it introduces CMMLU (Chinese Massive Multitask Language Understanding), a comprehensive Chinese benchmark designed to evaluate LLM performance across a wide range of subjects.

#### Main Objectives:

1. **Filling the Evaluation Gap**: Most existing benchmarks are in English and cannot adequately assess the capabilities of LLMs in non-English contexts. CMMLU fills this gap with an evaluation tool tailored to the Chinese language and culture.
2. **Covering a Wide Range of Disciplines**: CMMLU spans natural sciences, social sciences, engineering, and humanities, so a model's capabilities can be examined across many fields.
3. **Assessing Actual Performance**: By evaluating 18 advanced multilingual and Chinese-oriented LLMs in detail, the paper shows how these models perform across subjects and settings and identifies their shortcomings.
4. **Exploring Improvement Directions**: Through experimental analysis of the factors that affect model performance, the paper proposes directions for improving LLMs.

### Specific Contributions:

1. **Dataset Design**: CMMLU includes 67 subject topics, each with at least 105 questions, covering knowledge from basic to advanced professional levels.
2. **Evaluation Method**: The benchmark uses multiple-choice questions with four options each, which makes scoring straightforward. Different evaluation strategies are used for different types of models (e.g., commercial models and open-source models).
3. **Result Analysis**: Most existing models struggle to reach an average accuracy of 50%, against a random baseline of 25%, which indicates significant room for improvement. Notably, GPT-4 performs best among all models, with an average accuracy of 71%.

Through the above work, the paper hopes to advance the development of Chinese language models and provide a valuable reference for future LLM research.
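
The four-option multiple-choice setup described above can be scored with a few lines of code. The following is a minimal sketch, not the paper's evaluation script: the record layout, the function names, and the toy subject names are all hypothetical, and averaging per subject (rather than over all questions) is an assumption made for illustration.

```python
from collections import defaultdict

# Four answer options per question, as in the multiple-choice format above.
CHOICES = ("A", "B", "C", "D")


def random_baseline(num_choices=len(CHOICES)):
    """Expected accuracy of guessing uniformly at random among the options."""
    return 1.0 / num_choices


def per_subject_accuracy(records):
    """Compute accuracy per subject.

    `records` is an iterable of (subject, gold_letter, predicted_letter)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, pred in records:
        total[subject] += 1
        if pred == gold:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(subject_acc):
    """Unweighted mean of the per-subject accuracies (an assumed convention)."""
    return sum(subject_acc.values()) / len(subject_acc)


# Toy, made-up predictions; these are not results from the paper.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("chinese_history", "D", "D"),
    ("chinese_history", "C", "C"),
]

acc = per_subject_accuracy(records)
avg = macro_average(acc)
print(f"per-subject: {acc}")
print(f"average: {avg:.2f}, random baseline: {random_baseline():.2f}")
```

With four options the random baseline is 1/4 = 25%, which is the reference point against which the reported model accuracies are read.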