CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Wenhu Chen, Jie Fu
2024-03-18
Abstract: As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the lack of benchmarks for evaluating the advanced knowledge and reasoning abilities of large multimodal models (LMMs) in non-English settings such as Chinese. Specifically, it proposes a new benchmark, CMMMU, to evaluate LMMs on complex perception and reasoning tasks that require college-level subject knowledge.

### Main Contributions

1. **Proposing the CMMMU Benchmark**:
   - CMMMU is the first large-scale, multidisciplinary, multimodal understanding benchmark in Chinese.
   - The benchmark includes 12,000 multimodal questions manually collected from university exams, quizzes, and textbooks. They cover six core disciplines (Arts and Design, Business, Science, Health and Medicine, Humanities and Social Sciences, Technology and Engineering), spanning 30 sub-disciplines and 39 highly heterogeneous image types.
2. **Revealing the Performance Deficiency of Existing LMMs**:
   - Even the most advanced closed-source model, GPT-4V, reaches an accuracy of only 42% on CMMMU, indicating that existing models still have significant room for improvement in complex reasoning and understanding in Chinese.
3. **Evaluating the Gap Between Open-Source and Closed-Source LMMs**:
   - In Chinese, the gap between open-source bilingual LMMs and the strongest closed-source LMMs is significantly smaller than in English. For example, the strongest open-source model, Yi-VL-34B, reaches an accuracy of 36% on CMMMU, reported as only a 7% gap relative to GPT-4V.

### Methodology

1. **Data Collection and Annotation**:
   - Data quality and diversity are ensured through a three-stage collection and annotation process.
   - Stage 1: Collecting sources that meet licensing requirements.
   - Stage 2: Further annotation of the data by crowdsourced annotators.
   - Stage 3: Supplementing subjects that lack questions in order to balance the dataset.
2. **Data Quality Control**:
   - Each question is manually verified by at least one author.
   - Questions that can be solved correctly through OCR are filtered out to avoid data contamination.
3. **Model Evaluation**:
   - A range of models, including open-source and closed-source LMMs as well as LLMs, is evaluated in a zero-shot setting.
   - The evaluation metrics are micro-average accuracy and performance breakdowns by question type and difficulty level.

### Experimental Results

- **Overall Performance**:
  - GPT-4V reaches an accuracy of 42.5% on CMMMU, while the most advanced open-source model, Yi-VL-34B, reaches 36.2%.
  - Open-source models show varying performance on multiple-choice, fill-in-the-blank, and true/false questions, and their gap to GPT-4V is larger on medium- and high-difficulty questions.
- **Error Analysis**:
  - An analysis of 150 incorrect answers from GPT-4V identifies five main error types: perception errors, lack of knowledge, reasoning errors, refusal to answer, and annotation errors.
  - Perception errors account for 26% of these cases, mainly because the model is biased in how it understands and interprets arrows, symbols, and sequences in images.

### Conclusion

The CMMMU benchmark provides an important tool for evaluating and improving the complex reasoning and understanding abilities of large multimodal models in Chinese. Although existing models perform well on certain tasks, there is still significant room for improvement, especially in computation and reasoning under complex conditions. The benchmark is expected to promote further development of Chinese multimodal AGI within the open-source community.
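
### Illustration: Micro-Average Accuracy

The summary names micro-average accuracy as the headline metric but does not spell it out. The sketch below shows the usual definition (total correct answers divided by total questions, pooled across all disciplines) next to a macro average (mean of per-discipline accuracies) for contrast. The function names, the record layout, and the toy data are assumptions made for illustration only; none of them come from the paper or its evaluation code.

```python
from collections import defaultdict


def micro_average_accuracy(records):
    """Pool every question together: correct answers / total questions."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(1 for _, is_correct in records if is_correct) / len(records)


def per_group_accuracy(records):
    """Accuracy computed separately for each group (e.g. discipline)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


def macro_average_accuracy(records):
    """Unweighted mean of the per-group accuracies."""
    scores = per_group_accuracy(records)
    if not scores:
        raise ValueError("no records to score")
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: (discipline, model answered correctly?)
toy_records = [
    ("Science", True),
    ("Science", True),
    ("Science", True),
    ("Science", False),
    ("Business", True),
    ("Business", False),
]

print(f"micro: {micro_average_accuracy(toy_records):.3f}")  # micro: 0.667
print(f"macro: {macro_average_accuracy(toy_records):.3f}")  # macro: 0.625
```

With micro averaging, disciplines that contribute more questions weigh more heavily in the final score, which is why the toy example gives 4/6 rather than the mean of 0.75 and 0.5. The same `per_group_accuracy` helper could be keyed on question type or difficulty level to produce the kind of breakdown the summary describes.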