What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the lack of benchmarks for evaluating the advanced knowledge and reasoning abilities of large multimodal models (LMMs) in non-English settings, specifically Chinese. It proposes a new benchmark, CMMMU, to evaluate LMMs on complex perception and reasoning tasks that require university-level subject knowledge.

### Main Contributions

1. **Proposing the CMMMU Benchmark**:
   - CMMMU is the first large-scale, multidisciplinary, multimodal understanding benchmark in Chinese.
   - The benchmark includes 12,000 multimodal questions manually collected from university exams, quizzes, and textbooks. It covers six core subjects (Arts and Design, Business, Science, Health and Medicine, Humanities and Social Sciences, Technology and Engineering), spanning 30 sub-disciplines and 39 highly heterogeneous image types.
2. **Revealing the Performance Deficiency of Existing LMMs**:
   - Even the most advanced closed-source model, GPT-4V, reaches an accuracy of only 42% on CMMMU, indicating that existing models still have significant room for improvement in complex reasoning and understanding in Chinese.
3. **Evaluating the Gap Between Open-Source and Closed-Source LMMs**:
   - In Chinese, the gap between open-source bilingual LMMs and the most powerful closed-source LMMs is significantly smaller than in English. For example, the most powerful open-source model, Yi-VL-34B, reaches an accuracy of 36% on CMMMU, a gap of only about 7 percentage points compared to GPT-4V.

### Methodology

1. **Data Collection and Annotation**:
   - The quality and diversity of the data are ensured through a three-stage collection and annotation process.
   - Stage 1: Collecting sources that meet licensing requirements.
   - Stage 2: Further annotation of the data by crowdsourced annotators.
   - Stage 3: Supplementing subjects that lack questions to balance the dataset.
2. **Data Quality Control**:
   - Each question is manually verified by at least one author.
   - Questions that can be correctly solved through OCR are filtered out to avoid data contamination.
3. **Model Evaluation**:
   - Models are evaluated in a zero-shot setting, including both open-source and closed-source LMMs and LLMs.
   - Evaluation metrics include micro-average accuracy and a performance breakdown by question type and difficulty level.

### Experimental Results

- **Overall Performance**:
  - GPT-4V has an accuracy of 42.5% on CMMMU, while the most advanced open-source model, Yi-VL-34B, has an accuracy of 36.2%.
  - Open-source models show varying performance on multiple-choice, fill-in-the-blank, and true/false questions, with a larger gap compared to GPT-4V on medium- and high-difficulty questions.
- **Error Analysis**:
  - An analysis of 150 incorrect answers from GPT-4V identifies the main error types as perception errors, lack of knowledge, reasoning errors, refusal to answer, and annotation errors.
  - Perception errors account for 26%, mainly due to the model's bias in understanding and interpreting arrows, symbols, and sequences in images.

### Conclusion

The CMMMU benchmark provides an important tool for evaluating and improving the complex reasoning and understanding abilities of large multimodal models in Chinese. Although existing models perform well on certain tasks, there is still significant room for improvement, especially in computation and reasoning under complex conditions. The benchmark is expected to promote further development of Chinese multimodal AGI within the open-source community.
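The micro-average accuracy and per-type breakdown mentioned under Model Evaluation can be illustrated with a minimal sketch. This is not the benchmark's own evaluation code: the function names and the toy records below are hypothetical, and the only assumption is that each scored answer carries a question-type label and a correct/incorrect flag. Micro-averaging pools all questions and divides correct answers by the total, so larger groups weigh more; averaging the per-type accuracies (a macro average) is shown only for contrast.

```python
from collections import defaultdict


def micro_average_accuracy(records):
    """Pool every (group, is_correct) pair and return correct / total."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(1 for _, ok in records if ok) / len(records)


def accuracy_by_group(records):
    """Return accuracy computed separately for each group label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy data for illustration only; these are not CMMMU results.
records = [
    ("multiple-choice", True),
    ("multiple-choice", True),
    ("multiple-choice", False),
    ("multiple-choice", True),
    ("fill-in-the-blank", False),
    ("true/false", True),
]

micro = micro_average_accuracy(records)          # 4 correct out of 6 questions
per_type = accuracy_by_group(records)            # one accuracy per question type
macro = sum(per_type.values()) / len(per_type)   # unweighted mean over types

print(f"micro-average accuracy: {micro:.3f}")
print(f"macro-average accuracy: {macro:.3f}")
for question_type, acc in per_type.items():
    print(f"  {question_type}: {acc:.2f}")
```

In the toy data the two averages differ because multiple-choice questions dominate the pool, which is why a breakdown by question type or difficulty is a useful complement to a single micro-averaged number.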

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

CMMLU: Measuring massive multitask language understanding in Chinese

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

MMBench: Is Your Multi-modal Model an All-around Player?

MULTI: Multimodal Understanding Leaderboard with Text and Images

A Survey on Benchmarks of Multimodal Large Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models