MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

2024-06-13

Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper proposes MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning), a new benchmark for evaluating multimodal models on large-scale, multi-discipline tasks that require college-level domain knowledge and deliberate reasoning. Existing benchmarks mainly focus on everyday knowledge and common-sense understanding, whereas MMMU consists of 11.5K carefully collected multimodal questions from college exams, quizzes, and textbooks. The questions cover six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), spanning 30 subjects and 183 subfields.

The paper points out that although existing multimodal models perform well on tasks such as image question answering and visual reasoning, they remain limited on tasks that require expert-level domain knowledge and complex reasoning. MMMU therefore poses four challenges: comprehensiveness, highly heterogeneous image types, interleaved text and images, and expert-level perception and reasoning grounded in deep domain knowledge.

The researchers evaluate 28 open-source and proprietary models (including GPT-4V and Gemini) on MMMU. Even a state-of-the-art model such as GPT-4V achieves only 56% accuracy, indicating significant room for improvement. The study also analyzes error types and identifies perception errors, knowledge gaps, and reasoning flaws as the major issues.

The main contribution of MMMU is to push multimodal models toward artificial general intelligence at a level closer to human experts and to provide a tool for measuring progress toward that goal. Although it is not a sufficient test of expert-level AGI, it underscores the need to improve both the breadth and the depth of model capabilities.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MULTI: Multimodal Understanding Leaderboard with Text and Images

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

CMMLU: Measuring massive multitask language understanding in Chinese

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria