MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
2024-06-13
Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning), a benchmark for evaluating multimodal models on large-scale, multi-discipline tasks that require college-level domain knowledge and deliberate reasoning. Existing benchmarks mainly focus on everyday knowledge and common-sense understanding. MMMU instead consists of 11.5K carefully collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), 30 subjects, and 183 subfields.
The paper points out that although existing multimodal models perform well on tasks such as image question answering and visual reasoning, they remain limited on tasks that require expert-level domain knowledge and complex reasoning. MMMU therefore poses four challenges: comprehensiveness, highly heterogeneous image types, interleaved text and images, and expert-level perception and reasoning grounded in in-depth domain knowledge.
The researchers evaluate 14 open-source LMMs as well as the proprietary GPT-4V and Gemini on MMMU. The results show that even an advanced model such as GPT-4V achieves an accuracy of only 56%, leaving significant room for improvement. The study also analyzes error types and identifies perception errors, knowledge gaps, and reasoning flaws as the major issues.
The main contribution of MMMU is a tool for measuring progress toward multimodal models that approach the level of human experts, and thereby toward expert artificial general intelligence. MMMU is not a sufficient test of expert-level AGI on its own, but it underscores the need to improve model capabilities in both breadth and depth.
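The accuracy figures quoted above are the fraction of questions answered correctly. As a minimal sketch of how such a score could be broken down by discipline, the Python snippet below assumes multiple-choice answers given as option letters and uses hypothetical record fields (`discipline`, `answer`, `prediction`); it is an illustration, not the paper's evaluation code.

```python
from collections import defaultdict


def accuracy_by_discipline(records):
    """Return (per-discipline accuracy, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'discipline', 'answer', and 'prediction' (assumed field names,
    not taken from the benchmark's actual data format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        discipline = record["discipline"]
        total[discipline] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        predicted = record["prediction"].strip().upper()
        expected = record["answer"].strip().upper()
        if predicted == expected:
            correct[discipline] += 1
    per_discipline = {d: correct[d] / total[d] for d in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_discipline, overall


if __name__ == "__main__":
    # Toy records for illustration only; these are not benchmark data.
    toy_records = [
        {"discipline": "Science", "answer": "A", "prediction": "a"},
        {"discipline": "Science", "answer": "B", "prediction": "C"},
        {"discipline": "Business", "answer": "D", "prediction": "D "},
    ]
    per_discipline, overall = accuracy_by_discipline(toy_records)
    print(per_discipline, overall)
```

Reporting per-discipline scores alongside the overall number shows whether a model is weak across the board or only in particular fields, which matters for a benchmark whose stated challenge is breadth.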