MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
2024-07-30
Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses how to comprehensively evaluate the ability of multimodal large language models (MLLMs) to understand real-world dynamics. To this end, it proposes a new benchmark, MMWorld, designed to assess these models' "world modeling" capabilities through video understanding. The main features of the MMWorld benchmark are:

1. **Interdisciplinary Coverage**: It spans seven major disciplines: Arts & Sports, Business, Science, Health & Medicine, Embodied Tasks, Technology & Engineering, and Games. These are further subdivided into 69 sub-disciplines, so that comprehensive understanding often requires domain-specific expertise.
2. **Multifaceted Reasoning**: Beyond perceptual understanding, it covers several types of reasoning, such as explaining phenomena in videos, counterfactual thinking (hypothetical reasoning), predicting future events, and solving problems using domain-specific knowledge.
3. **Comprehensive Evaluation Method**: It includes a human-annotated dataset for evaluating a model's understanding of entire videos and a synthetic dataset for analyzing a model's performance within a single modality (visual or audio).

Evaluating existing MLLMs (2 proprietary and 10 open-source), the paper finds that although these models do well in certain aspects, they still face significant challenges overall. For instance, even the best-performing model, GPT-4V, achieves an accuracy of only 52.3% on MMWorld, indicating substantial room for improvement. The paper also compares the performance of MLLMs with that of non-expert humans on questions of varying difficulty, revealing that the two differ in their cognitive and reasoning skill sets.
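To make the accuracy figures above concrete, the following is a minimal sketch of how per-discipline and overall accuracy could be computed for a question-answering benchmark of this kind. It is not the authors' evaluation code: the record keys (`discipline`, `prediction`, `answer`), the exact-match comparison, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_discipline(records):
    """Return (per-discipline accuracy dict, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'discipline', 'prediction', and 'answer'. A prediction counts as
    correct when it matches the answer after trimming whitespace and
    lowercasing (an assumed scoring rule, not the paper's).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["discipline"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["discipline"]] += 1

    per_discipline = {d: correct[d] / total[d] for d in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_discipline, overall


# Toy records for illustration only; these are not MMWorld data.
records = [
    {"discipline": "Science", "prediction": "A", "answer": "a"},
    {"discipline": "Science", "prediction": "B", "answer": "c"},
    {"discipline": "Games", "prediction": "d", "answer": "D"},
    {"discipline": "Business", "prediction": "a", "answer": "b"},
]

per_discipline, overall = accuracy_by_discipline(records)
print(per_discipline)
print(overall)
```

Reporting accuracy per discipline alongside the overall figure is one way to see where a model's domain knowledge falls short, which a single aggregate number hides.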
In summary, MMWorld not only fills the gap in current evaluation frameworks regarding interdisciplinary and multifaceted reasoning capabilities but also provides a powerful tool for advancing MLLMs towards a more comprehensive understanding of the world.