Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models": interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models having different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

What problem does this paper attempt to address?

The paper addresses the problem of how to comprehensively evaluate the ability of multimodal large language models (MLLMs) to understand real-world dynamics. To this end, it proposes a new benchmark, MMWorld, designed to assess these models' "world modeling" capabilities in video understanding. The main features of the MMWorld benchmark are:

1. **Interdisciplinary Coverage**: It spans seven major disciplines: Arts & Sports, Business, Science, Health & Medicine, Embodied Tasks, Technology & Engineering, and Games. These are further subdivided into 69 sub-disciplines, so that comprehensive understanding often requires domain-specific expertise.
2. **Multifaceted Reasoning**: Beyond perceptual understanding, it covers several types of reasoning, such as explaining phenomena in videos, counterfactual thinking (hypothetical reasoning), predicting future events, and solving problems using domain-specific knowledge.
3. **Comprehensive Evaluation Method**: It includes a human-annotated dataset for evaluating a model's understanding of entire videos and a synthetic dataset for analyzing a model's performance within a single modality (visual or audio).

Evaluating existing MLLMs, the paper finds that although these models do well in certain aspects, they still face significant challenges overall. For instance, even the best-performing model, GPT-4V, achieves an accuracy of only 52.3% on MMWorld, indicating substantial room for improvement. The paper also compares the performance of MLLMs with that of non-expert humans on questions of varying difficulty, revealing differences in cognitive and reasoning abilities between the two.
In summary, MMWorld not only fills the gap in current evaluation frameworks regarding interdisciplinary and multifaceted reasoning capabilities but also provides a powerful tool for advancing MLLMs towards a more comprehensive understanding of the world.
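
The accuracy figures quoted above are fractions of correctly answered questions. As a rough illustration only, the sketch below computes per-discipline and overall accuracy over a list of answered questions. The record layout (`discipline`, `prediction`, `answer`) and the function name `accuracy_by_discipline` are assumptions made for this example, not MMWorld's actual data format or evaluation code, and the toy records are made up.

```python
from collections import defaultdict


def accuracy_by_discipline(records):
    """Return (per-discipline accuracy dict, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'discipline', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["discipline"]] += 1
        # Exact-match scoring; a real benchmark may score answers differently.
        if r["prediction"] == r["answer"]:
            correct[r["discipline"]] += 1
    per_discipline = {d: correct[d] / total[d] for d in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_discipline, overall


# Toy, made-up records for illustration only.
records = [
    {"discipline": "Science", "prediction": "a", "answer": "a"},
    {"discipline": "Science", "prediction": "b", "answer": "c"},
    {"discipline": "Games", "prediction": "d", "answer": "d"},
]
per_discipline, overall = accuracy_by_discipline(records)
```

Reporting the per-discipline breakdown alongside the overall number is one way to see where a model's weaknesses concentrate, which a single aggregate score would hide.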

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos