MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

2024-05-23

Abstract: With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The main problem this paper addresses is that existing benchmarks are inadequate for evaluating the dynamic video understanding capabilities of multi-modal large language models (MLLMs). Specifically:

- **Issues in evaluating dynamic understanding capabilities**: Most existing benchmarks focus on static image tasks and neglect temporal understanding in dynamic videos. Although some studies attempt to assess the temporal awareness of MLLMs in videos, these attempts are either limited to very basic tasks (such as action recognition and prediction) or restricted to specific domains or scenes (such as indoor scenes). As a result, the temporal understanding skills of MLLMs cannot be evaluated comprehensively.
- **High cost of manual annotation**: Existing evaluation methods typically require a large amount of manual annotation, which is both time-consuming and costly.

To address these issues, the paper proposes a comprehensive multi-modal video understanding benchmark named MVBench, aimed at thoroughly evaluating the temporal awareness of MLLMs in an open world. It systematically defines temporal tasks by transforming static tasks into dynamic ones, and it automatically converts public video annotations into multiple-choice question-answer pairs, which reduces manual intervention and improves the efficiency and fairness of the evaluation. Additionally, the paper develops a strong video MLLM baseline, VideoChat2, to narrow the current gap in temporal understanding performance.
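The idea of turning an existing video annotation into a multiple-choice question and scoring a model against the ground-truth option can be illustrated with a minimal Python sketch. This is not the paper's actual pipeline: the `MultipleChoiceItem` type, the `annotation_to_item` and `accuracy` helpers, the choice of four options, and the strategy of sampling distractors from other labels in the same dataset are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """One multiple-choice question derived from a video annotation (hypothetical type)."""
    question: str
    options: list
    answer_index: int  # position of the ground-truth label inside `options`


def annotation_to_item(question, true_label, label_pool, num_options=4, rng=None):
    """Build a multiple-choice item from a ground-truth label.

    Distractors are sampled from the other labels in `label_pool`; this
    sampling strategy is an assumption, not the method described in the paper.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    candidates = sorted(set(label_pool) - {true_label})
    if len(candidates) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(candidates, num_options - 1) + [true_label]
    rng.shuffle(options)  # avoid always placing the answer last
    return MultipleChoiceItem(question, options, options.index(true_label))


def accuracy(items, predicted_indices):
    """Fraction of items where the predicted option index matches the ground truth."""
    if len(items) != len(predicted_indices):
        raise ValueError("items and predictions must have the same length")
    correct = sum(item.answer_index == pred
                  for item, pred in zip(items, predicted_indices))
    return correct / len(items)
```

Because the correct option comes directly from the annotation, scoring reduces to comparing option indices, with no LLM judge involved. This is the property the summary refers to as evaluation fairness.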

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

LVBench: An Extreme Long Video Understanding Benchmark

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Towards Event-oriented Long Video Understanding

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs