VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang
2024-06-17
Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of a comprehensive benchmark for evaluating the video understanding and reasoning capabilities of large multimodal models (LMMs). Although video analysis has advanced significantly, that progress has been measured mainly on specific video question-answering benchmarks such as ActivityNet, WildQA, and MSRVTT-QA. These datasets usually contain short video clips and focus on understanding rather than reasoning. Benchmarks such as MVBench collect videos for a range of specific tasks, but their video sources are limited and they mainly target understanding and temporal reasoning over short videos. As a result, long videos, diverse video categories, and varied reasoning tasks remain poorly covered.

To fill this gap, the authors propose VideoVista, a video question-answering benchmark covering diverse content categories, durations, and abilities. VideoVista comprises 25,000 questions derived from 3,400 videos in 14 categories, with durations ranging from a few seconds to more than 10 minutes, and covers 19 understanding tasks and 8 reasoning tasks. With this benchmark, the authors aim to advance LMMs that can accurately understand video content and perform precise reasoning.

Specifically, the goals of the paper are:

1. **Construct a versatile video question-answering benchmark**: The benchmark spans diverse video content and durations, and defines a variety of understanding and reasoning tasks to evaluate model capabilities comprehensively.
2. **Develop an automatic data construction framework**: The framework combines GPT-4o with analysis tools for video splitting, object segmentation, and tracking to build evaluation data, and is also used to construct training data for video-related LMMs (Video-LMMs).
3. **Reveal the deficiencies of existing models**: Through an extensive evaluation of cutting-edge models, the paper shows that existing Video-LMMs struggle with long videos, with fine-grained video tasks (such as temporal localization, object tracking, and anomaly detection), and with logical and relational reasoning. In particular, open-source models lag well behind commercial models such as GPT-4o and Gemini-1.5, by about 20 points.

Through these goals, the authors hope that VideoVista can become a key tool for advancing video understanding and reasoning.
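
The per-task findings above rest on breaking accuracy down by task type. The following is a minimal sketch of that kind of breakdown, not the authors' evaluation code. It assumes multiple-choice questions answered with an option letter, and the record fields `task`, `answer`, and `prediction` are hypothetical names rather than the actual VideoVista schema.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task name: fraction of correct predictions} for QA records.

    Each record is a dict with hypothetical keys:
      "task"       - task type, e.g. "anomaly detection"
      "answer"     - gold option letter, e.g. "B"
      "prediction" - the model's chosen option letter
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        task = record["task"]
        totals[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        gold = record["answer"].strip().upper()
        predicted = record["prediction"].strip().upper()
        if gold == predicted:
            correct[task] += 1
    return {task: correct[task] / totals[task] for task in totals}


# Toy records for illustration only; these are not benchmark results.
sample_records = [
    {"task": "anomaly detection", "answer": "A", "prediction": "a"},
    {"task": "anomaly detection", "answer": "C", "prediction": "B"},
    {"task": "logical reasoning", "answer": "D", "prediction": "D "},
]

scores = accuracy_by_task(sample_records)
for task, score in sorted(scores.items()):
    print(f"{task}: {score:.2f}")
```

Reporting one score per task, rather than a single overall number, is what makes weaknesses such as temporal localization or anomaly detection visible instead of being averaged away.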