Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

What problem does this paper attempt to address?

The paper addresses the lack of a comprehensive benchmark for evaluating the video understanding and reasoning capabilities of large multimodal models (LMMs). Although video analysis has advanced significantly, that progress is concentrated on specific video question-answering benchmarks such as ActivityNet, WildQA, and MSRVTT-QA. These datasets usually contain short video clips and focus on understanding rather than reasoning. Other benchmarks, such as MVBench, collect videos for a range of specific tasks, but their video sources are limited and they mainly target understanding and temporal reasoning over short videos. As a result, they cannot adequately evaluate long videos, diverse video categories, or varied video reasoning tasks.

To meet this challenge, the authors propose VideoVista, a video question-answering benchmark covering diverse content categories, durations, and capabilities. VideoVista contains 25,000 questions derived from 3,400 videos in 14 categories, with durations ranging from a few seconds to more than 10 minutes, and covers 19 understanding tasks and 8 reasoning tasks. By constructing such a benchmark, the authors aim to promote the development of LMMs that can accurately understand video content and perform precise reasoning. Specifically, the goals of the paper are:

1. **Construct a versatile video question-answering benchmark**: The benchmark combines diverse video content and durations with a variety of understanding and reasoning tasks, so that model capabilities can be evaluated comprehensively.
2. **Develop an automatic data construction framework**: The framework uses GPT-4o together with other analysis tools, such as video splitting, object segmentation, and tracking, to create large-scale training and evaluation datasets efficiently.
3. **Reveal the deficiencies of existing models**: Through extensive evaluation of cutting-edge models, the paper shows that existing video-related LMMs struggle with long videos, with fine-grained video tasks (such as temporal localization, object tracking, and anomaly detection), and with logical and relational reasoning. In particular, open-source models lag commercial models such as GPT-4o and Gemini-1.5 by about 20 points.

Through these goals, the authors hope that VideoVista can become a key tool for advancing video understanding and reasoning technologies.
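The per-task findings summarized above depend on breaking benchmark accuracy down by task type. The following is a minimal sketch of such a breakdown, assuming multiple-choice questions scored by exact match on the chosen option. The record keys (`task`, `answer`, `prediction`), the helper names, and the toy records are all hypothetical and are not taken from the VideoVista release or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy in percent} for multiple-choice QA records.

    Each record is a dict with hypothetical keys:
    'task' (task type), 'answer' (gold option), 'prediction' (model option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def weakest_tasks(scores, k=3):
    """Return the k task names with the lowest accuracy, lowest first."""
    return sorted(scores, key=scores.get)[:k]


# Toy records for illustration only; these are not benchmark results.
records = [
    {"task": "object_tracking", "answer": "A", "prediction": "B"},
    {"task": "object_tracking", "answer": "C", "prediction": "C"},
    {"task": "anomaly_detection", "answer": "D", "prediction": "A"},
    {"task": "causal_reasoning", "answer": "B", "prediction": "B"},
]

scores = accuracy_by_task(records)
print(scores)
print(weakest_tasks(scores, k=1))
```

Running the same breakdown for two models and subtracting the per-task scores would show where a gap such as the reported 20 points is concentrated.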

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
