Abstract: The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso

What problem does this paper attempt to address?

The main problem this paper addresses is that existing VideoQA (Video Question Answering) datasets fall short in both quality and scale. In particular, they lack fine-grained spatio-temporal details and support for complex reasoning. Specifically:

1. **Dataset Quality and Scale**: Existing VideoQA datasets usually rely on costly manual annotations that often lack the necessary granularity, which limits their scalability and their usefulness for complex reasoning tasks. QA pairs can also be generated automatically with large language models (LLMs), but this approach typically works from video metadata or coarse descriptions, so the resulting pairs miss crucial video details and cannot support fine-grained reasoning.
2. **Video Redundancy and Information Extraction**: Video content is often highly redundant, and key information is scattered across frames, which makes frame-by-frame analysis both costly and prone to information overload. Reducing redundancy while retaining the important information is therefore a key challenge in constructing high-quality VideoQA datasets.

To solve these problems, the paper proposes VideoEspresso, a new large-scale VideoQA dataset. An automated pipeline generates high-quality QA pairs that preserve core spatio-temporal details, and multimodal Chain-of-Thought (CoT) annotations are added to strengthen the model's reasoning ability. The specific methods include:

- **Semantic-aware Key Information Extraction**: Map video frames to the language space and remove redundant frames based on semantic similarity, so that only core information is retained.
- **QA Pair Generation from Multi-frame Descriptions**: Use a large language model (GPT-4o) to generate initial QA pairs, with carefully designed prompts ensuring that the generated pairs involve complex reasoning.
- **Multimodal Chain-of-Thought Annotations**: Generate multimodal CoT data by annotating key objects and their spatio-temporal relationships, which further enriches the reasoning process and improves the interpretability and accuracy of the model.

Through these methods, VideoEspresso provides high-quality VideoQA data as well as strong support for complex video reasoning tasks.
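
The semantic-aware extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each frame has already been mapped to a text caption, it uses a toy bag-of-words vector in place of a real text encoder, and the names `embed`, `cosine`, `select_core_frames` and the `threshold` value of 0.8 are hypothetical choices made for illustration. A frame is kept only when its caption is sufficiently dissimilar from the caption of the last kept frame.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def embed(caption: str) -> Counter:
    """Toy stand-in for a text encoder (assumption): bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_core_frames(captions: List[str], threshold: float = 0.8) -> List[int]:
    """Return indices of frames whose caption differs enough from the last kept one.

    `threshold` is a hypothetical value; frames whose similarity to the last
    kept frame is at or above it are treated as redundant and dropped.
    """
    kept: List[int] = []
    last: Optional[Counter] = None
    for index, caption in enumerate(captions):
        vector = embed(caption)
        if last is None or cosine(last, vector) < threshold:
            kept.append(index)
            last = vector
    return kept


# Illustrative captions (made up for this example), one per sampled frame.
captions = [
    "a man opens the fridge",
    "a man opens the fridge door",
    "the man pours milk into a glass",
    "the man pours milk into a glass",
    "a cat jumps onto the table",
]
core = select_core_frames(captions)
print(core)  # [0, 2, 4]
```

In practice the bag-of-words vectors would be replaced by embeddings from a language model, and the threshold would need tuning per dataset. The greedy comparison against the last kept frame is one simple design choice among several.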

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool

ViLLa: Video Reasoning Segmentation with Large Language Model

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Look, Remember and Reason: Grounded reasoning in videos with language models

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

VideoLLM: Modeling Video Sequence with Large Language Models

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation