VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu
2024-11-22
Abstract: The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the shortcomings of existing VideoQA (Video Question Answering) datasets as a source of high-quality, large-scale data, in particular their lack of fine-grained spatio-temporal detail and of support for complex reasoning. Specifically:

1. **Dataset Quality and Scale**: Existing VideoQA datasets usually rely on costly manual annotations that often lack the necessary granularity, which limits both the scalability of the datasets and their usefulness for complex reasoning tasks. QA pairs can also be generated automatically with large language models (LLMs), but such methods typically work from video metadata or coarse descriptions, so the resulting pairs miss crucial video details and cannot support fine-grained reasoning.
2. **Video Redundancy and Information Extraction**: Video content is often highly redundant, and the key information is scattered across frames, which makes frame-by-frame analysis both costly and prone to information overload. Reducing redundancy while retaining the important information is therefore a key challenge in constructing high-quality VideoQA datasets.

To solve these problems, the paper proposes VideoEspresso, a new large-scale VideoQA dataset. An automated pipeline generates high-quality QA pairs that preserve core spatio-temporal details, and multimodal Chain-of-Thought (CoT) annotations are added to strengthen the reasoning ability of models trained on the data. The specific methods are:

- **Semantic-aware Key Information Extraction**: Map video frames into the language space and remove redundant frames based on semantic similarity, so that only the core information is retained.
- **QA Pair Generation from Multi-frame Descriptions**: Use a large language model (GPT-4o) to generate initial QA pairs, with carefully designed prompts that ensure the generated pairs require complex reasoning.
- **Multimodal Chain-of-Thought Annotations**: Generate multimodal CoT data by annotating key objects and their spatio-temporal relationships, which enriches the reasoning process and improves the interpretability and accuracy of the model.

Through these methods, VideoEspresso provides high-quality VideoQA data as well as strong support for complex video reasoning tasks.
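
The semantic-aware redundancy removal step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each frame has already been mapped to a text caption, and it uses a toy bag-of-words vector with cosine similarity in place of a real text encoder. The names `embed`, `cosine`, and `select_core_frames`, the example captions, and the `0.8` threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(caption: str) -> Counter:
    """Toy bag-of-words vector; a stand-in (assumption) for a real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_core_frames(captions: list[str], threshold: float = 0.8) -> list[int]:
    """Keep a frame only if its caption is dissimilar to the last kept frame.

    The threshold is an illustrative value, not one taken from the paper.
    """
    kept: list[int] = []
    last_vec = None
    for i, caption in enumerate(captions):
        vec = embed(caption)
        if last_vec is None or cosine(vec, last_vec) < threshold:
            kept.append(i)
            last_vec = vec
    return kept


# Hypothetical per-frame captions for a short clip.
captions = [
    "a man opens the fridge",
    "a man opens the fridge door",
    "the man pours milk into a glass",
    "the man pours milk into a glass",
    "a cat jumps onto the table",
]
core = select_core_frames(captions)
print(core)  # indices of the frames kept as "core" frames
```

Comparing each frame only with the most recently kept frame keeps the pass linear in the number of frames. A real pipeline would likely use a learned sentence embedding instead, and might compare against a window of kept frames.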