Abstract: The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
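
The abstract does not spell out how the semantic-aware redundancy reduction works. As a rough illustration of the general idea only, the sketch below drops a frame when its embedding is nearly identical to the last frame that was kept. The cosine-similarity criterion, the 0.9 threshold, the function names, and the toy two-dimensional embeddings are all assumptions made for this example and are not taken from the VideoEspresso pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_redundant_frames(embeddings, threshold=0.9):
    """Return indices of frames kept after skipping near-duplicates.

    Hypothetical rule: a frame is kept only if its similarity to the most
    recently kept frame is below `threshold` (an assumed value).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept


# Toy per-frame embeddings (made up); a real pipeline would use a learned encoder.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.0]]
kept = drop_redundant_frames(frames)
print(kept)
```

Comparing against the last kept frame rather than the immediately preceding one keeps slow drifts from accumulating unnoticed; whether the actual pipeline does this is not stated in the abstract.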

Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos

Cross-Modal Reasoning with Event Correlation for Video Question Answering

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Multi-object event graph representation learning for Video Question Answering

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Towards Neuro-Symbolic Video Understanding

Reasoning-Enhanced Object-Centric Learning for Videos

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Look Before you Speak: Visually Contextualized Utterances

Hybrid Reasoning Network for Video-based Commonsense Captioning

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

End-to-end Multi-modal Video Temporal Grounding

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

Modular Action Concept Grounding in Semantic Video Prediction

MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning