Abstract: In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work, which focuses on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks and is deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain text. It evaluates models on both spatiotemporal grounding and reasoning, helping to address complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability of a Multimodal LLM, the pixel-level perception capability of a grounding model (SAM), and the temporal perception ability of a lightweight localization head. MORA achieves respectable performance on GROUNDMORE, outperforming the best existing visual grounding baseline model by an average of 21.5% in relative terms. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses motion understanding in dynamic video scenes. It introduces a new task, **Motion-Grounded Video Reasoning (MGVR)**, to evaluate the reasoning and perception abilities of multimodal models with respect to motion. The task requires the model to generate visual answers (video segmentation masks) in response to an input question, which demands implicit spatiotemporal reasoning and localization.

### Task definition

The basic definition of the **Motion-Grounded Video Reasoning** task is as follows:

- **Input**: A video clip \( V\in\mathbb{R}^{t\times h\times w\times3}\) (where \( t\), \( h\), \( w\) and 3 are the video length, height, width and number of channels, respectively), and a motion-related question \( Q\).
- **Output**: A motion-related binary object segmentation mask \( M\in\mathbb{R}^{t'\times h\times w}\) (where \( t'\leq t\)).

### Task challenges

The main challenges of this task are:

1. **Motion-related reasoning ability**: The model needs to understand the relationship between the target motion and its spatiotemporal context. Consider a video in which "the girl took out the dog food from the cabinet and then fed the dog". To fully understand the motion "feed", the model must perceive its spatial context ("the girl" and "the dog food") as well as its temporal context ("took out the dog food from the cabinet").
2. **Pixel-level understanding of moving objects**: The model must not only reason out the answer but also generate a series of spatiotemporal masks to represent it, because language-only output is vulnerable to biases (for example, in common ball-related videos, when asked about the action "play", existing QA models tend to answer "ball" even without visual cues). Requiring a visual response makes it possible to verify whether the model knows when the motion occurs and which objects are involved.

### Dataset

To support this new task, the authors collected a large-scale dataset, **GROUNDMORE**, with the following characteristics:

- **Number of videos**: 1,715 video clips.
- **Question types**: 4 types of questions (causal, sequential, counterfactual and descriptive).
- **Object masks**: 249,500 object masks, involving 3,942 different objects.
- **Average video length**: 9.61 seconds.

### Baseline model

To evaluate this new task, the authors proposed a new baseline model, **Motion-Grounded Video Reasoning Assistant (MORA)**. MORA combines the reasoning ability of a multimodal LLM, the pixel-level perception ability of the SAM model and the temporal perception ability of a lightweight localization head. Experimental results show that MORA achieves a significant performance improvement on the GROUNDMORE dataset, but there is still much room for improvement.

### Contributions

1. **Introducing a new task**: Proposed the Motion-Grounded Video Reasoning task, filling the gap between referring VOS/action detection and motion-related video reasoning.
2. **Constructing a large-scale dataset**: Collected the GROUNDMORE dataset for evaluating the comprehensive motion understanding ability of models.
3. **Evaluating existing models**: Conducted a comprehensive evaluation of existing image/video localization baseline models, revealing their deficiencies in motion understanding.
4. **Proposing a new baseline model**: Proposed the MORA model, which achieves SOTA performance on GROUNDMORE and points out directions for future improvement.

In conclusion, this paper advances research on motion understanding by introducing the MGVR task and the GROUNDMORE dataset, providing new challenges and opportunities for future multimodal video reasoning.
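The input/output contract in the task definition can be made concrete with a small NumPy sketch. It only illustrates the shapes involved, a video of shape `(t, h, w, 3)` and an answer of shape `(t', h, w)` with `t' <= t`, and it is not the authors' implementation. The function names `check_task_io` and `gate_masks_by_time`, and the per-frame `active` flags standing in for a temporal localization output, are hypothetical and introduced here purely for illustration.

```python
import numpy as np


def check_task_io(video: np.ndarray, masks: np.ndarray) -> None:
    """Validate the shapes described in the task definition (illustrative only)."""
    if video.ndim != 4 or video.shape[-1] != 3:
        raise ValueError("video must have shape (t, h, w, 3)")
    t, h, w, _ = video.shape
    if masks.ndim != 3 or masks.shape[1:] != (h, w):
        raise ValueError("masks must have shape (t', h, w)")
    if masks.shape[0] > t:
        raise ValueError("t' must not exceed t")
    if not np.isin(masks, (0, 1)).all():
        raise ValueError("masks must be binary")


def gate_masks_by_time(frame_masks: np.ndarray, active: np.ndarray) -> np.ndarray:
    """Keep only the masks of frames flagged as containing the queried motion.

    frame_masks: (t, h, w) binary masks, one per video frame.
    active: (t,) boolean flags; a hypothetical stand-in for the output of a
        temporal localization step.
    Returns an array of shape (t', h, w), where t' is the number of True flags.
    """
    if frame_masks.shape[0] != active.shape[0]:
        raise ValueError("need exactly one flag per frame")
    return frame_masks[active.astype(bool)]


def demo() -> tuple:
    """Build a toy 8-frame clip and return the shape of the gated answer."""
    video = np.zeros((8, 4, 6, 3), dtype=np.uint8)
    frame_masks = np.zeros((8, 4, 6), dtype=np.uint8)
    frame_masks[:, 1:3, 2:5] = 1  # the same toy object region in every frame
    # Assume the motion happens only in frames 2, 3 and 4.
    active = np.array([False, False, True, True, True, False, False, False])
    answer = gate_masks_by_time(frame_masks, active)
    check_task_io(video, answer)
    return answer.shape
```

In this toy setup `demo()` returns a shape with three frames, since only three frames are flagged as active. Separating the "when" (frame flags) from the "where" (per-frame masks) mirrors the split between temporal and pixel-level perception described above, but how MORA actually combines the two is not specified in this summary.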

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
