Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen
Abstract: In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work focusing on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips, 249K object masks that are deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain texts. It evaluates models on both spatiotemporal grounding and reasoning, fostering to address complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability from the Multimodal LLM, the pixel-level perception capability from the grounding model (SAM), and the temporal perception ability from a lightweight localization head. MORA achieves respectable performance on GROUNDMORE outperforming the best existing visual grounding baseline model by an average of 21.5% relatively. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses motion understanding in dynamic video scenes. It introduces a new task, **Motion-Grounded Video Reasoning (MGVR)**, to evaluate the reasoning and perception abilities of multimodal models with respect to motion. The task requires a model to generate visual answers (video segmentation masks) in response to an input question, which in turn demands implicit spatiotemporal reasoning and grounding.
### Task definition
The **Motion-Grounded Video Reasoning** task is defined as follows:
- **Input**: A video clip \( V\in\mathbb{R}^{t\times h\times w\times3}\) (where \( t\), \( h\), \( w\) and 3 are the video length, height, width and number of channels respectively), and a motion-related question \( Q\).
- **Output**: A motion-related binary object segmentation mask \( M\in\mathbb{R}^{t'\times h\times w}\) (where \( t'\leq t\)).
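The input and output shapes above can be made concrete with a short sketch. This is only an illustration of the task signature under stated assumptions: `check_mgvr_io` is a hypothetical helper, not part of the paper's code, and NumPy arrays stand in for the video and mask tensors.

```python
import numpy as np

def check_mgvr_io(video: np.ndarray, mask: np.ndarray) -> None:
    """Validate one (video, mask) pair against the task definition.

    Hypothetical helper for illustration only.
    """
    # Video V has shape (t, h, w, 3): length, height, width, channels.
    if video.ndim != 4 or video.shape[-1] != 3:
        raise ValueError("video must have shape (t, h, w, 3)")
    t, h, w, _ = video.shape
    # Mask M has shape (t', h, w) and shares the video's spatial size.
    if mask.ndim != 3 or mask.shape[1:] != (h, w):
        raise ValueError("mask must have shape (t', h, w)")
    # The mask may cover only part of the clip: t' <= t.
    if mask.shape[0] > t:
        raise ValueError("mask covers more frames than the video (t' > t)")
    # The mask is binary: every entry is 0 or 1.
    if not np.isin(mask, (0, 1)).all():
        raise ValueError("mask must be binary")

# Toy sample: 16 frames of 64x96 RGB video, answer mask over 5 frames.
video = np.zeros((16, 64, 96, 3), dtype=np.uint8)
mask = np.zeros((5, 64, 96), dtype=np.uint8)
mask[:, 20:40, 30:60] = 1
check_mgvr_io(video, mask)  # passes silently
```

The question \( Q\) is omitted here because it is plain text and does not constrain the tensor shapes.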
### Task challenges
The main challenges of this task are:
1. **Motion-related reasoning ability**: The model needs to understand the relationship between the target motion and its spatiotemporal context. Consider a video in which "the girl took out the dog food from the cabinet and then fed the dog". Fully understanding the motion "feed" requires perceiving its spatial context ("the girl" and "the dog food") as well as its temporal context ("took out the dog food from the cabinet").
2. **Pixel-level understanding of moving objects**: The model must not only reason out the answer but also produce a sequence of spatiotemporal masks that represents it, because language-only output is prone to bias. For example, in common ball-related videos, existing QA models asked about the action "play" tend to answer "ball" even without visual cues. A visual response makes it possible to verify whether the model knows when the motion occurs and which objects are involved.
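To illustrate why a mask answer exposes both the "when" and the "where" of a motion, the sketch below recovers a temporal span and a bounding box from a binary mask. It assumes, purely for illustration, that the answer is stored as a full-length mask whose frames are empty outside the motion; `motion_span` and `bounding_box` are hypothetical helper names, not functions from the paper.

```python
import numpy as np

def motion_span(mask: np.ndarray):
    """Return (first, last) frame indices that contain foreground, or None."""
    # A frame is "active" if any of its pixels is foreground.
    active = mask.reshape(mask.shape[0], -1).any(axis=1)
    frames = np.flatnonzero(active)
    if frames.size == 0:
        return None
    return int(frames[0]), int(frames[-1])

def bounding_box(frame_mask: np.ndarray):
    """Return (y_min, x_min, y_max, x_max) of the foreground in one frame."""
    ys, xs = np.nonzero(frame_mask)
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())

# Toy answer: a 3x3 object visible in frames 3..6 of a 10-frame clip.
answer = np.zeros((10, 8, 8), dtype=bool)
answer[3:7, 2:5, 1:4] = True

span = motion_span(answer)      # (3, 6): when the motion occurs
box = bounding_box(answer[3])   # (2, 1, 4, 3): where the object is
```

A text answer such as "ball" carries neither piece of information, which is why it cannot be checked against the video in the same way.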
### Dataset
To support this new task, the authors collected a large-scale dataset, **GROUNDMORE**, with the following characteristics:
- **Number of videos**: 1,715 video clips.
- **Question types**: 4 types of questions (causal, sequential, counterfactual and descriptive).
- **Object masks**: 249,500 object masks, involving 3,942 different objects.
- **Average video length**: 9.61 seconds.
### Baseline model
As a reference point for the new task, the authors propose a baseline model, the **Motion-Grounded Video Reasoning Assistant (MORA)**. MORA combines the reasoning ability of a multimodal LLM, the pixel-level perception ability of the SAM grounding model, and the temporal perception ability of a lightweight localization head. On GROUNDMORE, MORA outperforms the best existing visual grounding baseline by an average of 21.5% in relative terms, but there is still much room for improvement.
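The abstract reports MORA's gain over the best existing baseline as relative (21.5% on average) rather than as an absolute difference in score. A minimal sketch of that distinction follows; the scores used are made-up illustrative numbers, not results from the paper, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score * 100.0

# Illustrative numbers only (NOT taken from the paper).
baseline = 20.0
improved = 25.0

absolute = improved - baseline                # 5.0 points
relative = relative_gain(improved, baseline)  # 25.0 percent
```

The same 5-point difference would be a smaller relative gain against a stronger baseline, so a relative figure should be read together with the absolute scores it is derived from.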
### Contributions
1. **Introducing a new task**: The paper proposes the Motion-Grounded Video Reasoning task, bridging the gap between referring VOS/action detection and motion-related video reasoning.
2. **Constructing a large-scale dataset**: The authors collected the GROUNDMORE dataset for evaluating the comprehensive motion understanding ability of models.
3. **Evaluating existing models**: The authors comprehensively evaluated existing image/video grounding baseline models, revealing their deficiencies in motion understanding.
4. **Proposing a new baseline model**: The authors proposed the MORA model, which achieves state-of-the-art performance on GROUNDMORE and points out directions for future improvement.
In conclusion, by introducing the MGVR task and the GROUNDMORE dataset, this paper advances research on motion understanding and opens up new challenges and opportunities for multimodal video reasoning.