MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang
2024-09-04
Abstract: Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts comparably reasonable hand trajectories compared to the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of understanding human intentions and actions from egocentric (first-person) video, focusing on the task of hand trajectory prediction. Specifically, it targets three issues:

1. **Camera egomotion**:
   - Predicting future hand trajectories directly on the 2D image plane is spatially ambiguous: motion observed in 2D pixels can differ substantially from the actual 3D physical action.
   - The paper uses the camera wearer's egomotion to narrow this gap.
2. **Missing affordance labels**:
   - Existing hand trajectory prediction models typically rely on affordance labels to optimize the distribution of hand waypoints, but these labels are hard to obtain and automatically generated labels are of low quality.
   - The paper leverages a foundation model that fuses visual and language features, capturing high-level semantics without affordance labels.
3. **Causality and motion continuity**:
   - Convolutional and transformer models for hand trajectory prediction overlook temporal causality and motion continuity, making it difficult to capture the association between hands and body.
   - The paper designs a new loss function to better optimize hand trajectory prediction, aligning it with the underlying physical model.

To address these issues, the authors propose MADiff (Motion-Aware Mamba Diffusion Models), a diffusion-based method for predicting future hand trajectories. It leverages the camera wearer's egomotion and combines visual and language features to improve the accuracy and stability of the predictions.
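
To make the idea of a scan that is both causal and conditioned on camera egomotion more concrete, here is a minimal toy sketch. It is not the authors' MDSS: this page does not specify how MADiff injects egomotion into the Mamba state update, so the function name `motion_driven_scan`, the scalar state, the sigmoid gate, the weights `w_input` and `w_motion`, and the representation of egomotion as per-step `(dx, dy)` displacements are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def motion_driven_scan(features, egomotion, w_input=1.0, w_motion=1.0):
    """Toy scalar selective scan whose gates depend on camera egomotion.

    features:  one latent value per time step (hypothetical input)
    egomotion: one (dx, dy) camera displacement per time step (hypothetical input)
    Returns the hidden state after each step.
    """
    if len(features) != len(egomotion):
        raise ValueError("features and egomotion must have equal length")

    h = 0.0
    outputs = []
    for x, (dx, dy) in zip(features, egomotion):
        motion = math.hypot(dx, dy)  # magnitude of the camera displacement
        # Step size depends on both the input and the egomotion (assumed form).
        delta = sigmoid(w_input * x + w_motion * motion)
        a = math.exp(-delta)  # fraction of the past state that is kept, in (0, 1)
        b = 1.0 - a           # fraction of the new input that is written
        h = a * h + b * x     # causal recurrence: only current and past steps contribute
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    feats = [0.5, -0.2, 0.9]
    ego = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    print(motion_driven_scan(feats, ego))
```

In this toy version, larger camera motion (with a positive `w_motion`) increases the step size, so the state leans more on the current input and less on history. Because the recurrence runs strictly forward in time, changing a later input never alters an earlier output, which is the causality property the summary says convolutional and transformer models overlook.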