MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang

2024-09-04

Abstract: Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts comparably reasonable hand trajectories compared to the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the problem of understanding human intentions and actions from first-person (egocentric) video, focusing on the task of hand trajectory prediction. Specifically, it targets the following issues:

1. **Camera Motion Guidance Issue**:
   - Directly predicting future hand trajectories on the 2D image plane is spatially ambiguous: motion measured in 2D pixels can differ significantly from the actual 3D physical action.
   - The paper uses the motion information of the camera wearer (egomotion) to narrow this gap.
2. **Lack of Affordance Labels Issue**:
   - Existing hand trajectory prediction models typically require affordance (object manipulability) labels to optimize the hand trajectory distribution, but these labels are difficult to obtain and automatically generated labels are of low quality.
   - The paper leverages a foundation model that fuses visual and language features, capturing high-level semantics without the need for affordance labels.
3. **Causality and Motion Continuity Constraints Issue**:
   - Convolutional and transformer models used for hand trajectory prediction overlook causality and motion continuity, making it difficult to capture the association between hands and body.
   - The paper designs a new loss function to better optimize hand trajectory prediction, aligning it with the underlying physical model.

To address these issues, the authors propose MADiff (Motion-Aware Mamba Diffusion Models), a new diffusion-based method for predicting future hand trajectories. The method leverages the motion information of the camera wearer and combines visual and language features to improve the accuracy and stability of the predictions.
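To make the phrase "diffusion-based" more concrete, the following is a minimal sketch of the standard DDPM forward (noising) process applied to a short sequence of 2D hand waypoints. It is not the authors' implementation: MADiff denoises in a latent space with a motion-aware Mamba, whereas this sketch operates directly on waypoint coordinates, and the linear noise schedule, the function names (`make_alpha_bar`, `add_noise`, `estimate_x0`), and the toy waypoints are all assumptions chosen for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def estimate_x0(xt, eps_hat, t, alpha_bar):
    """Invert the forward process given a noise estimate eps_hat.

    In a trained model, eps_hat would come from the denoising network;
    here the true noise is passed in as a sanity check.
    """
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)

# Toy future hand waypoints in normalized image coordinates, shape (4, 2).
waypoints = np.stack(
    [np.linspace(0.3, 0.6, 4), np.linspace(0.5, 0.4, 4)], axis=1
)

alpha_bar = make_alpha_bar()
xt, eps = add_noise(waypoints, 500, alpha_bar, rng)
recovered = estimate_x0(xt, eps, 500, alpha_bar)
```

With the true noise supplied, `recovered` matches `waypoints` up to floating-point error. A diffusion-based predictor instead learns to estimate the noise (or the clean signal) from conditioning information, which in the paper's setting includes past observations and the camera wearer's egomotion.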

Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

Forecasting Distillation: Enhancing 3D Human Motion Prediction with Guidance Regularization

A motion conditioned diffusion model for real-time hand trajectory semantic prediction

EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos

Enhanced Multimodal Trajectory Prediction for Autonomous Vehicles Using Advanced Diffusion Model Techniques

Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement

GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction

HumanMAC: Masked Motion Completion for Human Motion Prediction

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

E-Motion: Future Motion Simulation via Event Sequence Diffusion

MV-Diffusion: Motion-aware Video Diffusion Model

Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction

Expressive Forecasting of 3D Whole-body Human Motions

ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion

DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty

Motion Latent Diffusion for Stochastic Trajectory Prediction

Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model

GIMO: Gaze-Informed Human Motion Prediction in Context

EgoNav: Egocentric Scene-aware Human Trajectory Prediction