Abstract: Understanding how humans behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To this end, several recent works simultaneously forecast hand trajectories and object affordances in human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, existing approaches mostly adopt an autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence and accumulates errors along the time axis. These works also largely overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, Diff-IP2D, which forecasts future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into a latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines on both off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at https://github.com/IRMVLab/Diff-IP2D.

EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos

Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

Generative Adversarial Network for Future Hand Segmentation from Egocentric Video

Expressive Forecasting of 3D Whole-body Human Motions

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Ego-Body Pose Estimation via Ego-Head Pose Estimation

MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes

GIMO: Gaze-Informed Human Motion Prediction in Context

3D Hand Pose Estimation in Everyday Egocentric Images

EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition

Estimating Body and Hand Motion in an Ego-sensed World

EgoMimic: Scaling Imitation Learning via Egocentric Video

Egocentric Human Activities Recognition With Multimodal Interaction Sensing

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Egocentric Prediction of Action Target in 3D

EgoNav: Egocentric Scene-aware Human Trajectory Prediction

Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Intention-Conditioned Long-Term Human Egocentric Action Forecasting