Abstract: Generating natural hand-object interactions in 3D is challenging because the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic one- or two-handed object interactions from text prompts and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use a separate diffusion model for each. In the grasping stage, the model generates only hand motions, whereas in the interaction stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two guidance schemes that allow more control over the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction stage. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they give our method more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by using a hand pose estimate from an off-the-shelf pose estimator for guidance and then sampling multiple different actions in the interaction stage.
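To make the idea of grasp guidance concrete, the following is a minimal toy sketch, not the DiffH2O implementation. The abstract does not say how the guidance signal is computed, so the sketch assumes a simple squared-distance objective on the final frame and applies one gradient step after every denoising step. The names `toy_denoiser`, `guide_toward_grasp`, and `sample_grasp_motion`, the pose dimensionality, and the step size are all hypothetical, and the denoiser is a stand-in for a learned network.

```python
import numpy as np


def toy_denoiser(noisy_motion, t):
    """Hypothetical stand-in for a learned denoising network.

    It only shrinks the sample toward zero; a real model would predict
    a cleaner motion from the noisy one and the timestep t.
    """
    return 0.9 * noisy_motion


def guide_toward_grasp(motion, target_grasp, step_size=0.5):
    """Take one gradient step on 0.5 * ||motion[-1] - target_grasp||^2.

    Only the last frame is changed, because the assumed objective only
    constrains the pose at the end of the grasping stage.
    motion: (num_frames, pose_dim) array; target_grasp: (pose_dim,) array.
    """
    guided = motion.copy()
    gradient = guided[-1] - target_grasp
    guided[-1] = guided[-1] - step_size * gradient
    return guided


def sample_grasp_motion(target_grasp, num_frames=8, num_steps=50,
                        step_size=0.5, seed=0):
    """Run a toy reverse process, interleaving denoising and guidance."""
    rng = np.random.default_rng(seed)
    motion = rng.standard_normal((num_frames, target_grasp.shape[0]))
    for t in reversed(range(num_steps)):
        motion = toy_denoiser(motion, t)
        motion = guide_toward_grasp(motion, target_grasp, step_size)
    return motion


# Hypothetical 3-dimensional "pose" used purely for illustration.
target = np.array([1.0, -2.0, 0.5])
guided = sample_grasp_motion(target, step_size=0.5)
unguided = sample_grasp_motion(target, step_size=0.0)
print("guided error:  ", np.linalg.norm(guided[-1] - target))
print("unguided error:", np.linalg.norm(unguided[-1] - target))
```

In this toy setup the guided sample ends closer to the target pose than the unguided one, which is the qualitative behaviour the abstract attributes to grasp guidance. A real system would more likely apply the guidance gradient to the model's predicted clean sample or to the posterior mean, and would operate on the paper's coupled hand-object representation rather than a raw pose vector.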

GraspDiff: Grasping Generation for Hand-Object Interaction With Multimodal Guided Diffusion

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models

GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

FastGrasp: Efficient Grasp Synthesis with Diffusion

UGG: Unified Generative Grasping

DexGrasp-Diffusion: Diffusion-based Unified Functional Grasp Synthesis Method for Multi-Dexterous Robotic Hands

Grasp Diffusion Network: Learning Grasp Generators from Partial Point Clouds with Diffusion Models in SO(3)xR3

Diffusion for Multi-Embodiment Grasping

DVGG: Deep Variational Grasp Generation for Dextrous Manipulation

DexGrasp-Diffusion: Diffusion-based Unified Functional Grasp Synthesis Pipeline for Multi-Dexterous Robotic Hands

ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion

DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

DexDiffuser: Generating Dexterous Grasps with Diffusion Models

GraspLDM: Generative 6-DoF Grasp Synthesis using Latent Diffusion Models

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

DMFC-GraspNet: Differentiable Multi-Fingered Robotic Grasp Generation in Cluttered Scenes

ALDM-Grasping: Diffusion-aided Zero-Shot Sim-to-Real Transfer for Robot Grasping

Constrained 6-DoF Grasp Generation on Complex Shapes for Improved Dual-Arm Manipulation

GenDexGrasp: Generalizable Dexterous Grasping