ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion

Jiajun Zhang, Yuxiang Zhang, Liang An, Mengcheng Li, Hongwen Zhang, Zonghai Hu, Yebin Liu
2024-09-14
Abstract: Dynamic and dexterous manipulation of objects presents a complex challenge, requiring the synchronization of hand motions with the trajectories of objects to achieve seamless and physically plausible interactions. In this work, we introduce ManiDext, a unified hierarchical diffusion-based framework for generating hand manipulation and grasp poses based on 3D object trajectories. Our key insight is that accurately modeling the contact correspondences between objects and hands during interactions is crucial. Therefore, we propose a continuous correspondence embedding representation that specifies detailed hand correspondences at the vertex level between the object and the hand. This embedding is optimized directly on the hand mesh in a self-supervised manner, with the distance between embeddings reflecting the geodesic distance. Our framework first generates contact maps and correspondence embeddings on the object's surface. Based on these fine-grained correspondences, we introduce a novel approach that integrates the iterative refinement process into the diffusion process during the second stage of hand pose generation. At each step of the denoising process, we incorporate the current hand pose residual as a refinement target into the network, guiding the network to correct inaccurate hand poses. Introducing residuals into each denoising step inherently aligns with the traditional optimization process, effectively merging generation and refinement into a single unified framework. Extensive experiments demonstrate that our approach can generate physically plausible and highly realistic motions for various tasks, including single and bimanual hand grasping as well as manipulating both rigid and articulated objects. Code will be available for research purposes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to generate realistic, physically plausible hand manipulation poses, for one or both hands, from object motion trajectories. The authors propose a framework named ManiDext that synthesizes dexterous hand manipulation given only the object's motion trajectory. The task poses two main challenges:

1. **Accurately modeling the contact area between the hand and the object**: The contact area changes dynamically during manipulation and must be modeled accurately.
2. **Generating a physically plausible motion sequence**: The generated hand motions must be physically plausible, avoiding unnatural movements, penetration into the object, and detachment from its surface.

### Main contributions

1. **ManiDext framework**: Described as the first diffusion-based framework designed specifically to generate bimanual manipulation from object motion trajectories.
2. **Continuous Correspondence Embedding**: This representation models the hand-object correspondence more accurately than earlier contact probability maps or discrete hand labels. It provides detailed corresponding hand-vertex information for each object vertex. The embedding vectors are obtained through self-supervised optimization on the hand mesh, so that the distance between two embeddings reflects the geodesic distance.
3. **Residual-Guided Diffusion Module**: This module feeds the current hand-pose residual as a condition into every denoising step of the diffusion process, merging generation and refinement so that the generated hand poses are more accurate and natural.
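
Contributions 2 and 3 can be illustrated with a toy sketch. It is not the authors' implementation: the nearest-neighbour decoding of correspondences, the contact-weighted residual, and every function name below are assumptions made for illustration. The learned denoising network is replaced by a hand-written stand-in update, and the noise schedule is omitted entirely, so the loop shows only the idea of recomputing the residual at every step and passing it to the update.

```python
import numpy as np


def decode_correspondence(obj_emb, hand_emb):
    """For each object point, return the index of the hand vertex whose
    embedding is closest in Euclidean distance (assumed decoding rule)."""
    d = np.linalg.norm(obj_emb[:, None, :] - hand_emb[None, :, :], axis=-1)
    return d.argmin(axis=1)


def contact_residual(hand_verts, obj_points, corr_idx, contact_prob):
    """Contact-weighted offset from each matched hand vertex to its
    object point, shape (M, 3). Hypothetical residual definition."""
    return contact_prob[:, None] * (obj_points - hand_verts[corr_idx])


def toy_denoiser(hand_verts, residual, corr_idx, step_size=0.5):
    """Stand-in for the learned network: moves the matched hand vertices
    a fraction of the way along the residual."""
    out = hand_verts.copy()
    np.add.at(out, corr_idx, step_size * residual)
    return out


def residual_guided_loop(hand_verts, obj_points, obj_emb, hand_emb,
                         contact_prob, num_steps=10):
    """Recompute the residual at every step and pass it to the update."""
    corr_idx = decode_correspondence(obj_emb, hand_emb)
    for _ in range(num_steps):
        residual = contact_residual(hand_verts, obj_points,
                                    corr_idx, contact_prob)
        hand_verts = toy_denoiser(hand_verts, residual, corr_idx)
    return hand_verts, corr_idx


# Toy data: 3 hand vertices with one-hot embeddings, 2 object points.
hand_emb = np.eye(3)
obj_emb = np.array([[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]])
hand_verts = np.zeros((3, 3))
obj_points = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
contact_prob = np.ones(2)

refined, corr = residual_guided_loop(hand_verts, obj_points,
                                     obj_emb, hand_emb, contact_prob)
print("matched hand vertices:", corr)
print("remaining residual norm:",
      np.linalg.norm(contact_residual(refined, obj_points, corr, contact_prob)))
```

In this toy setup the residual shrinks at every step because the stand-in update moves each matched vertex halfway toward its object point. In the framework described above, a trained network would consume the residual as conditioning instead.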
### Method overview

ManiDext adopts a hierarchical diffusion framework with two stages:

- **First stage**: Generate contact information from the object motion trajectories, namely contact probability maps and continuous correspondence embedding maps. This information provides a dense correspondence between the object surface and the hand vertices, from which geometric residuals can be computed.
- **Second stage**: Generate the final hand manipulation poses, conditioned on the generated contact information and the computed residuals. Iterative denoising steps gradually refine the hand poses toward physical plausibility and smooth transitions.

### Experimental verification

The authors conducted quantitative and qualitative experiments on multiple datasets, including ARCTIC, GRAB and HOI4D, covering single-hand and bimanual interaction with rigid and articulated objects. The results show that the method generates high-quality, smooth, dynamic and physically plausible manipulation poses across a wide range of hand-object interaction modes.

### Formula summary

- **Geodesic distance matrix**:

\[ G_{ij} = \text{geodesic distance between vertex } s_i \text{ and vertex } s_j \]

- **Embedding vector optimization objective**:

\[ L_{\text{embedding}} = L_{\text{bce}}(\Phi_{\text{emd}}, \Phi_{\text{gt}}) \]

where, for each vertex pair \((i, j)\), with \(E_i\) the embedding of vertex \(s_i\),

\[ \Phi_{\text{emd}} = \exp(-\|E_i - E_j\|) \]

\[ \Phi_{\text{gt}} = \exp\left(-\frac{G_{ij}^2}{2\sigma^2}\right) \]

Here \(L_{\text{bce}}\) is the binary cross-entropy loss, and \(\sigma\) controls how quickly the target similarity decays with geodesic distance.

Together, these components aim to improve the quality of hand-manipulation synthesis, with potential applications in virtual reality, robotics and human-computer interaction.
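
The embedding objective in the formula summary can be written out in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: the geodesic matrix is supplied precomputed, the loss is averaged over all vertex pairs, the norm is Euclidean, and the values of `sigma` and the clipping constant `eps` are arbitrary choices for the example.

```python
import numpy as np


def embedding_loss(E, G, sigma=1.0, eps=1e-7):
    """Binary cross-entropy between embedding similarity and a
    geodesic-based target similarity, averaged over all vertex pairs.

    E: (N, D) array of per-vertex embeddings.
    G: (N, N) array of precomputed geodesic distances.
    """
    # Pairwise Euclidean distances between embeddings, shape (N, N).
    dist = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    phi_emd = np.exp(-dist)                            # predicted similarity
    phi_gt = np.exp(-(G ** 2) / (2.0 * sigma ** 2))    # target similarity
    p = np.clip(phi_emd, eps, 1.0 - eps)               # keep the logs finite
    bce = -(phi_gt * np.log(p) + (1.0 - phi_gt) * np.log(1.0 - p))
    return bce.mean()


# Toy example: three vertices on a path, so G_ij = |i - j|.
G = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
ordered = np.array([[0.0], [0.5], [1.0]])    # embedding order follows the path
shuffled = np.array([[0.0], [1.0], [0.5]])   # vertices 1 and 2 swapped

print("ordered embeddings: ", embedding_loss(ordered, G))
print("shuffled embeddings:", embedding_loss(shuffled, G))
```

The ordered embeddings yield the lower loss of the two, which matches the intent that vertices close along the surface end up close in embedding space.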