MMHMR: Generative Masked Modeling for Hand Mesh Recovery

Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Mayur Jagdishbhai Patel, Hongfei Xue, Ahmed Helmy, Srijan Das, Pu Wang
2024-12-18
Abstract: Reconstructing a 3D hand mesh from a single RGB image is challenging due to complex articulations, self-occlusions, and depth ambiguities. Traditional discriminative methods, which learn a deterministic mapping from a 2D image to a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D mapping. To address this challenge, we propose MMHMR, a novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes by learning and sampling from the probabilistic distribution of the ambiguous 2D-to-3D mapping process. MMHMR consists of two key components: (1) a VQ-MANO, which encodes 3D hand articulations as discrete pose tokens in a latent space, and (2) a Context-Guided Masked Transformer that randomly masks out pose tokens and learns their joint distribution, conditioned on corrupted token sequences, image context, and 2D pose cues. This learned distribution facilitates confidence-guided sampling during inference, producing mesh reconstructions with low uncertainty and high precision. Extensive evaluations on benchmark and real-world datasets demonstrate that MMHMR achieves state-of-the-art accuracy, robustness, and realism in 3D hand mesh reconstruction. Project website: https://m-usamasaleem.github.io/publication/MMHMR/mmhmr.html
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of reconstructing a 3D hand mesh from a single RGB image. Existing discriminative methods are limited in complex scenes involving self-occlusion, hand-object interaction, and viewpoint changes, all of which make accurate 3D reconstruction difficult. To address these challenges, the paper proposes MMHMR (Generative Masked Modeling for Hand Mesh Recovery), a generative masked model that synthesizes plausible 3D hand meshes by learning and sampling from the probability distribution of the 2D-to-3D mapping.

### Main Problems

1. **Complex Hand Poses and Occlusions**: Hand poses are highly articulated, especially when the hand occludes itself or interacts with objects, and traditional discriminative methods have difficulty reconstructing accurate 3D hand meshes in these cases.
2. **Depth Uncertainty**: The mapping from a 2D image to 3D space is ambiguous in depth; the same 2D image can correspond to multiple different 3D structures.
3. **Limitations of Existing Methods**: Existing discriminative methods (such as METRO and MeshGraphormer) perform well in some respects, but because their output is deterministic, they still fall short in complex scenes.

### Solutions

The paper proposes a new generative masked model, MMHMR. Its core idea is to learn the probability distribution of the 2D-to-3D mapping and to sample high-confidence 3D hand meshes from it. Specifically:

- **VQ-MANO**: Encodes continuous 3D hand pose parameters as discrete pose tokens and is pre-trained as a vector-quantized variational autoencoder (VQ-VAE).
- **Context-Guided Masked Transformer**: Randomly masks some pose tokens and learns the joint distribution of these tokens conditioned on the input image, 2D pose cues, and the partially masked pose token sequence.
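
The two steps above, tokenization and random masking, can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook values, the `MASK` id, the masking ratio, and the helper names `quantize` and `random_mask` are all assumptions made for illustration, and a real VQ-VAE would learn its codebook and operate on encoder features rather than hand-written 2-D points.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def random_mask(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and the sorted masked positions.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK
    return corrupted, positions


# Toy 2-D latents and a 3-entry codebook (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.2, 0.1)]

tokens = quantize(latents, codebook)  # discrete pose tokens
corrupted, masked_positions = random_mask(tokens, ratio=0.5, rng=random.Random(0))

# A masked transformer would be trained to predict tokens[p] for every p in
# masked_positions, given `corrupted` plus image and 2D pose context.
print(tokens, corrupted, masked_positions)
```

In this toy setup the training target is simply the original token at each masked position; the conditioning on image features and 2D pose cues described above is omitted.
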

This learning process enables the model to perform confidence-guided sampling during inference, generating 3D hand meshes with low uncertainty and high precision.

### Key Contributions

1. **First Use of Generative Masked Modeling**: Synthesizes high-confidence 3D hand meshes by explicitly learning the probabilistic mapping from 2D to 3D.
2. **Design of Context-Guided Masked Transformer**: Effectively fuses multiple context cues, including 2D pose, image features, and unmasked 3D pose tokens.
3. **Differentiated Mask Training**: Learns the distribution of hand pose tokens conditioned on all context cues, so that confidence-guided sampling at inference can generate 3D hand meshes with low uncertainty and high precision.

### Experimental Results

Experiments show that MMHMR significantly outperforms existing methods on multiple benchmark datasets, especially in complex scenes (such as occlusion and hand-object interaction). For example, on the HO3Dv3 dataset, MMHMR reduces PA-MPJPE and PA-MPVPE by approximately 19.5% and 15.7% respectively, demonstrating its accuracy on complex hand poses and occlusions.

In summary, by introducing the generative masked model MMHMR, this paper addresses the challenges of reconstructing 3D hand meshes from a single RGB image, with notable gains in robustness and accuracy in complex scenes.
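
As a rough illustration of the confidence-guided sampling mentioned under Solutions, the sketch below decodes a fully masked token sequence over a few iterations, committing the most confident predictions first. Everything here is an assumption rather than the paper's procedure: the cosine unmasking schedule is borrowed from common masked-generation practice, the sketch picks the most likely token greedily instead of sampling, and `toy_predict` is a hypothetical stand-in for the transformer that ignores image and 2D pose context.

```python
import math

MASK = -1  # hypothetical id standing in for a [MASK] token


def confidence_guided_decode(predict, seq_len, num_steps):
    """Iteratively fill a fully masked sequence, most confident positions first.

    `predict(tokens)` must return one probability distribution per position.
    """
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        probs = predict(tokens)

        # For each masked position, take the most likely token and its probability.
        candidates = []
        for i in masked:
            best = max(range(len(probs[i])), key=lambda k: probs[i][k])
            candidates.append((probs[i][best], i, best))

        # Cosine schedule (an assumption): how many positions stay masked after this step.
        n_remaining = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        n_commit = max(1, len(masked) - n_remaining)

        # Commit the highest-confidence predictions; the rest are re-predicted next step.
        candidates.sort(reverse=True)
        for _, i, tok in candidates[:n_commit]:
            tokens[i] = tok
    return tokens


def toy_predict(tokens):
    """Hypothetical predictor: fixed distributions over a 2-token vocabulary."""
    return [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45]]


decoded = confidence_guided_decode(toy_predict, seq_len=4, num_steps=2)
print(decoded)
```

With a real model, the predictions for the remaining masked positions would change from step to step, because each call to `predict` sees the tokens committed so far; the toy predictor here is static only to keep the example self-contained.
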