Abstract: Reconstructing a 3D hand mesh from a single RGB image is challenging due to complex articulations, self-occlusions, and depth ambiguities. Traditional discriminative methods, which learn a deterministic mapping from a 2D image to a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D mapping. To address this challenge, we propose MMHMR, a novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes by learning and sampling from the probabilistic distribution of the ambiguous 2D-to-3D mapping process. MMHMR consists of two key components: (1) a VQ-MANO, which encodes 3D hand articulations as discrete pose tokens in a latent space, and (2) a Context-Guided Masked Transformer that randomly masks out pose tokens and learns their joint distribution, conditioned on corrupted token sequences, image context, and 2D pose cues. This learned distribution facilitates confidence-guided sampling during inference, producing mesh reconstructions with low uncertainty and high precision. Extensive evaluations on benchmark and real-world datasets demonstrate that MMHMR achieves state-of-the-art accuracy, robustness, and realism in 3D hand mesh reconstruction. Project website: https://m-usamasaleem.github.io/publication/MMHMR/mmhmr.html
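As a rough illustration of what "discrete pose tokens" means, the following Python sketch shows nearest-neighbor codebook lookup, the quantization step used in VQ-VAE-style models. It is not the authors' VQ-MANO: the codebook values, the latent vectors, and the function names `quantize` and `dequantize` are hypothetical, and a real model would learn the encoder, decoder, and codebook jointly.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        distances = [squared_distance(z, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token index."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with 4 entries of dimension 2 (hypothetical values).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pretend these came from an encoder applied to hand pose parameters.
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, CODEBOOK)
reconstructed = dequantize(tokens, CODEBOOK)
print(tokens)         # discrete token indices
print(reconstructed)  # quantized latent vectors
```

Each token is just an integer index into the codebook, which is what lets a transformer treat pose estimation as prediction over a finite vocabulary.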

What problem does this paper attempt to address?

This paper addresses the problem of reconstructing 3D hand meshes from a single RGB image. Existing discriminative methods struggle in complex scenes involving self-occlusion, hand-object interaction, and viewpoint changes, which make accurate 3D hand mesh reconstruction difficult. To address these challenges, the paper proposes a new generative masked model, MMHMR (Generative Masked Modeling for Hand Mesh Recovery), which synthesizes plausible 3D hand meshes by learning and sampling from the probability distribution of the 2D-to-3D mapping.

### Main Problems

1. **Complex Hand Poses and Occlusions**: Hand poses are highly articulated, and traditional discriminative methods have difficulty reconstructing accurate 3D hand meshes when the hand is self-occluded or interacting with objects.
2. **Depth Uncertainty**: The mapping from a 2D image to 3D space is ambiguous in depth; the same 2D image can correspond to multiple different 3D structures.
3. **Limitations of Existing Methods**: Existing discriminative methods (such as METRO and MeshGraphormer) perform well in some respects, but because they produce a single deterministic output, they still fall short in complex scenes.

### Solutions

The paper proposes a new generative masked model, MMHMR. Its core idea is to learn the probability distribution of the 2D-to-3D mapping and sample high-confidence 3D hand meshes from it. Specifically:

- **VQ-MANO**: Encodes continuous 3D hand pose parameters into discrete pose tokens; it is pre-trained as a vector-quantized variational autoencoder (VQ-VAE).
- **Context-Guided Masked Transformer**: Randomly masks some pose tokens and learns their joint distribution conditioned on the input image, 2D pose cues, and the partially masked pose token sequence.

This learning process enables the model to perform confidence-guided sampling during inference, thereby generating 3D hand meshes with low uncertainty and high precision.

### Key Contributions

1. **First Use of Generative Masked Modeling**: Synthesizes high-confidence 3D hand meshes by explicitly learning the probabilistic mapping from 2D to 3D.
2. **Design of Context-Guided Masked Transformer**: Fuses multiple context cues, including 2D pose, image features, and unmasked 3D pose tokens.
3. **Differentiated Mask Training**: Learns the distribution of hand pose tokens conditioned on all context cues, so that confidence-guided sampling at inference generates 3D hand meshes with low uncertainty and high precision.

### Experimental Results

Experiments show that MMHMR significantly outperforms existing methods on multiple benchmark datasets, especially in complex scenes (such as occlusion and hand-object interaction). For example, on the HO3Dv3 dataset, MMHMR reduces PA-MPJPE and PA-MPVPE by approximately 19.5% and 15.7%, respectively, demonstrating its accuracy on complex hand poses and occlusions.

In summary, by introducing the generative masked model MMHMR, this paper addresses the main challenges of reconstructing 3D hand meshes from a single RGB image, with the largest gains in robustness and accuracy in complex scenes.
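The confidence-guided sampling mentioned above can be sketched as an iterative unmasking loop in the style of masked generative transformers. This is a minimal illustration under stated assumptions, not the authors' procedure: the cosine schedule, the greedy argmax (used instead of stochastic sampling so the example is deterministic), the function `confidence_guided_decode`, and the stand-in `toy_predictor` are all hypothetical. In a real system the predictor would be the transformer conditioned on image features and 2D pose cues.

```python
import math
from typing import Callable, List, Optional

MASK = None  # placeholder for a token position that is still masked

Tokens = List[Optional[int]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_guided_decode(predict_logits: Callable[[Tokens], List[List[float]]],
                             num_tokens: int,
                             num_steps: int) -> Tokens:
    """Start fully masked; at each step commit the most confident predictions."""
    tokens: Tokens = [MASK] * num_tokens
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        logits = predict_logits(tokens)
        # For every masked position, record (confidence, position, best code).
        candidates = []
        for i in masked:
            probs = softmax(logits[i])
            best = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[best], i, best))
        # Assumed cosine schedule: how many positions stay masked after this step.
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = math.floor(num_tokens * ratio)
        num_commit = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence predictions; the rest are re-predicted later.
        candidates.sort(reverse=True)
        for _, i, best in candidates[:num_commit]:
            tokens[i] = best
    return tokens


def toy_predictor(tokens: Tokens) -> List[List[float]]:
    """Hypothetical stand-in for the transformer: position i prefers code i % 3,
    with confidence growing with i. It ignores the context entirely."""
    vocab_size = 3
    return [[float(i + 1) if k == i % vocab_size else 0.0 for k in range(vocab_size)]
            for i in range(len(tokens))]


result = confidence_guided_decode(toy_predictor, num_tokens=6, num_steps=3)
print(result)
```

Because low-confidence positions stay masked and are re-predicted once more context has been committed, the final token sequence tends to be built from the model's most certain predictions first.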

MMHMR: Generative Masked Modeling for Hand Mesh Recovery

CAMInterHand: Cooperative Attention for Multi-View Interactive Hand Pose and Mesh Reconstruction

In-Hand 3D Object Reconstruction from a Monocular RGB Video

MH‐HMR: Human mesh recovery from monocular images via multi‐hypothesis learning

VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds

HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models

MEGA: Masked Generative Autoencoder for Human Mesh Recovery

MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling

A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image

3D Hand Mesh Recovery from Monocular RGB in Camera Space

End-to-end Recovery of Human Shape and Pose

End-to-End Weakly-Supervised Single-Stage Multiple 3d Hand Mesh Reconstruction from a Single Rgb Image

DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling

HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images.

Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models

Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering

MLPHand: Real Time Multi-View 3D Hand Reconstruction Via MLP Modeling

SiMA-Hand: Boosting 3D Hand-Mesh Reconstruction by Single-to-Multi-View Adaptation

RealisticHands: A Hybrid Model for 3D Hand Reconstruction

Human Mesh Recovery from Arbitrary Multi-view Images

MMM: Generative Masked Motion Model