Abstract: Recent years have witnessed a trend toward deep integration of the generation and reconstruction paradigms. In this paper, we extend the ability of controllable generative models to a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction, and fitting in a single framework, which we name Holistic Hand Mesh Recovery (HHMR). Our key observation is that different kinds of hand mesh recovery tasks can be achieved by a single generative model with strong multimodal controllability, and in such a framework, realizing different tasks only requires giving different signals as conditions. To achieve this goal, we propose an all-in-one diffusion framework based on graph convolution and attention mechanisms for holistic hand mesh recovery. To achieve strong controllable generation while keeping the multimodal control signals decoupled, we map different modalities to a shared feature space and apply cross-scale random masking at both the modality and feature levels. In this way, the correlation between different modalities can be fully exploited during the learning of hand priors. Furthermore, we propose Condition-aligned Gradient Guidance to enhance the alignment of the generated model with the control signals, which significantly improves the accuracy of hand mesh reconstruction and fitting. Experiments show that our framework can realize multiple hand mesh recovery tasks simultaneously and outperforms existing methods on different tasks, which opens up more possibilities for downstream applications including gesture recognition, pose generation, and mesh editing.
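
As a rough illustration of the two-level random masking mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the function name `random_mask`, the probabilities `p_modality` and `p_feature`, the zero-filling of masked entries, and the `(N, D)` feature layout are all assumptions made for illustration.

```python
import numpy as np

def random_mask(features, p_modality=0.3, p_feature=0.1, rng=None):
    """Randomly mask condition features at two levels (hypothetical helper).

    features: dict mapping a modality name to an array of shape (N, D),
              assumed to be already projected into a shared D-dim space.
    Returns a new dict; masked entries are set to zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = {}
    for name, feat in features.items():
        feat = np.asarray(feat, dtype=float)
        if rng.random() < p_modality:
            # Modality level: drop the whole condition signal.
            masked[name] = np.zeros_like(feat)
            continue
        # Feature level: drop individual rows (e.g. per-node feature vectors).
        keep = rng.random(feat.shape[0]) >= p_feature
        masked[name] = feat * keep[:, None]
    return masked

# Toy usage with two made-up modalities of 21 nodes and 8 feature channels.
conditions = {"image": np.ones((21, 8)), "joints_2d": np.ones((21, 8))}
masked_conditions = random_mask(conditions, rng=np.random.default_rng(0))
```

Training on such randomly masked conditions is one plausible way for a single model to learn to handle any subset of control signals, including none (unconditional generation).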

What problem does this paper attempt to address?

This paper aims to develop a unified framework that handles multiple hand mesh recovery tasks simultaneously: direct hand mesh generation, inpainting, reconstruction, and fitting. The authors propose a framework named Holistic Hand Mesh Recovery (HHMR), which pursues this goal by enhancing the multimodal controllability of a diffusion model.

### Main Problems

1. **Single Framework for Multiple Tasks**: Existing hand mesh recovery methods usually handle only one specific task, such as reconstruction or generation. The goal of this paper is to build a unified framework that covers multiple hand mesh recovery tasks.
2. **Multimodal Controllability**: Different types of input conditions (such as images and 2D/3D skeletons) need to be processed effectively in the same framework. This requires the model to have strong multimodal controllability, that is, to generate the corresponding hand mesh according to whichever input conditions are given.
3. **High-Quality Generation and Reconstruction**: The generated hand meshes should be diverse while still conforming to biomechanical constraints, and reconstruction tasks should accurately recover hand poses and meshes.

### Solution Overview

To solve the above problems, the authors propose the following key techniques and methods:

1. **Diffusion Framework Based on Graph Convolution and Attention Mechanisms**:
   - A unified diffusion model framework is constructed using a graph convolutional network (GCN) and an attention mechanism.
   - The framework shares one feature space across tasks and strengthens the learned association between modalities through a random masking strategy.
2. **Condition-aligned Gradient Guidance**:
   - A condition-aligned gradient guidance strategy is proposed to improve the consistency between the generated results and the input conditions.
   - The method adds a gradient bias in the reverse diffusion process so that the generated hand mesh better matches the given conditions.
3. **Multimodal Input Processing**:
   - Multiple types of input conditions are supported, such as RGB images and 2D/3D skeletons.
   - Mapping the modalities to a shared feature space and applying random masking improves the model's generalization and its ability to generate diverse results.
4. **Experimental Verification**:
   - Extensive experiments are carried out on multiple downstream tasks, including hand mesh generation, inpainting, reconstruction, and fitting.
   - The results show that the framework outperforms existing methods on different tasks, especially multi-hypothesis reconstruction.

### Formula Representation

The formulas involved in the paper are as follows:

- Forward process of the diffusion model:
  \[ p(x_t | x_{t-1}) = \mathcal{N}\left(\sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \]
  where \(\{\beta_t\}\) is a set of predefined small constants.
- Reverse process:
  \[ p_\theta(x_{t-1} | x_t, c) = \mathcal{N}\left(\mu_{\theta,t}(x_t, t, c),\ \Sigma_t I\right) \]
  where \(c\) represents the generation condition, and \(\mu_{\theta,t}\) and \(\Sigma_t\) are the mean and variance of the reverse Gaussian process, respectively.
- Condition-aligned Gradient Guidance:
  \[ \bar{\mu}_t = \mu_t - s\,\Sigma_t \nabla_{x_t} \left\| P f_\theta(x_t, c, t) - P x_0 \right\| \]
  where \(\mu_t\) is the reverse-process mean above, \(s\) is a scaling factor, and \(P\) is a task-specific operator.

Through these methods and techniques, the HHMR framework can perform multiple hand mesh recovery tasks with a single model and performs well on each of them.
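
To make the forward step and the guided reverse mean from the formulas above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`forward_step`, `numerical_grad`, `guided_mean`), the identity function standing in for the denoiser \(f_\theta\), the toy selection matrix used as \(P\), and the finite-difference gradient (a real system would likely use automatic differentiation) are all hypothetical choices.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def numerical_grad(fn, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return grad

def guided_mean(mu_t, sigma_t, x_t, denoiser, P, target, s=1.0):
    """Return mu_t - s * sigma_t * grad_{x_t} ||P f(x_t) - target||.

    `target` stands in for P x_0, i.e. the observed control signal.
    """
    def alignment_error(x):
        return np.linalg.norm(P @ denoiser(x) - target)
    return mu_t - s * sigma_t * numerical_grad(alignment_error, x_t)

# Toy setup (all values made up): a 3-D "mesh" whose first two
# coordinates are observed through the operator P.
x0 = np.array([1.0, -2.0, 0.5])
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
x_t = x0 + np.array([0.3, -0.4, 0.1])   # a noisy current sample
identity_denoiser = lambda x: x          # placeholder for f_theta(x_t, c, t)

mu_guided = guided_mean(mu_t=x_t, sigma_t=0.1, x_t=x_t,
                        denoiser=identity_denoiser, P=P, target=P @ x0)
error_before = np.linalg.norm(P @ x_t - P @ x0)
error_after = np.linalg.norm(P @ mu_guided - P @ x0)
```

In this toy setup the guidance term moves the mean only along the coordinates that \(P\) observes, so `error_after` is smaller than `error_before` while the unobserved third coordinate is left unchanged.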

HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models
