Abstract: Accurately recovering dense 3D meshes of both hands from monocular images is challenging because of occlusions and projection ambiguity. Most existing methods extract features from color images to estimate root-aligned hand meshes, which neglects the depth and scale information of the real world. Because sensor measurements are noisy and limited in resolution, depth-based methods predict 3D keypoints rather than a dense mesh. These limitations motivate us to combine the two complementary inputs to recover dense hand meshes at real-world scale. In this work, we propose an end-to-end framework for recovering dense meshes of both hands, which takes single-view RGB-D image pairs as input. The primary challenge lies in using the two input modalities effectively to mitigate blurring in the RGB images and noise in the depth images. Instead of treating depth maps as additional channels of the RGB image, we encode the depth information as an unordered point cloud to preserve more geometric detail. Specifically, our framework employs ResNet50 and PointNet++ to extract features from the RGB image and the point cloud, respectively. We further introduce a novel pyramid deep fusion network (PDFNet) that aggregates features at different scales and proves more effective than previous fusion strategies. A GCN-based decoder then processes the fused features to recover the corresponding 3D pose and dense mesh. Comprehensive ablation experiments demonstrate the effectiveness of the proposed fusion algorithm, and our method outperforms state-of-the-art approaches on publicly available datasets. To reproduce the results, we will make our source code and models publicly available at {
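
The abstract states that depth is encoded as an unordered point cloud but does not give the conversion. A minimal sketch of one common way to do this, assuming a pinhole camera model, is shown below. The function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `depth_scale` parameter are illustrative assumptions, not names taken from the paper or its code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    depth_scale: float = 1.0,
) -> List[Point]:
    """Back-project a depth map into an unordered list of 3D points.

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        for u, d in enumerate(row):        # u: pixel column (image x)
            if d <= 0:
                continue                   # missing or invalid measurement
            z = d * depth_scale            # convert raw units to metric depth
            x = (u - cx) * z / fx          # inverse of u = fx * x / z + cx
            y = (v - cy) * z / fy          # inverse of v = fy * y / z + cy
            points.append((x, y, z))
    return points


# Usage: a 2x2 depth map with two valid pixels and toy intrinsics.
toy_depth = [[0.0, 2.0],
             [1.0, 0.0]]
cloud = depth_to_point_cloud(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Because the output is a plain list of points in camera coordinates, it keeps metric scale and carries no pixel-grid ordering, which is the form of input a point-based encoder such as PointNet++ expects.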

Geometric-aware RGB-D Representation Learning for Hand-Object Reconstruction

In-Hand 3D Object Reconstruction from a Monocular RGB Video

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

DexRepNet: Learning Dexterous Robotic Grasping Network with Geometric and Spatial Hand-Object Representations

Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image

Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images

RGB-D Object Recognition Via Incorporating Latent Data Structure and Prior Knowledge

Chunkfusion: A Learning-Based RGB-D 3D Reconstruction Framework Via Chunk-Wise Integration

Robust 3D Reconstruction with an RGB-D Camera

Multiple Feature Fusion Based Hand-held Object Recognition with RGB-D data

Hand-Crafted Features or Machine Learnt Features? Together They Improve RGB-D Object Recognition

3D Hand Pose Estimation and Reconstruction Based on Multi-Feature Fusion

Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition

Attention-Based RGBD Fusenet for Monocular 3D Body Geometry and Pose Reconstruction

HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

An Efficient Color and Geometric Feature Fusion Module for 6D Object Pose Estimation

DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image

Exploiting Enhanced and Robust RGB-D Face Representation Via Progressive Multi-Modal Learning