Abstract: Reconstructing a 3D human mesh from a single image has long been challenging because a single viewpoint is inherently limited. Deep networks can approximate the shape of the unseen side, but recovering its texture details from one image remains difficult. Existing methods use Generative Adversarial Networks (GANs) to predict normal maps of the non-visible side and infer detailed textures and wrinkles on the model's surface from them. However, we find that existing normal prediction networks struggle in complex scenes: they pay too little attention to local features and model spatial relationships insufficiently. To address these challenges, we introduce EMAR (Enhanced Multi-scale Attention-Driven Single-Image 3D Human Reconstruction). The approach incorporates a novel Enhanced Multi-Scale Attention (EMSA) mechanism that captures both intricate features and global relationships in complex scenes. Unlike traditional single-scale attention mechanisms, EMSA adaptively adjusts the weights between features, enabling the network to use information across scales more effectively. We also improve the feature fusion method so that representations from different scales are better integrated, giving the network a more comprehensive understanding of both fine details and global structure in the image. Finally, we design a hybrid loss function tailored to the attention mechanism and feature fusion method, which optimizes training and improves reconstruction quality. Our network significantly improves 3D human model reconstruction performance, and experimental results show that it is more robust to challenging poses than traditional single-scale approaches.

Learning Pose Controllable Human Reconstruction with Dynamic Implicit Fields from a Single Image

Reconstructing 3D human pose and shape from a single image and sparse IMUs

Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images

SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video

3D Human Reconstruction from A Single Depth Image

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

Human as Points: Explicit Point-based 3D Human Reconstruction from Single-view RGB Images

Single-view 3D Body and Cloth Reconstruction under Complex Poses

Subject-Specific Human Modeling for Human Pose Estimation

Robust Estimation of 3D Human Poses from a Single Image

HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

Implicit 3D Human Reconstruction Guided by Parametric Models and Normal Maps

Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation

Implicit 3D Human Mesh Recovery using Consistency with Pose and Shape from Unseen-view

3D Human Pose Estimation with Single Image and Inertial Measurement Unit (IMU) Sequence

Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose

Relative Pose Estimation for RGB-D Human Input Scans Via Implicit Function Reconstruction

Image-Guided Human Reconstruction via Multi-Scale Graph Transformation Networks

Detailed 3D Human Body Reconstruction from Multi-view Images Combining Voxel Super-Resolution and Learned Implicit Representation

Enhanced Multi-Scale Attention-Driven 3D Human Reconstruction from Single Image