Abstract: Reconstructing a 3D human mesh from a single image has long been challenging because a single viewpoint is inherently limited. Deep learning networks can approximate the shape of the unseen side, but recovering the texture details of the non-visible side from one image remains difficult. Existing methods use Generative Adversarial Networks (GANs) to predict the normal maps of the non-visible side and, from them, infer detailed textures and wrinkles on the model's surface. However, we find that existing normal prediction networks struggle with complex scenes: they pay too little attention to local features and model spatial relationships insufficiently. To address these challenges, we introduce EMAR (Enhanced Multi-scale Attention-Driven Single-Image 3D Human Reconstruction). The approach incorporates a novel Enhanced Multi-Scale Attention (EMSA) mechanism that captures both intricate features and global relationships in complex scenes. Unlike traditional single-scale attention mechanisms, EMSA adaptively adjusts the weights between features, which lets the network use information across scales more effectively. We also improve the feature fusion method so that representations from different scales are better integrated, giving the network a more complete view of both fine details and global structure in the image. Finally, we design a hybrid loss function tailored to the attention mechanism and feature fusion method, which improves training and the quality of the reconstruction results. Experimental results show that our network improves 3D human model reconstruction and is more robust to challenging poses than traditional single-scale approaches.
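The abstract does not specify how EMSA is built, so the following is only a toy sketch of the general idea of adaptively weighting features across scales. It assumes a single-channel 2D feature map, average pooling as the multi-scale operator, and a per-pixel softmax over scales; the function names, the `scale_gain` parameter, and the scoring rule are hypothetical stand-ins for whatever the network actually learns.

```python
import math


def avg_pool_upsample(feat, k):
    """Average-pool a 2D map over k x k blocks, then copy each block mean
    back to every pixel of the block (nearest-neighbour upsampling)."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, k):
        for bx in range(0, w, k):
            ys = range(by, min(by + k, h))
            xs = range(bx, min(bx + k, w))
            mean = sum(feat[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention_fuse(feat, scales=(1, 2, 4), scale_gain=None):
    """Fuse several scale views of `feat` with per-pixel softmax weights.

    `scale_gain` is a hypothetical stand-in for learned scoring parameters:
    the score of scale i at a pixel is scale_gain[i] * activation.
    Returns (fused_map, weights), where weights[y][x] has one entry per scale.
    """
    if scale_gain is None:
        scale_gain = [1.0] * len(scales)
    views = [avg_pool_upsample(feat, k) for k in scales]
    h, w = len(feat), len(feat[0])
    fused = [[0.0] * w for _ in range(h)]
    weights = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            scores = [g * v[y][x] for g, v in zip(scale_gain, views)]
            wts = softmax(scores)
            weights[y][x] = wts
            # Convex combination of the scale views at this pixel.
            fused[y][x] = sum(wt * v[y][x] for wt, v in zip(wts, views))
    return fused, weights


if __name__ == "__main__":
    demo = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    fused, weights = multi_scale_attention_fuse(demo)
    print(fused[0])
    print(weights[0][0])
```

Because the weights at each pixel are a softmax, the fused value is always a convex combination of the per-scale values; a single-scale method corresponds to fixing all of the weight on one scale.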

Enhanced Multi-Scale Attention-Driven 3D Human Reconstruction from Single Image