Abstract: Reconstructing a 3D human mesh from a single image has long been challenging because a single viewpoint is inherently limited. Deep networks can approximate the shape of the unseen side, but recovering its texture detail from one image remains difficult. Prior methods use Generative Adversarial Networks (GANs) to predict normal maps of the non-visible side and infer detailed textures and wrinkles on the model's surface from them. However, we find that existing normal prediction networks struggle in complex scenes: they pay too little attention to local features and model spatial relationships insufficiently.

To address these challenges, we introduce EMAR (Enhanced Multi-scale Attention-Driven Single-Image 3D Human Reconstruction). The approach incorporates a novel Enhanced Multi-Scale Attention (EMSA) mechanism that captures both intricate features and global relationships in complex scenes. Unlike traditional single-scale attention mechanisms, EMSA adaptively adjusts the weights between features, so the network can use information across scales more effectively. We also improve the feature fusion method to better integrate representations from different scales, which helps the network capture both fine details and global structure in the image. Finally, we design a hybrid loss function tailored to the attention mechanism and feature fusion method, which improves training and the quality of the reconstruction results. Experimental results show that our network improves 3D human model reconstruction and is more robust to challenging poses than traditional single-scale approaches.
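The abstract does not say how EMSA computes its weights or how the improved fusion is built. To illustrate only the general idea of adaptively weighting features across scales, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that coarser feature maps are upsampled by an integer factor with nearest-neighbour interpolation, and that each scale has a hypothetical learned projection vector (`score_vecs`) producing one attention logit per pixel. A softmax over scales turns the logits into weights, and the fused map is the weighted sum. All function and parameter names are illustrative.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_attention_fusion(feats, score_vecs):
    """Fuse feature maps from several scales with per-pixel attention weights.

    feats:      list of (C, H_s, W_s) arrays; feats[0] is the finest scale, and
                every coarser map is smaller by the same integer factor in H and W.
    score_vecs: list of (C,) arrays, hypothetical learned projections that give
                one attention logit per pixel for each scale.
    Returns (fused, weights) with shapes (C, H, W) and (S, H, W).
    """
    _, H, _ = feats[0].shape
    # Bring every scale to the finest resolution: (S, C, H, W).
    aligned = np.stack([upsample_nearest(f, H // f.shape[1]) for f in feats])
    # One logit per scale and pixel: dot product of features with the scale's vector.
    logits = np.einsum("schw,sc->shw", aligned, np.stack(score_vecs))
    # Normalise across scales so the weights at each pixel sum to 1.
    weights = softmax(logits, axis=0)
    # Weighted sum over scales, broadcasting the weights over channels.
    fused = (weights[:, None] * aligned).sum(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
C = 8
feats = [rng.standard_normal((C, 16 // 2**s, 16 // 2**s)) for s in range(3)]
score_vecs = [rng.standard_normal(C) for _ in range(3)]
fused, weights = multi_scale_attention_fusion(feats, score_vecs)
print(fused.shape, weights.shape)  # (8, 16, 16) (3, 16, 16)
```

A single-scale mechanism corresponds to fixing one weight at 1 and the rest at 0. The softmax lets each pixel choose its own mix of fine and coarse features, which is the kind of adaptivity the abstract attributes to EMSA.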
