Abstract: Multi-person pose estimation has made great progress in recent years, yet precise prediction of occluded and invisible hard keypoints remains challenging. Most human pose estimation networks combine a pose encoder borrowed from image classification for feature extraction with a handcrafted pose decoder for high-resolution representations. However, the pose encoder may be sub-optimal because of the gap between image classification and pose estimation, and the multi-scale feature fusion widely used in pose decoders is still coarse and cannot provide sufficient high-resolution detail for hard keypoints. Neural Architecture Search (NAS) has shown great potential for automatically searching efficient networks in many visual tasks. In this work, we present Pose-native Network Architecture Search (PoseNAS), which simultaneously designs a better pose encoder and pose decoder for pose estimation. Specifically, we directly search a data-oriented pose encoder built from stacked searchable cells, which yields a feature extractor tailored to the pose-specific task. In the pose decoder, we use scale-adaptive fusion cells to promote rich information exchange across the multi-scale feature maps. The pose decoder also adopts a Fusion-and-Enhancement scheme to progressively strengthen the high-resolution representations that are essential for the precise prediction of hard keypoints. With a carefully designed search space and search strategy, PoseNAS can search all modules simultaneously in an end-to-end manner. PoseNAS achieves state-of-the-art performance on three public datasets, MPII, COCO, and PoseTrack, with fewer parameters than existing methods. Our best model obtains 76.7% mAP on the COCO validation set and 75.9% mAP on the COCO test set with only 33.6M parameters. Code and implementation are available at https://github.com/for-code0216/PoseNAS.
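
The abstract does not describe the internals of the scale-adaptive fusion cells, so the following is only an illustrative sketch of the general idea of multi-scale fusion with learnable weights, not the PoseNAS implementation. It assumes single-channel square feature maps, nearest-neighbour upsampling, and a softmax over one scalar logit per scale; the function names `softmax`, `upsample_nearest`, and `fuse_to_highest_resolution` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_to_highest_resolution(fmaps, logits):
    """Weighted sum of feature maps at the resolution of the first map.

    Assumption: fmaps are square 2D lists ordered from high to low
    resolution, and each side length divides the first map's side length.
    """
    weights = softmax(logits)
    target = len(fmaps[0])
    fused = [[0.0] * target for _ in range(target)]
    for fmap, w in zip(fmaps, weights):
        resized = upsample_nearest(fmap, target // len(fmap))
        for i in range(target):
            for j in range(target):
                fused[i][j] += w * resized[i][j]
    return fused


# Toy example: a 4x4 high-resolution map and a 2x2 low-resolution map.
high = [[1.0] * 4 for _ in range(4)]
low = [[3.0] * 2 for _ in range(2)]
fused = fuse_to_highest_resolution([high, low], logits=[0.0, 0.0])
```

With equal logits both scales contribute equally; in a trained network the logits would be learned so that each output resolution weights its input scales differently.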

FaSRnet: a feature and semantics refinement network for human pose estimation

Adaptively Fusing Complete Multi-resolution Features for Human Pose Estimation

Learning to Refine Human Pose Estimation

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

Deep Dual Consecutive Network for Human Pose Estimation

PoseRN: A 2D pose refinement network for bias-free multi-view 3D human pose estimation

Temporal Feature Enhancing Network for Human Pose Estimation in Videos

RFFCE: Residual Feature Fusion and Confidence Evaluation Network for 6DoF Pose Estimation

SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement

SCRN: Stepwise Change and Refine Network Based Semantic Distribution for Human Pose Transfer

Diffusion Based Coarse-to-Fine Network for 3D Human Pose and Shape Estimation from Monocular Video

Full-Resolution Encoder-Decoder Networks with Multi-Scale Feature Fusion for Human Pose Estimation

Human Pose Estimation Using Exemplars and Part Based Refinement

Pose-native Network Architecture Search for Multi-person Human Pose Estimation

Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos

Deep High-Resolution Representation Learning For Human Pose Estimation

Improving Human Pose Estimation Based on Stacked Hourglass Network

Multi-Level Network for High-Speed Multi-Person Pose Estimation

Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

Human Pose Estimation Based on Lightweight Multi-Scale Coordinate Attention

Joint Multi-Person Pose Estimation and Semantic Part Segmentation