Abstract: Pose estimation in crowded scenes is key to understanding human behavior in real-life applications. Most existing CNN-based pose estimation methods depend on the appearance of visible parts as cues to localize human joints. However, occlusion is common in crowded scenes, and invisible body parts offer no valid features for joint localization. Introducing prior information about the human pose structure to infer the locations of occluded parts is a natural solution to this problem. In this paper, we argue that learning structural information from human joints alone is not enough to handle human body variations and can be prone to overfitting. Viewing the human pose as a dual representation of joints and limbs, we propose a pose refinement network, called the dual graph network (DGN), that jointly learns the structural information of body joints and limbs by incorporating cooperative constraints between two branches. Specifically, DGN has two coupled graph convolutional network (GCN) branches that model the structural information of joints and limbs. Each stage in a branch is composed of a feature aggregator for inter-branch information fusion and a GCN module for intra-branch context extraction. In addition, to enhance the modeling capacity of the GCN, we design an adaptive GCN layer (AGL), embedded in the GCN module, that handles each pose instance based on its graph structure. We also propose heatmap-guided sampling, which leverages the features of the body parts to provide rich visual features for inferring occluded parts. We perform extensive experiments on five challenging datasets to demonstrate the effectiveness of DGN for pose estimation. DGN achieves a significant improvement from 67.9 to 72.4 mAP on the CrowdPose dataset with the same CNN-based pose estimator and training strategy as OPEC-Net. This shows that, compared with OPEC-Net, which considers only joints, DGN has a clear advantage from modeling joints and limbs together. DGN also helps pose estimation on general datasets (i.e., COCO and PoseTrack) with less occlusion and mutual interference, demonstrating its generalization ability in refining human poses.
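
To make the dual joint/limb representation concrete, the following is a minimal sketch and not the DGN implementation. It assumes a hypothetical five-joint skeleton, assumes the limb graph is the line graph of the joint graph (two limbs are neighbors when they share a joint), and uses a plain row-normalized graph convolution. The abstract does not specify any of these details, and all names (`JOINTS`, `LIMBS`, `joint_adjacency`, `limb_adjacency`, `graph_conv`) are illustrative.

```python
# Hypothetical toy skeleton (an assumption, not the paper's keypoint set).
JOINTS = ["head", "neck", "l_shoulder", "r_shoulder", "l_elbow"]
LIMBS = [(0, 1), (1, 2), (1, 3), (2, 4)]  # each limb connects two joint indices


def joint_adjacency(num_joints, limbs):
    """Joint graph with self-loops: joints are nodes, limbs are edges."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in limbs:
        adj[a][b] = adj[b][a] = 1.0
    return adj


def limb_adjacency(limbs):
    """Limb graph with self-loops: limbs are nodes, linked when they share a joint."""
    n = len(limbs)
    return [[1.0 if i == j or set(limbs[i]) & set(limbs[j]) else 0.0
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def graph_conv(adj, feats, weight):
    """One generic GCN-style layer: relu(row_normalize(adj) @ feats @ weight)."""
    norm = [[v / sum(row) for v in row] for row in adj]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Joint features: toy 2D coordinates. Limb features: midpoints of their two joints.
joint_feats = [[0.0, 2.0], [0.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [-1.5, 0.0]]
limb_feats = [[(joint_feats[a][0] + joint_feats[b][0]) / 2,
               (joint_feats[a][1] + joint_feats[b][1]) / 2] for a, b in LIMBS]

identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix
joint_out = graph_conv(joint_adjacency(len(JOINTS), LIMBS), joint_feats, identity)
limb_out = graph_conv(limb_adjacency(LIMBS), limb_feats, identity)
```

In this sketch the two branches run independently. The feature aggregator described in the abstract would additionally exchange information between `joint_out` and `limb_out` at each stage, which is not modeled here.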

HRNeXt: High-Resolution Context Network for Crowd Pose Estimation

Context-Guided Adaptive Network for Efficient Human Pose Estimation

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

FDN: Feature Decoupling Network for Head Pose Estimation

Adaptively Fusing Complete Multi-resolution Features for Human Pose Estimation

Lightweight high-resolution network based on adaptive cross-dimensional weighting for human pose estimation

Dite-HRNet: Dynamic Lightweight High-Resolution Network for Human Pose Estimation

Implicit Decouple Network for Efficient Pose Estimation

CONet: Crowd and occlusion-aware network for occluded human pose estimation

A-HRNet: Attention Based High Resolution Network for Human Pose Estimation

Adaptive Hypergraph Neural Network for Multi-Person Pose Estimation

Multi-Context Attention for Human Pose Estimation

Dual Graph Networks for Pose Estimation in Crowded Scenes

DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Improving Human Pose Estimation Based on Stacked Hourglass Network

Ghost attentional down net: An effective lightweight top-down network for human pose estimation

Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet)

Efficient Human Pose Estimation in Hierarchical Context

CFRLA-Net: A Context-aware Feature Representation Learning Anchor-free Network for Pedestrian Detection

Multi-Stage HRNet: Multiple Stage High-Resolution Network for Human Pose Estimation

Human Pose Estimation Based on Lightweight Multi-Scale Coordinate Attention