Learning Positional Priors for Pretraining 2D Pose Estimators.
Kun Zhang, Ping Yao, Rui Wu, Chuanguang Yang, Ding Li, Min Du, Kai Deng, Renbiao Liu, Tianyao Zheng
DOI: https://doi.org/10.1145/3475723.3484252
2021-01-01
Abstract: The goal of 2D human pose estimation is to locate the keypoints of body parts in 2D images. State-of-the-art pose estimation methods usually construct pixel-wise heatmaps from keypoints as labels for training neural networks, whose backbones are initialized either randomly or from classification models trained on large datasets such as ImageNet. Statistics show that human keypoints have strong positional priors, which depend heavily on the relationships between image patches. To learn these positional priors for pretraining pose estimators, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as a self-supervised pretext task, whose goal is to predict the location of each patch in an image composed of shuffled patches. During pretraining, we use only the person images in MS-COCO, rather than introducing an extra large dataset such as ImageNet. We design a heatmap-style label for patch location, and the learning process is non-contrastive. The weights learned by the HSJP pretext task are used to initialize the backbones of 2D human pose estimators, which are then finetuned on the MS-COCO human keypoints dataset. Using two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate mAP on both the MS-COCO validation and test-dev sets. Our experiments show that downstream pose estimators with our self-supervised pretraining perform much better than those trained from scratch, and are comparable to those that use ImageNet classification models as their initial backbones.
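The pretext task described above can be illustrated with a minimal sketch of how one training sample might be built. The abstract does not specify the grid size, the heatmap resolution, or the shape of the heatmap-style label, so everything here is an assumption made for illustration: the function name `make_hsjp_sample`, the 3x3 grid, the grid-resolution heatmaps, and the Gaussian label with width `sigma` are hypothetical and are not taken from the paper.

```python
import numpy as np


def make_hsjp_sample(image, grid=3, sigma=0.5, rng=None):
    """Shuffle an image's patches and build one location heatmap per patch.

    Assumptions (not from the paper): a grid x grid layout, heatmaps at
    grid resolution, and a Gaussian peak at each patch's original cell.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    # Crop so the image divides evenly into patches.
    image = image[:ph * grid, :pw * grid]
    n = grid * grid

    # perm[slot] is the original index of the patch placed at `slot`.
    perm = rng.permutation(n)
    shuffled = np.empty_like(image)
    heatmaps = np.zeros((n, grid, grid), dtype=np.float32)
    ys, xs = np.mgrid[0:grid, 0:grid]

    for slot, src in enumerate(perm):
        dst_r, dst_c = divmod(slot, grid)
        src_r, src_c = divmod(int(src), grid)
        # Copy the source patch into its shuffled position.
        shuffled[dst_r * ph:(dst_r + 1) * ph, dst_c * pw:(dst_c + 1) * pw] = \
            image[src_r * ph:(src_r + 1) * ph, src_c * pw:(src_c + 1) * pw]
        # Heatmap-style label: Gaussian centred on the original cell.
        dist2 = (ys - src_r) ** 2 + (xs - src_c) ** 2
        heatmaps[slot] = np.exp(-dist2 / (2.0 * sigma ** 2))

    return shuffled, heatmaps, perm
```

In this sketch a network would take `shuffled` as input and regress `heatmaps`, for example with a pixel-wise mean squared error as is common for keypoint heatmaps. That choice of loss is also an assumption rather than a detail stated in the abstract.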