Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

Li Yuan, Shuning Chang, Xuecheng Nie, Ziyuan Huang, Yichen Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan
DOI: https://doi.org/10.1145/3394171.3416299
2020-10-21
Abstract: Video-based human pose estimation in crowded scenes is a challenging problem due to occlusion, motion blur, scale variation, viewpoint change, and other factors. Prior approaches often fail to deal with this problem because they (1) make little use of temporal information and (2) lack training data from crowded scenes. In this paper, we focus on improving human pose estimation in videos of crowded scenes by exploiting temporal context and collecting new data. In particular, we first follow the top-down strategy to detect persons and perform single-person pose estimation for each frame. Then, we refine the frame-based pose estimation with temporal context derived from optical flow. Specifically, for each frame, we propagate the historical poses forward from the previous frames and the future poses backward from the subsequent frames to the current frame, leading to stable and accurate human pose estimation in videos. In addition, we mine new data from the Internet showing scenes similar to those in the HIE dataset to improve the diversity of the training set. In this way, our model achieves the best performance on 7 out of 13 videos and an average w_AP of 56.33 on the test dataset of the HIE challenge.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses accurate human pose estimation in videos of crowded scenes. Occlusion, motion blur, scale variation, and viewpoint change make pose estimation in such videos very challenging. Previous methods struggle with this problem for two reasons: (1) they make little use of temporal information, and (2) they lack training data from crowded scenes. The paper therefore aims to improve human pose estimation in crowded-scene videos by exploiting temporal context and collecting new data.
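The temporal refinement described in the abstract (propagating poses from neighboring frames to the current frame along optical flow, then combining them with the per-frame estimate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `propagate_pose` and `refine_pose`, the nearest-pixel flow lookup, and the fixed fusion weights are all assumptions made for illustration, and the flow fields are assumed to be precomputed by some optical-flow estimator.

```python
import numpy as np


def propagate_pose(keypoints, flow):
    """Shift (K, 2) keypoints in (x, y) order along a dense flow field.

    `flow` has shape (H, W, 2) and stores the (dx, dy) displacement of each
    pixel from the source frame to the current frame. The flow is sampled at
    the nearest pixel to each keypoint (an assumption made for simplicity).
    """
    h, w = flow.shape[:2]
    xs = np.clip(np.rint(keypoints[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(keypoints[:, 1]).astype(int), 0, h - 1)
    return keypoints + flow[ys, xs]


def refine_pose(cur, prev, nxt, flow_prev_to_cur, flow_next_to_cur,
                weights=(0.5, 0.25, 0.25)):
    """Fuse the current pose with poses propagated from adjacent frames.

    The weighted average and its default weights are hypothetical; the
    abstract does not specify how the propagated poses are combined.
    """
    from_prev = propagate_pose(prev, flow_prev_to_cur)  # forward from history
    from_next = propagate_pose(nxt, flow_next_to_cur)   # backward from future
    w_cur, w_prev, w_next = weights
    return w_cur * cur + w_prev * from_prev + w_next * from_next


# Toy example: one keypoint moving 1 px to the right per frame.
flow_fwd = np.zeros((32, 32, 2))
flow_fwd[..., 0] = 1.0   # previous frame -> current frame
flow_bwd = np.zeros((32, 32, 2))
flow_bwd[..., 0] = -1.0  # next frame -> current frame

prev_pose = np.array([[10.0, 10.0]])
cur_pose = np.array([[11.4, 10.0]])  # noisy per-frame estimate
next_pose = np.array([[12.0, 10.0]])

refined = refine_pose(cur_pose, prev_pose, next_pose, flow_fwd, flow_bwd)
```

In this toy case both neighboring poses map to x = 11 in the current frame, so the fused estimate is pulled from the noisy 11.4 toward 11, which is the stabilizing effect the abstract attributes to temporal context.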