Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

Miroslav Purkrabek, Jiri Matas
2024-12-02
Abstract: Human pose estimation methods work well on separated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation with detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlooked segmentation masks and their connection to estimated keypoints. We condition pose estimation model by segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), integrating detection, segmentation and pose estimation into self-improving feedback loop. We adapt detector and pose estimation model for conditioning by instance masks and use Segment Anything as pose-to-mask model to close the circle. With only small models, BMP is superior to top-down methods on OCHuman dataset and to detector-free methods on COCO dataset, combining the best from both approaches and matching state of art performance in both settings. Code is available at https://mirapurkrabek.github.io/BBox-Mask-Pose.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of multi-body human pose estimation, in particular the detection, segmentation and pose estimation errors caused by mutual occlusion among people in crowded scenes. Specifically:

1. **Pose estimation problems in multi-body scenes**:
   - Existing pose estimation methods perform well on a single, well-separated person but degrade in multi-body scenes such as dense crowds. The main cause is occlusion between people, which leads to merged bounding boxes or collapsed poses.
   - Results on multi-body datasets are far from saturated: state-of-the-art models score below 50% on datasets such as OCHuman.
2. **Limitations of existing methods**:
   - **Top-down methods**: They rely on bounding boxes provided by a detector to estimate poses, but the detector may miss or mis-detect instances, especially in dense scenes.
   - **Detector-free methods**: They generate poses directly from images without relying on bounding boxes. They perform better in crowded scenes but fall behind top-down methods on datasets such as COCO.
3. **Combining bounding boxes, segmentation masks and pose estimation**:
   - Previous attempts improved pose estimation by conditioning on bounding boxes or bottom-up-estimated poses, but ignored segmentation masks and their relationship to keypoints.
   - The paper proposes a new method, BBox-Mask-Pose (BMP), which integrates detection, segmentation and pose estimation into a self-improving feedback loop to improve performance in multi-body scenes.

### Main contributions of the BMP method

1. **Enhanced detector**:
   - A detector adapted for conditioning on instance masks, so that it can ignore already-processed instances and detect previously missed instances in later iterations.
2. **MaskPose model**:
   - MaskPose, a pose estimation model conditioned on segmentation masks rather than bounding boxes, which improves instance separation and robustness in dense scenes.
3. **BMP framework**:
   - A closed-loop system that combines bounding boxes, segmentation masks and pose estimation, with Segment Anything serving as the pose-to-mask model that closes the circle. Iteratively refining the output of each component yields more consistent results and better performance, especially in multi-body scenes.

Through these improvements, and using only small models, BMP outperforms top-down methods on the OCHuman dataset and detector-free methods on the COCO dataset, combining the advantages of both approaches and matching state-of-the-art performance in both settings.
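The detect, estimate-pose, pose-to-mask feedback loop can be illustrated with a small sketch. This is not the authors' implementation: the function `bmp_loop`, its parameters, the set-of-pixels mask representation, and the three `toy_*` stand-ins are all hypothetical names introduced here, and hiding processed pixels by zeroing them is an assumption about how "ignoring processed instances" might be realized. In the actual method the three callables would be a mask-conditioned detector, a mask-conditioned pose model, and Segment Anything prompted with keypoints.

```python
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]          # (x, y)
Mask = Set[Pixel]                # hypothetical mask representation: a set of pixels
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive
Image = List[List[int]]          # toy grayscale image, rows of ints
Keypoints = List[Tuple[float, float]]


def bmp_loop(
    image: Image,
    detect: Callable[[Image], List[Tuple[Box, Mask]]],
    estimate_pose: Callable[[Image, Mask], Keypoints],
    pose_to_mask: Callable[[Image, Keypoints], Mask],
    max_iters: int = 3,
) -> List[Dict]:
    """Iterate detection -> pose -> mask, hiding processed instances each round."""
    processed: Mask = set()
    instances: List[Dict] = []
    for _ in range(max_iters):
        # Assumption: processed instances are hidden by zeroing their pixels.
        visible = [
            [0 if (x, y) in processed else v for x, v in enumerate(row)]
            for y, row in enumerate(image)
        ]
        detections = detect(visible)
        if not detections:
            break  # nothing new was found, so the loop has converged
        for box, mask in detections:
            keypoints = estimate_pose(image, mask)    # pose conditioned on the mask
            refined = pose_to_mask(image, keypoints)  # pose closes the circle
            instances.append({"box": box, "keypoints": keypoints, "mask": refined})
            processed |= refined
    return instances


# --- Toy stand-ins (hypothetical), only to make the sketch runnable ---------
def pixels_with_value(image: Image, value: int) -> Mask:
    return {(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v == value}


def toy_detect(visible: Image) -> List[Tuple[Box, Mask]]:
    """Finds only the brightest remaining blob, mimicking a detector that misses people."""
    top = max(max(row) for row in visible)
    if top == 0:
        return []
    mask = pixels_with_value(visible, top)
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return [((min(xs), min(ys), max(xs), max(ys)), mask)]


def toy_pose(image: Image, mask: Mask) -> Keypoints:
    """A single 'keypoint': the centroid of the mask."""
    n = len(mask)
    return [(sum(x for x, _ in mask) / n, sum(y for _, y in mask) / n)]


def toy_pose_to_mask(image: Image, keypoints: Keypoints) -> Mask:
    """Returns all pixels sharing the value found under the first keypoint."""
    x, y = keypoints[0]
    return pixels_with_value(image, image[int(y)][int(x)])


# Two non-overlapping "bodies" with pixel values 2 and 1 on an 8x8 canvas.
toy_image: Image = [[0] * 8 for _ in range(8)]
for y in range(1, 4):
    for x in range(1, 4):
        toy_image[y][x] = 2
for y in range(4, 7):
    for x in range(3, 7):
        toy_image[y][x] = 1

instances = bmp_loop(toy_image, toy_detect, toy_pose, toy_pose_to_mask)
```

In this toy run the detector sees only one blob per pass, so the second instance is recovered only after the first has been masked out. That is the behaviour the loop is meant to illustrate: each pass can surface instances that earlier passes missed.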