Abstract: Human pose estimation methods work well on separated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation on detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlook segmentation masks and their connection to estimated keypoints. We condition the pose estimation model on segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), which integrates detection, segmentation and pose estimation into a self-improving feedback loop. We adapt the detector and the pose estimation model for conditioning on instance masks and use Segment Anything as a pose-to-mask model to close the circle. With only small models, BMP is superior to top-down methods on the OCHuman dataset and to detector-free methods on the COCO dataset, combining the best of both approaches and matching state-of-the-art performance in both settings. Code is available at https://mirapurkrabek.github.io/BBox-Mask-Pose.

What problem does this paper attempt to address?

This paper addresses multi-body human pose estimation, in particular the detection, segmentation and pose estimation errors caused by mutual occlusion in crowded scenes. Specifically:

1. **Pose estimation in multi-body scenes**:
   - Existing pose estimation methods perform well on isolated people but degrade in multi-body scenes such as dense crowds. Occlusion between people causes bounding boxes to merge or poses to collapse.
   - Results on multi-body datasets are far from saturated: state-of-the-art models score below 50% on datasets such as OCHuman.
2. **Limitations of existing methods**:
   - **Top-down methods**: They estimate poses inside bounding boxes provided by a detector, but the detector may miss or mis-detect instances, especially in dense scenes.
   - **Detector-free methods**: They predict poses directly from the image without bounding boxes. They perform better in crowded scenes but fall behind top-down methods on datasets such as COCO.
3. **Combining bounding boxes, segmentation masks and pose estimation**:
   - Previous work improved pose estimation by conditioning on bounding boxes or bottom-up-estimated poses, but ignored segmentation masks and their relationship to keypoints.
   - The paper proposes BBox-Mask-Pose (BMP), which integrates detection, segmentation and pose estimation into a self-improving feedback loop to improve performance in multi-body scenes.

### Main contributions of the BMP method

1. **Enhanced detector**:
   - A detector adapted to ignore already processed instances, so that later iterations can detect instances that were missed earlier.
2. **MaskPose model**:
   - MaskPose, a pose estimation model conditioned on segmentation masks rather than bounding boxes, which improves robustness in dense scenes.
3. **BMP framework**:
   - A closed-loop system that combines bounding boxes, segmentation masks and pose estimation. Each component's output is iteratively refined by the others, which yields more consistent results and better performance, especially in multi-body scenes.

With these improvements, BMP outperforms top-down methods on the OCHuman dataset and detector-free methods on the COCO dataset, combining the advantages of both approaches and matching state-of-the-art performance in both settings.
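The feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `bmp_loop`, `detect`, `estimate_pose`, `pose_to_mask` and `Instance` are hypothetical, the three models are passed in as plain callables, and "ignoring processed instances" is assumed here to mean zeroing their pixels before the detector runs again.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates


@dataclass
class Instance:
    bbox: BBox
    mask: np.ndarray       # boolean H x W instance mask
    keypoints: np.ndarray  # K x 2 array of (x, y) keypoints


def mask_to_bbox(mask: np.ndarray) -> BBox:
    """Tight box around the True pixels of a non-empty mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def bmp_loop(
    image: np.ndarray,
    detect: Callable[[np.ndarray], List[Tuple[BBox, np.ndarray]]],
    estimate_pose: Callable[[np.ndarray, np.ndarray], np.ndarray],
    pose_to_mask: Callable[[np.ndarray, np.ndarray], np.ndarray],
    max_iters: int = 2,
) -> List[Instance]:
    """Run detection -> mask-conditioned pose -> pose-to-mask, repeatedly.

    All three callables are placeholders (assumptions) for a detector,
    a mask-conditioned pose model and a pose-to-mask model.
    """
    instances: List[Instance] = []
    remaining = image.copy()  # the detector only sees unprocessed pixels
    for _ in range(max_iters):
        detections = detect(remaining)
        if not detections:
            break  # nothing new was found, so the loop has converged
        for _bbox, coarse_mask in detections:
            # The pose model is conditioned on the mask, not on the box.
            keypoints = estimate_pose(image, coarse_mask)
            # Keypoints act as prompts that produce a refined mask.
            mask = pose_to_mask(image, keypoints)
            if not mask.any():
                continue  # skip empty refinements
            instances.append(Instance(mask_to_bbox(mask), mask, keypoints))
            # Assumed mechanism: hide the processed instance so the next
            # detector pass can pick up people it previously missed.
            remaining[mask] = 0
    return instances
```

Passing the models in as callables keeps the sketch independent of any specific detector, pose network or segmentation library; in a real pipeline each callable would wrap the corresponding model's inference call.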

Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
