Referring Human Pose and Mask Estimation in the Wild

Bo Miao, Mingtao Feng, Zijie Wu, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
2024-10-28
Abstract: We introduce Referring Human Pose and Mask Estimation (R-HPM) in the wild, where either a text or positional prompt specifies the person of interest in an image. This new task holds significant potential for human-centric applications such as assistive robotics and sports analysis. In contrast to previous works, R-HPM (i) ensures high-quality, identity-aware results corresponding to the referred person, and (ii) simultaneously predicts human pose and mask for a comprehensive representation. To achieve this, we introduce a large-scale dataset named RefHuman, which substantially extends the MS COCO dataset with additional text and positional prompt annotations. RefHuman includes over 50,000 annotated instances in the wild, each equipped with keypoint, mask, and prompt annotations. To enable prompt-conditioned estimation, we propose the first end-to-end promptable approach named UniPHD for R-HPM. UniPHD extracts multimodal representations and employs a proposed pose-centric hierarchical decoder to process (text or positional) instance queries and keypoint queries, producing results specific to the referred person. Extensive experiments demonstrate that UniPHD produces quality results based on user-friendly prompts and achieves top-tier performance on RefHuman val and MS COCO val2017. Data and Code: https://github.com/bo-miao/RefHuman
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses how to accurately predict the human pose and mask of a specified individual in complex, unconstrained environments, given a user-friendly text or positional prompt. Specifically, it proposes a new task, **Referring Human Pose and Mask Estimation (R-HPM)**, which aims to overcome two limitations of existing multi-person pose estimation methods:

1. **Identity-aware results**: The results correspond to the specified individual.
2. **Simultaneous prediction of pose and mask**: The output is a more comprehensive human representation.

### Background and Problem Description

Existing multi-person pose estimation methods usually fall into two categories:

- **Top-down methods**: First detect human bounding boxes, then run pose estimation on each cropped single-person image. The extra detection step, region-level operations, and independently trained models make this approach computationally expensive and not end-to-end.
- **Bottom-up methods**: First predict a large number of instance-agnostic keypoints, then group them into individual poses with heuristic algorithms. These grouping algorithms introduce hand-designed parameters and struggle with complex scenes such as occlusion.

In addition, deploying these methods requires a separate strategy for selecting the best-matching target individual, which can lead to suboptimal results or false negatives. More importantly, they do not support human-AI interaction: they cannot directly predict the desired result from a natural prompt, and they ignore joint human pose and mask estimation, which matters in applications such as assistive robotics and sports analysis.

### Solutions in the Paper

To address these problems, the paper proposes the following:

1. **New task R-HPM**: Given a user-friendly text or positional prompt, simultaneously predict the pose and mask of the specified individual, yielding an identity-aware and comprehensive human representation.
2. **RefHuman dataset**: Extends the MS COCO dataset and contains more than 50,000 annotated instances, each equipped with keypoint, mask, and prompt annotations.
3. **UniPHD model**: An end-to-end prompt-conditioned model that handles text or positional prompts and produces results for the specified individual using multimodal representations and the proposed pose-centric hierarchical decoder.

### Main Contributions

- Proposes the new task of Referring Human Pose and Mask Estimation (R-HPM), which makes the model identity-aware and provides a comprehensive human representation, benefiting human-AI interaction.
- Introduces the RefHuman dataset, which substantially extends MS COCO with pose and mask annotations in diverse, unconstrained environments, together with corresponding text and positional prompts.
- Proposes UniPHD, an end-to-end prompt-conditioned model that achieves top-tier performance on RefHuman val and MS COCO val2017 and establishes a benchmark for future research.

Together, these contributions address the limitations of existing multi-person pose estimation methods and advance the field of human pose estimation.
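
To make the deployment issue from the background discussion concrete, here is a minimal sketch of the kind of post-hoc selection step a conventional multi-person estimator might need when a user supplies a positional prompt. It is not UniPHD and is not taken from the paper's code; `PersonPrediction`, `select_by_point`, and the box-containment heuristic are hypothetical names and assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class PersonPrediction:
    """One person returned by a generic multi-person estimator (hypothetical container)."""
    keypoints: List[Point]  # (x, y) pixel coordinates


def bounding_box(keypoints: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x0, y0, x1, y1) enclosing all keypoints."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return min(xs), min(ys), max(xs), max(ys)


def select_by_point(people: List[PersonPrediction], prompt: Point) -> Optional[int]:
    """Return the index of the person whose keypoint box contains the prompt.

    If several boxes contain the prompt, the one whose centre is closest wins.
    Returns None when no box contains the prompt.
    """
    px, py = prompt
    best_index: Optional[int] = None
    best_dist = float("inf")
    for index, person in enumerate(people):
        x0, y0, x1, y1 = bounding_box(person.keypoints)
        if not (x0 <= px <= x1 and y0 <= py <= y1):
            continue  # prompt lies outside this person's box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        dist = (px - cx) ** 2 + (py - cy) ** 2
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Two made-up people, each reduced to two keypoints for brevity.
people = [
    PersonPrediction([(10, 10), (30, 60)]),
    PersonPrediction([(100, 20), (140, 90)]),
]
```

In this sketch, a prompt that falls outside every keypoint box makes `select_by_point` return `None`, which is one way the false negatives mentioned above can arise when selection is bolted on after estimation rather than conditioning the prediction on the prompt.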