Abstract: We introduce Referring Human Pose and Mask Estimation (R-HPM) in the wild, where either a text or positional prompt specifies the person of interest in an image. This new task holds significant potential for human-centric applications such as assistive robotics and sports analysis. In contrast to previous works, R-HPM (i) ensures high-quality, identity-aware results corresponding to the referred person, and (ii) simultaneously predicts human pose and mask for a comprehensive representation. To achieve this, we introduce a large-scale dataset named RefHuman, which substantially extends the MS COCO dataset with additional text and positional prompt annotations. RefHuman includes over 50,000 annotated instances in the wild, each equipped with keypoint, mask, and prompt annotations. To enable prompt-conditioned estimation, we propose the first end-to-end promptable approach named UniPHD for R-HPM. UniPHD extracts multimodal representations and employs a proposed pose-centric hierarchical decoder to process (text or positional) instance queries and keypoint queries, producing results specific to the referred person. Extensive experiments demonstrate that UniPHD produces quality results based on user-friendly prompts and achieves top-tier performance on RefHuman val and MS COCO val2017. Data and Code: https://github.com/bo-miao/RefHuman

What problem does this paper attempt to address?

This paper addresses the problem of accurately predicting the pose and mask of a specified person in complex, unconstrained environments, given a user-friendly text or positional prompt. It proposes a new task, **Referring Human Pose and Mask Estimation (R-HPM)**, which aims to overcome two limitations of existing multi-person pose estimation methods:

1. **Identity-aware results**: Ensure that the results correspond to the specified individual.
2. **Simultaneous prediction of pose and mask**: Provide a more comprehensive human representation.

### Background and Problem Description

Existing multi-person pose estimation methods usually fall into two categories:

- **Top-down methods**: First detect human bounding boxes, then estimate the pose on each cropped single-person image. The extra detection step, region-level operations, and independently trained models make this approach computationally expensive and not end-to-end.
- **Bottom-up methods**: First predict a large number of instance-agnostic keypoints, then group them into individual poses with heuristic algorithms. These grouping algorithms introduce hand-designed parameters and struggle with complex scenes such as occlusion.

In addition, deploying either kind of method requires a separate strategy for selecting the best-matching target individual, which may lead to sub-optimal results or false negatives. More importantly, these methods lack support for human-AI interaction: they cannot directly predict the desired result from a natural prompt, and they ignore joint estimation of human pose and mask, which matters in applications such as assistive robotics and sports analysis.

### Solutions in the Paper

To address these problems, the paper proposes the following:

1. **New task R-HPM**: Given a user-friendly text or positional prompt, simultaneously predict the pose and mask of the specified individual, providing an identity-aware and comprehensive human representation.
2. **RefHuman dataset**: Extends the MS COCO dataset with more than 50,000 annotated instances, each equipped with keypoint, mask, and prompt annotations.
3. **UniPHD model**: An end-to-end, prompt-conditioned model that handles text or positional prompts and produces results for the specified individual using multimodal representations and the proposed pose-centric hierarchical decoder.

### Main Contributions

- Proposes the new task of Referring Human Pose and Mask Estimation (R-HPM), which makes results identity-aware, provides a comprehensive human representation, and benefits human-AI interaction.
- Introduces the RefHuman dataset, which substantially extends MS COCO with pose and mask annotations in diverse, unconstrained environments, together with the corresponding text and positional prompts.
- Proposes UniPHD, an end-to-end prompt-conditioned model that achieves top-tier performance and establishes a benchmark for future research.

Through these contributions, the paper addresses the limitations of existing multi-person pose estimation methods and advances the field of human pose estimation.
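
The post-hoc selection step that the background discussion attributes to conventional multi-person estimators can be illustrated with a small sketch. This is not the paper's method and not UniPHD: it is a hypothetical baseline in which a multi-person estimator has already returned keypoints for every person, and a positional prompt (a single point) is matched to one of them afterwards. The names `PersonPrediction`, `bounding_box`, and `select_by_point` are invented for illustration, and the matching rule (containment in the keypoint bounding box, ties broken by distance to the box center) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class PersonPrediction:
    """One person returned by a hypothetical multi-person pose estimator."""
    keypoints: List[Point]  # (x, y) image coordinates, one per joint


def bounding_box(keypoints: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the keypoints."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return min(xs), min(ys), max(xs), max(ys)


def select_by_point(people: List[PersonPrediction], prompt: Point) -> Optional[int]:
    """Return the index of the person matched to a point prompt.

    Assumed rule: a person is a candidate if the prompt lies inside the
    keypoint bounding box; among candidates, the one whose box center is
    closest to the prompt wins. Returns None when no box contains the prompt,
    which is the false-negative case of this kind of post-hoc matching.
    """
    px, py = prompt
    best_index: Optional[int] = None
    best_distance = float("inf")
    for index, person in enumerate(people):
        x0, y0, x1, y1 = bounding_box(person.keypoints)
        if not (x0 <= px <= x1 and y0 <= py <= y1):
            continue  # prompt falls outside this person's box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        distance = (px - cx) ** 2 + (py - cy) ** 2
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


# Two overlapping people, each reduced to two keypoints for brevity.
people = [
    PersonPrediction(keypoints=[(0.0, 0.0), (10.0, 20.0)]),
    PersonPrediction(keypoints=[(8.0, 0.0), (30.0, 20.0)]),
]
chosen = select_by_point(people, (9.0, 10.0))  # point lies in both boxes
missed = select_by_point(people, (100.0, 100.0))  # point lies in neither box
```

In the overlapping region the choice depends entirely on the hand-written tie-break, and a prompt outside every box yields no match at all. A prompt-conditioned model such as the one the summary describes is meant to avoid this separate matching stage by taking the prompt as an input.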

Referring Human Pose and Mask Estimation in the Wild
