Abstract: Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, and person re-identification) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.

What problem does this paper attempt to address?

The core problem this paper addresses is **how to build a single unified model that handles multiple human-centered tasks**. Specifically, the authors propose **UniHCP (Unified Model for Human-Centric Perceptions)**, which handles several human-centered tasks with one model: pose estimation, semantic segmentation (human parsing), pedestrian detection, person re-identification (ReID), and attribute prediction.

### Main Problems and Challenges

1. **Task Homogeneity and Diversity**:
   - Although different human perception tasks (such as pose estimation, semantic segmentation, and pedestrian detection) focus on different semantic information, they all rely on the basic structure of the human body and the attributes of each body part.
   - The traditional approach designs a specialized model for each task, which leads to parameter redundancy and increased deployment complexity.
2. **Data Diversity and Output Structure**:
   - The datasets for different tasks differ in resolution and characteristics (such as day versus night, or indoor versus outdoor scenes), which challenges the robustness and representational capacity of the model.
   - The output structures and granularities also differ across tasks, which makes it difficult to handle them in a unified framework.

### Solutions

To address these challenges, the authors propose the following:

- **Unified Transformer Encoder-Decoder Architecture**: Use a plain vision transformer as the basic architecture to handle the diversity of input images and extract general human perception features.
- **Task-specific Queries**: Define specific queries for each task so that the decoder attends to task-related features.
- **Task-guided Interpreter**: Decompose the outputs of different tasks into shared units (feature representations, local probability maps, global probabilities, and bounding box coordinates), and generate the final output through the interpreter.
- **Large-scale Multi-task Pre-training**: Train jointly on 33 human-centric datasets to make full use of the available supervision signals and improve the generalization ability and performance of the model.

### Experimental Results

When adapted to a specific task, UniHCP reports new state-of-the-art results on several human-centered tasks, surpassing specialized models designed for a single task. For example:

- On the CIHP dataset, UniHCP reaches 69.8 mIoU for human parsing.
- On the PA-100K dataset, UniHCP achieves 86.18 mA for attribute prediction.
- On the Market1501 dataset, UniHCP reaches 90.3 mAP for re-identification.
- On the CrowdHuman dataset, UniHCP achieves 85.8 JI for pedestrian detection.

### Summary

UniHCP demonstrates that multiple human perception tasks can be handled by a unified model. This improves performance while reducing parameter redundancy and deployment complexity, and it suggests new directions for future research.
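The task-guided interpreter listed under Solutions can be pictured with a small sketch. It is an illustration only, not UniHCP's implementation: the names `OutputUnits`, `TASK_UNITS`, and `interpret` are hypothetical, and the mapping from each task to the shared units it reads is an assumption made for the example. The summary above names the four units but does not state which task uses which.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OutputUnits:
    """The four shared output units named in the summary, for one query."""
    feature: List[float]                     # feature representation
    local_prob_map: List[List[float]]        # local probability map (H x W)
    global_prob: float                       # global probability
    bbox: Tuple[float, float, float, float]  # bounding box coordinates


# ASSUMPTION: which shared units each task type reads. This mapping is
# illustrative and is not taken from the summary above.
TASK_UNITS: Dict[str, Tuple[str, ...]] = {
    "reid": ("feature",),
    "pose": ("local_prob_map",),
    "parsing": ("local_prob_map", "global_prob"),
    "attribute": ("global_prob",),
    "detection": ("global_prob", "bbox"),
}


def interpret(task: str, units_per_query: List[OutputUnits]) -> List[Dict[str, object]]:
    """Keep only the units the task needs, one dict per task-specific query."""
    wanted = TASK_UNITS[task]
    return [{name: getattr(u, name) for name in wanted} for u in units_per_query]
```

For instance, calling `interpret("detection", units)` on a list of `OutputUnits` would return only the `global_prob` and `bbox` fields for each query. The point of the design is that every task shares the same unit definitions, so adding a task means adding an entry to the mapping rather than a new output head.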

UniHCP: A Unified Model for Human-Centric Perceptions

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

Adept: Annotation-denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

UniHuman: A Unified Model for Editing Human Images in the Wild

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

UniAR: A Unified model for predicting human Attention and Responses on visual content

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Unifying Visual Perception by Dispersible Points Learning

UniHead: Unifying Multi-Perception for Detection Heads

A Unified Framework for Human-centric Point Cloud Video Understanding

UniVision: A Unified Framework for Vision-Centric 3D Perception

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

UniParser: Multi-Human Parsing With Unified Correlation Representation Learning

Sapiens: Foundation for Human Vision Models