Deep Learning Technique for Human Parsing: A Survey and Outlook

Lu Yang, Wenhe Jia, Shan Li, Qing Song

DOI: https://doi.org/10.1007/s11263-024-02031-9

2024-03-14

Abstract: Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses several core issues in human parsing, the task of segmenting the humans in an image or video into multiple pixel-level semantic parts. Although deep learning-based solutions have made significant progress, many important concepts, existing challenges, and potential research directions still need clarification. To this end, the paper comprehensively reviews three core sub-tasks:

1. **Single Human Parsing (SHP)**:
   - The goal is to assign each pixel its semantic part category when the image contains only one foreground human instance.
   - The main challenges include large intra-class variation, unrestricted poses, and occlusion.
2. **Multiple Human Parsing (MHP)**:
   - The goal is to parse multiple human instances in a single pass, predicting a per-pixel person identity in addition to the part category.
   - The core issues are how to distinguish individuals in crowded scenes, how to learn the features of each person comprehensively, and how to improve inference efficiency.
3. **Video Human Parsing (VHP)**:
   - The goal is to parse each person in video data, which can be seen as a complex visual task combining video segmentation and image-level human parsing.
   - The main challenges include motion blur and camera position changes.

Additionally, the paper proposes a transformer-based framework as a high-performance baseline for subsequent research, points out under-explored open questions in the field, and suggests future research directions. Through these efforts, it aims to promote the sustainable development of the human parsing field.
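The difference between the SHP and MHP outputs described above can be made concrete with a small sketch. It is only an illustration of the task settings, not the survey's framework: the part names, the toy 4x6 label grids, and the helper `group_parts_by_person` are all hypothetical. SHP is modelled as one part label per pixel, and MHP as the same part map plus a per-pixel person identity.

```python
from collections import defaultdict

# Hypothetical part vocabulary, chosen only for illustration.
PARTS = {0: "background", 1: "head", 2: "torso", 3: "legs"}

# Single human parsing output: one part label per pixel (toy 4x6 image).
part_map = [
    [0, 1, 0, 0, 1, 0],
    [0, 2, 0, 0, 2, 0],
    [0, 2, 0, 0, 2, 0],
    [0, 3, 0, 0, 3, 0],
]

# Multiple human parsing additionally predicts a person identity per pixel
# (0 means the pixel belongs to no person).
instance_map = [
    [0, 1, 0, 0, 2, 0],
    [0, 1, 0, 0, 2, 0],
    [0, 1, 0, 0, 2, 0],
    [0, 1, 0, 0, 2, 0],
]


def group_parts_by_person(part_map, instance_map):
    """Return {person_id: {part_name: pixel_count}}, skipping background."""
    counts = defaultdict(lambda: defaultdict(int))
    for part_row, inst_row in zip(part_map, instance_map):
        for part_id, person_id in zip(part_row, inst_row):
            if person_id != 0 and part_id != 0:
                counts[person_id][PARTS[part_id]] += 1
    return {pid: dict(parts) for pid, parts in counts.items()}


per_person = group_parts_by_person(part_map, instance_map)
for person_id, parts in sorted(per_person.items()):
    print(person_id, parts)
```

With the part map alone, the two people are indistinguishable: both columns carry the same head, torso, and legs labels. The identity map is what lets each part be attributed to a specific person. Under the same assumptions, a video human parsing output could be represented as a list of such map pairs, one per frame.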

Self-supervised Structure-Sensitive Learning for Human Parsing

Devil in the Details: Towards Accurate Single and Multiple Human Parsing

From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing

Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing

Learning Semantic Neural Tree for Human Parsing

Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains

Deep Hierarchical Human Semantic Parsing

Deep Human Parsing with Active Template Regression

Multiple-Human Parsing in the Wild

Learning deep representations for semantic image parsing: a comprehensive overview

Fine-Grained Multi-human Parsing

Towards Real World Human Parsing: Multiple-Human Parsing in the Wild

Parsing Objects at a Finer Granularity: A Survey

Human Parsing by Weak Structural Label

End-to-end One-shot Human Parsing

Video Scene Parsing: an Overview of Deep Learning Methods and Datasets

Renovating Parsing R-CNN for Accurate Multiple Human Parsing

Part Decomposition and Refinement Network for Human Parsing

Cross-domain Human Parsing Via Adversarial Feature and Label Adaptation

Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing