Deep Learning Technique for Human Parsing: A Survey and Outlook

Lu Yang, Wenhe Jia, Shan Li, Qing Song
DOI: https://doi.org/10.1007/s11263-024-02031-9
2024-03-14
Abstract: Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several core issues in human parsing, the task of segmenting the human body in images or videos into multiple pixel-level semantic parts. Although deep learning-based solutions have achieved strong results, many important concepts, existing challenges, and potential research directions still need clarification. To this end, the paper provides a comprehensive review of three core sub-tasks:

1. **Single Human Parsing (SHP)**:
   - The goal is to assign each pixel its semantic category when the image contains only one foreground human instance.
   - The main challenges include large intra-class variation, unrestricted poses, and occlusion.
2. **Multiple Human Parsing (MHP)**:
   - The goal is to parse multiple human instances in a single pass, assigning each pixel a person identity in addition to its semantic category.
   - The core issues are how to distinguish individuals in crowded scenes, how to learn the features of each person comprehensively, and how to improve inference efficiency.
3. **Video Human Parsing (VHP)**:
   - The goal is to parse each person in video data, which can be seen as a complex visual task combining video segmentation and image-level human parsing.
   - The main challenges include motion blur and camera position changes.

Additionally, the paper proposes a transformer-based framework that provides a high-performance baseline for subsequent research, and it points out under-explored open questions in the field along with suggested future research directions. Through these efforts, the paper aims to promote the sustainable development of the human parsing field.
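
The difference between the SHP and MHP outputs described above can be made concrete with a toy example. The following is a minimal sketch, not the paper's method: it assumes an SHP-style prediction is a grid of part ids and that MHP adds a second grid of person ids. The part names, the `split_instances` helper, and the convention that id 0 means background or no person are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical part labels for illustration; real datasets define their own.
PART_NAMES = {0: "background", 1: "head", 2: "torso", 3: "legs"}


def split_instances(category_map, instance_map):
    """Group pixels by (person id, part name).

    category_map[r][c] is the semantic part id of a pixel (what SHP predicts).
    instance_map[r][c] is the person id of that pixel, with 0 meaning
    "no person" (the extra identity information MHP must predict).
    Returns {person_id: {part_name: [(row, col), ...]}}.
    """
    people = defaultdict(lambda: defaultdict(list))
    for r, (cat_row, inst_row) in enumerate(zip(category_map, instance_map)):
        for c, (part, person) in enumerate(zip(cat_row, inst_row)):
            # Skip background pixels and pixels not assigned to any person.
            if person != 0 and part != 0:
                people[person][PART_NAMES[part]].append((r, c))
    return {pid: dict(parts) for pid, parts in people.items()}


# A toy 3x4 "image" with two people separated by a background column.
category_map = [
    [1, 1, 0, 1],
    [2, 2, 0, 2],
    [3, 3, 0, 3],
]
instance_map = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [1, 1, 0, 2],
]

people = split_instances(category_map, instance_map)
for person_id, parts in sorted(people.items()):
    print(person_id, {name: len(pixels) for name, pixels in parts.items()})
```

With only `category_map`, the two heads are indistinguishable because they share the same part id. The identity grid is what lets the pixels be attributed to separate people, which is the additional requirement MHP places on top of SHP.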