Abstract: Human parsing, especially in the wild, has attracted considerable attention because of its potential in many real-world applications. The Pyramid Spatial Parsing (PSP) module has shown superior performance in scene and human parsing tasks. However, the basic AvgPool operation in PSP aggregates the spatial clues of a local region with equal weight and thus mixes the influences of the different human parts present in that region. As a result, it fails to capture the contexts relevant to parsing individual parts. To address this problem, this paper proposes a mechanism that collects spatial clues aligned with different human parts. We employ a Gather-Excite (GE) operation as a replacement for the AvgPool-Upsample operation in a pyramidical structure, so that relevant human parts of various scales are reflected accurately. The GE operation contains two steps: the gather operation, which adaptively aggregates spatial clues for the relevant human parts, and the excite operation, which generates new feature maps from the gathered contextual information. This results in a novel Pyramidical Gather-Excite Context (PGEC) module that addresses the multi-scale problem and parses persons at various scales. The PGEC module is composed of multiple parallel GE operations with different spatial extents, and it aggregates local and global spatial clues to better model multi-scale contextual information. Moreover, we integrate the PGEC module with fine-grained details, an edge-preserving module, and deep supervision to form a novel PGEC Network (PGECNet) for human parsing. The proposed PGECNet achieves state-of-the-art performance on four single-person human parsing datasets (LIP, PPSS, ATR, and Fashion Clothing) and two multi-person human parsing datasets (PASCAL-Person-Part and CIHP). The experimental results show that the proposed PGEC module is superior to the PSP and ASPP modules, especially in the single-person human parsing task.
The source code is publicly available at https://github.com/31sy/PGECNet.
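
As a rough illustration of the two GE steps and the pyramid of spatial extents described in the abstract, the following NumPy sketch may help. It is not the authors' implementation: the function names `gather`, `excite`, and `pgec` are hypothetical, the gather weights are passed in by hand rather than learned, the windows are assumed to be non-overlapping, and the sigmoid gating in `excite` follows the general Gather-Excite formulation rather than anything stated in the abstract.

```python
import numpy as np


def gather(x, weights):
    """Aggregate each non-overlapping e x e window with per-channel weights.

    x: (C, H, W) feature map; weights: (C, e, e). H and W must be divisible by e.
    Uniform weights 1 / (e * e) reduce this to plain average pooling.
    """
    c, h, w = x.shape
    e = weights.shape[-1]
    # Split H and W into (block, within-block) axes: (C, H/e, e, W/e, e).
    blocks = x.reshape(c, h // e, e, w // e, e)
    # Weighted sum over the two within-block axes.
    return np.einsum("cikjl,ckl->cij", blocks, weights)


def excite(x, gathered):
    """Upsample the gathered map to the size of x and use it to gate x.

    Assumption: sigmoid gating, as in the general Gather-Excite formulation.
    """
    e_h = x.shape[1] // gathered.shape[1]
    e_w = x.shape[2] // gathered.shape[2]
    # Nearest-neighbour upsampling back to (C, H, W).
    up = gathered.repeat(e_h, axis=1).repeat(e_w, axis=2)
    gate = 1.0 / (1.0 + np.exp(-up))
    return x * gate


def pgec(x, weights_per_extent):
    """Run one GE branch per spatial extent and concatenate with the input."""
    branches = [excite(x, gather(x, w)) for w in weights_per_extent]
    return np.concatenate([x] + branches, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # toy feature map: 4 channels, 8 x 8
extents = (2, 4, 8)                  # local to global spatial extents
# Uniform weights stand in for learned ones; they make gather equal AvgPool.
weights = [np.full((4, e, e), 1.0 / (e * e)) for e in extents]
out = pgec(x, weights)
print(out.shape)  # (16, 8, 8)
```

With uniform weights each branch degenerates to the AvgPool-Upsample behaviour that the abstract criticizes. The point of a gather step with non-uniform, trained weights is that positions inside a window can contribute unequally, which is one way to read "adaptively aggregates" here.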

Human Parsing With Pyramidical Gather-Excite Context