Abstract: The dominant backbones of neural networks for scene parsing consist of multiple stages, and the feature maps produced at different stages contain different levels of spatial and semantic information. High-level features convey more semantics but fewer spatial details, while low-level features carry fewer semantics but more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing these gaps. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing. The problem is especially acute for human parsing, which demands both richer semantic information and finer feature-map detail because of intricate textures, diverse clothing styles, and heavy scale variability across human parts. In this paper, we alleviate the long-standing semantic-spatial gaps between features from different stages by using subtraction and addition operations to identify the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
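The subtract-then-add idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the CSENet architecture, whose modules the abstract does not specify. The helper names (`upsample_nearest`, `downsample_mean`, `compensate`), the fixed scale `alpha`, and the tensor shapes are all hypothetical and stand in for whatever learned transforms the actual method uses.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_mean(x, factor):
    """Average pooling of a (C, H, W) array over factor x factor windows."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def compensate(target, reference, alpha=0.5):
    """Subtract to expose what `target` lacks relative to `reference`,
    then add a scaled share of that difference back onto `target`.
    `alpha` is an assumed fixed scale, not a value from the paper."""
    difference = reference - target
    return target + alpha * difference

# Assumed shapes: a low-level map at 16x16 and a high-level map at 8x8.
rng = np.random.default_rng(0)
low = rng.normal(size=(8, 16, 16))   # more spatial detail, fewer semantics
high = rng.normal(size=(8, 8, 8))    # more semantics, less spatial detail

# Inject semantics into the lower stage, and detail into the higher stage.
low_enhanced = compensate(low, upsample_nearest(high, 2))
high_enhanced = compensate(high, downsample_mean(low, 2))
print(low_enhanced.shape, high_enhanced.shape)  # (8, 16, 16) (8, 8, 8)
```

With a fixed `alpha` this reduces to a plain interpolation between the two maps; a real network would presumably replace the scale with learned channel-wise or spatial weights so that the compensation depends on the difference itself.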

Keypoint based weakly supervised human parsing

Learning hierarchical poselets for human parsing

Human Parsing by Weak Structural Label

Weakly and Semi Supervised Human Body Part Parsing Via Pose-Guided Knowledge Transfer

Channel and Spatial Enhancement Network for human parsing

Learning Semisupervised Multilabel Fully Convolutional Network for Hierarchical Object Parsing

Classification Assisted Segmentation Network For Human Parsing

Human Parsing with Contextualized Convolutional Neural Network

Self-supervised Structure-Sensitive Learning for Human Parsing

Hierarchical Information Passing Based Noise-Tolerant Hybrid Learning for Semi-Supervised Human Parsing

Matching-CNN Meets KNN: Quasi-Parametric Human Parsing

Multi-Scale Dual-Branch Fully Convolutional Network for Hand Parsing

Combining Parsing Information with Joint Structure for Human Pose Estimation

Hybrid Resolution Network Using Edge Guided Region Mutual Information Loss for Human Parsing

Deep Human Parsing with Active Template Regression

From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing

Weakly-supervised Scene Parsing with Multiple Contextual Cues

Part Decomposition and Refinement Network for Human Parsing

Pose-Guided Human Parsing with Deep Learned Features

Attention-guided Progressive Partition Network for Human Parsing