Abstract: The dominant backbones of neural networks for scene parsing consist of multiple stages, and the feature maps from different stages carry different levels of spatial and semantic information. High-level features convey more semantics but fewer spatial details, while low-level features carry fewer semantics but more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing these gaps. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing. The problem is especially acute in human parsing, which demands both richer semantic information and finer detail in the feature maps because of intricate textures, diverse clothing styles, and heavy scale variability across human parts. In this paper, we alleviate the long-standing problem of semantic-spatial gaps between features from different stages by using subtraction and addition operations to identify the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
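
The subtract-then-add idea in the abstract can be illustrated with a toy NumPy sketch. This is not the CSENet implementation, whose layers the abstract does not specify. It assumes both feature maps have the same channel count, that resolutions differ by an integer factor, and that a fixed scalar `alpha` (hypothetical, standing in for whatever learned weighting a real network would use) controls how much of each gap is compensated. The helper names `upsample_nearest`, `downsample_mean`, and `compensate` are likewise illustrative.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def downsample_mean(x, factor):
    """Average-pool a (C, H, W) array by an integer factor."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def compensate(low, high, alpha=0.5):
    """Toy two-way gap compensation between a low-level and a high-level map.

    low  : (C, H, W) high-resolution, low-semantic features
    high : (C, H/f, W/f) low-resolution, high-semantic features
    alpha: hypothetical fixed mixing weight (an assumption of this sketch)
    """
    factor = low.shape[1] // high.shape[1]

    # Subtraction exposes what each level lacks relative to the other.
    semantic_gap = upsample_nearest(high, factor) - low   # missing in `low`
    spatial_gap = downsample_mean(low, factor) - high     # missing in `high`

    # Addition injects a fraction of each gap back into the map that lacks it.
    low_enhanced = low + alpha * semantic_gap
    high_enhanced = high + alpha * spatial_gap
    return low_enhanced, high_enhanced


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low = rng.standard_normal((4, 8, 8))
    high = rng.standard_normal((4, 4, 4))
    low_e, high_e = compensate(low, high)
    print(low_e.shape, high_e.shape)
```

With `alpha=0` both maps are returned unchanged, and with `alpha=1` each map is fully replaced by the resampled other level, so intermediate values blend the two. A real model would presumably learn this weighting per channel or per location rather than fix it.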

Boosting Scene Parsing Performance via Reliable Scale Prediction

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Scale-Adaptive Convolutions for Scene Parsing

CaseNet: Content-Adaptive Scale Interaction Networks for Scene Parsing

CASINet: Content-Adaptive Scale Interaction Networks for Scene Parsing

ScaleNet: Searching for the Model to Scale

SPGNet: Semantic Prediction Guidance for Scene Parsing

Graph-Based Scale-Aware Network for Human Parsing

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

Channel and Spatial Enhancement Network for Human Parsing

SPNet: Superpixel Pyramid Network for Scene Parsing

Toward Achieving Robust Low-Level and High-Level Scene Parsing

Multi-Timescale Context Encoding for Scene Parsing Prediction

Objectness Region Enhancement Networks for Scene Parsing

Efficient Scale-Permuted Backbone with Learned Resource Distribution

NSSNet: Scale-Aware Object Counting With Non-Scale Suppression

Multi-layer Feature Aggregation for Deep Scene Parsing Models

Scale-Recursive Network with Point Supervision for Crowd Scene Analysis

ScaleNet: Guiding Object Proposal Generation in Supermarkets and Beyond

MCFNet: Multi-Attentional Class Feature Augmentation Network for Real-Time Scene Parsing