Abstract:The dominant backbones of neural networks for scene parsing consist of multiple stages, where feature maps in different stages often contain varying levels of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features possess fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing the semantic-spatial gaps present among these features. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing, especially for human parsing that demands more semantic information and more fine details of feature maps for the reason of intricate textures, diverse clothing styles, and heavy scale variability across different human parts. In this paper, we effectively alleviate the long-standing challenge of addressing semantic-spatial gaps between features from different stages by innovatively utilizing the subtraction and addition operations to recognize the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, offering a straightforward and intuitive solution for addressing semantic-spatial gaps via injecting high-semantic information to lower-stage features and vice versa, introducing fine details to higher-stage features. Extensive experiments on three dense prediction tasks have demonstrated the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets and we also verify the generality of our method on the ADE20K dataset.

Preserving Details in Semantics-Aware Context for Scene Parsing.

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Filling the Gaps in Atrous Convolution: Semantic Segmentation with a Better Context

Toward Achieving Robust Low-Level and High-Level Scene Parsing

Research of improving semantic image segmentation based on a feature fusion model

Interaction via Bi-directional Graph of Semantic Region Affinity for Scene Parsing

Using Features Specifically: an Efficient Network for Scene Segmentation Based on Dedicated Attention Mechanisms

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

Differentiating Features for Scene Segmentation Based on Dedicated Attention Mechanisms

Compensating for Local Ambiguity With Encoder-Decoder in Urban Scene Segmentation

Long and short-range relevance context network for semantic segmentation

Context Encoding for Semantic Segmentation

Attention-guided chained context aggregation for semantic segmentation

Semantic Segmentation Via Structured Patch Prediction, Context Crf And Guidance Crf

Attention Guided Global Enhancement and Local Refinement Network for Semantic Segmentation

SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor.

Context-Integrated and Feature-Refined Network for Lightweight Object Parsing

Channel and Spatial Enhancement Network for human parsing

Built-in Depth-Semantic Coupled Encoding for Scene Parsing, Vehicle Detection, and Road Segmentation

Dual Context Prior and Refined Prediction for Semantic Segmentation

Semantics Recalibration and Detail Enhancement Network for Real‐time Semantic Segmentation