Abstract: The dominant backbones of neural networks for scene parsing consist of multiple stages, and feature maps from different stages carry different levels of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features carry fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing the semantic-spatial gaps among these features. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing. The problem is especially acute for human parsing, which demands both richer semantic information and finer feature-map detail because of intricate textures, diverse clothing styles, and heavy scale variability across human parts. In this paper, we alleviate the long-standing problem of semantic-spatial gaps between features from different stages by using subtraction and addition operations to identify the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
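As a rough illustration of the subtract-then-add idea described in the abstract, the following is a minimal sketch in plain Python on single-channel feature maps. It is not the CSENet architecture, whose modules the abstract does not specify. The helper names, the nearest-neighbour upsampling, the mean-pool downsampling, and the scalar weight `alpha` are all assumptions made for illustration only.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel feature map (hypothetical toy format)


def upsample_nearest(x: Grid, factor: int) -> Grid:
    """Repeat every value `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in x for _ in range(factor)]


def downsample_mean(x: Grid, factor: int) -> Grid:
    """Average each non-overlapping factor x factor block."""
    h, w = len(x) // factor, len(x[0]) // factor
    return [[sum(x[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w)]
            for i in range(h)]


def subtract(a: Grid, b: Grid) -> Grid:
    """Element-wise a - b: exposes what `a` has that `b` lacks."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_scaled(a: Grid, b: Grid, alpha: float) -> Grid:
    """Element-wise a + alpha * b: compensates `a` with the difference `b`."""
    return [[p + alpha * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def compensate(low: Grid, high: Grid, factor: int = 2,
               alpha: float = 0.5) -> Tuple[Grid, Grid]:
    """Bidirectional compensation between a low-stage and a high-stage map.

    `alpha` is an assumed fixed weight; a real network would more likely
    learn how much of each difference to inject.
    """
    high_up = upsample_nearest(high, factor)
    # Semantic difference at low-stage resolution, injected into the low stage.
    low_enhanced = add_scaled(low, subtract(high_up, low), alpha)
    # Spatial-detail difference, pooled to high-stage resolution and injected.
    detail = downsample_mean(subtract(low, high_up), factor)
    high_enhanced = add_scaled(high, detail, alpha)
    return low_enhanced, high_enhanced


low_stage = [[1.0, 2.0], [3.0, 4.0]]   # finer resolution, fewer semantics
high_stage = [[4.0]]                    # coarser resolution, more semantics
low_out, high_out = compensate(low_stage, high_stage)
print(low_out, high_out)
```

By contrast, the direct fusion the abstract criticises would amount to adding `upsample_nearest(high, factor)` to `low` without first measuring how the two maps differ.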

Lightweight cross-guided contextual perceptive network for visible–infrared urban road scene parsing

A Saliency-Aware Deep Network for Narrow Road Extraction of High-Resolution Remote Sensing Imagery

FoveaNet: Perspective-Aware Urban Scene Parsing

EFRNet: Efficient Feature Reconstructing Network for Real-Time Scene Parsing

C2Net: Road Extraction via Context Perception and Cross Spatial-Scale Feature Interaction

NDNet: Spacewise Multiscale Representation Learning via Neighbor Decoupling for Real-Time Driving Scene Parsing

Channel and Spatial Enhancement Network for human parsing

Misalignment fusion network for parsing infrared and visible urban scenes

Cross-CBAM: a lightweight network for real-time scene segmentation

EKENet: Efficient knowledge enhanced network for real-time scene parsing

A Real-Time Scene Parsing Network for Autonomous Maritime Transportation

Cross-CBAM: A Lightweight network for Scene Segmentation

Fast and Accurate Scene Parsing Via Bi-direction Alignment Networks

RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

Unified Perceptual Parsing for Scene Understanding

Context-Adaptive Deep Learning for Efficient Image Parsing in Remote Sensing: An Automated Parameter Selection Approach

C2S-RoadNet: Road Extraction Model with Depth-Wise Separable Convolution and Self-Attention

TransRoadNet: A Novel Road Extraction Method for Remote Sensing Images via Combining High-Level Semantic Feature and Context

LightFGCNet: A Lightweight and Focusing on Global Context Information Semantic Segmentation Network for Remote Sensing Imagery

LFFNet: lightweight feature-enhanced fusion network for real-time semantic segmentation of road scenes