Abstract: Global context information is vital in visual understanding problems, especially in pixel-level semantic segmentation. Mainstream methods adopt the self-attention mechanism to model global context information. However, pixels belonging to different classes usually have weak feature correlation, so modeling the global pixel-level correlation matrix indiscriminately, as self-attention does, is highly redundant. To solve this problem, we propose a hierarchical context network that differentially models homogeneous pixels with strong correlations and heterogeneous pixels with weak correlations. Specifically, we first propose a multi-scale guided pre-segmentation module to divide the entire feature map into different class-based homogeneous regions. Within each homogeneous region, we design the pixel context module to capture pixel-level correlations. Subsequently, unlike the self-attention mechanism, which still models weak heterogeneous correlations in a dense pixel-level manner, the region context module models sparse region-level dependencies using a unified representation of each region. By aggregating fine-grained pixel context features and coarse-grained region context features, our proposed network can not only hierarchically model global context information but also harvest multi-granularity representations to more robustly identify multi-scale objects. We evaluate our approach on the Cityscapes and ISPRS Vaihingen datasets. Without bells and whistles, our approach attains a mean IoU of 82.8% on the Cityscapes test set and an overall accuracy of 91.4% on the ISPRS Vaihingen test set, achieving state-of-the-art results.
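
The two-level idea described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the region labels are taken as given (standing in for the output of the pre-segmentation module), the "unified representation" of a region is assumed to be the mean of its pixel features, both context modules are assumed to be plain scaled dot-product attention without learned projections, and the function names `attention` and `hierarchical_context` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (assumption: no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def hierarchical_context(feats, regions):
    """feats: (N, C) float pixel features; regions: (N,) integer region id per pixel.

    Returns (pixel_ctx, region_ctx), each of shape (N, C).
    """
    ids = np.unique(regions)  # sorted unique region ids
    pixel_ctx = np.zeros_like(feats)
    region_repr = np.zeros((len(ids), feats.shape[1]), dtype=feats.dtype)
    for i, r in enumerate(ids):
        mask = regions == r
        f = feats[mask]
        # Pixel context: dense attention, but only among pixels of one region.
        pixel_ctx[mask] = attention(f, f, f)
        # Unified region representation (assumption: mean pooling).
        region_repr[i] = f.mean(axis=0)
    # Region context: attention among the R region vectors, not the N pixels.
    region_out = attention(region_repr, region_repr, region_repr)
    # Broadcast each region's context vector back to its member pixels.
    region_ctx = region_out[np.searchsorted(ids, regions)]
    return pixel_ctx, region_ctx

# Toy input: 12 pixels with 4 channels, split into 3 regions.
rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 4))
regions = np.array([0] * 5 + [1] * 4 + [2] * 3)
pixel_ctx, region_ctx = hierarchical_context(feats, regions)
# Aggregation by channel concatenation is an assumption of this sketch.
fused = np.concatenate([pixel_ctx, region_ctx], axis=1)
```

In this toy case, dense self-attention over all 12 pixels would compute 144 pairwise scores, whereas the sketch computes 25 + 16 + 9 scores inside the three regions plus 9 between regions, 59 in total. The saving grows as regions become more numerous and more balanced in size.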

THCANet: Two-layer hop cascaded asymptotic network for robot-driving road-scene semantic segmentation in RGB-D images

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

Context-Aware Interaction Network for RGB-T Semantic Segmentation

Context Aggregation Network for Remote Sensing Image Semantic Segmentation

D-CANet: Diverse Class-Aware Coding and Decoding Structure Network for Semantic Segmentation of High-Resolution Remote Sensing Images

High-Resolution Remote Sensing Image Semantic Segmentation Method Based on Improved Encoder-Decoder Convolutional Neural Network

MTANet: Multitask-Aware Network with Hierarchical Multimodal Fusion for RGB-T Urban Scene Understanding

DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation

MFCANet: A Road Scene Segmentation Network Based on Multi-Scale Feature Fusion and Context Information Aggregation

Attention-guided chained context aggregation for semantic segmentation

Adaptive multi-scale dual attention network for semantic segmentation

Interactive Context-Aware Network for RGB-T Salient Object Detection

Hybridizing Cross-Level Contextual and Attentive Representations for Remote Sensing Imagery Semantic Segmentation

Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation

DCANet: Dense Context-Aware Network for Semantic Segmentation

CI-Net: a joint depth estimation and semantic segmentation network using contextual information

LinkNet: 2D-3D Linked Multi-Modal Network for Online Semantic Segmentation of RGB-D Videos

HCNet: Hierarchical Context Network for Semantic Segmentation