Abstract: RGB-Thermal (RGB-T) semantic segmentation has shown great potential in handling low-light conditions where RGB-based segmentation is hindered by poor RGB imaging quality. The key to RGB-T semantic segmentation is to effectively leverage the complementary nature of RGB and thermal images. Most existing algorithms fuse RGB and thermal information in feature space via concatenation, element-wise summation, or attention operations, through either unidirectional enhancement or bidirectional aggregation. However, they usually overlook the modality gap between RGB and thermal images during feature fusion, resulting in modality-specific information from one modality contaminating the other. In this paper, we propose a Channel and Spatial Relation-Propagation Network (CSRPNet) for RGB-T semantic segmentation, which propagates only modality-shared information across different modalities and alleviates the modality-specific information contamination issue. Our CSRPNet first performs relation-propagation in channel and spatial dimensions to capture the modality-shared features from the RGB and thermal features. CSRPNet then aggregates the modality-shared features captured from one modality with the input feature from the other modality to enhance the input feature without the contamination issue. While being fused together, the enhanced RGB and thermal features are also fed into the subsequent RGB or thermal feature extraction layers, respectively, for interactive feature fusion. We also introduce a dual-path cascaded feature refinement module that aggregates multi-layer features to produce two refined features for semantic and boundary prediction. Extensive experimental results demonstrate that CSRPNet performs favorably against state-of-the-art algorithms.

What problem does this paper attempt to address?

The paper primarily addresses the key challenge in the RGB-Thermal (RGB-T) semantic segmentation task: how to effectively fuse information from RGB images and thermal infrared images while overcoming the differences between the two modalities (i.e., the modality gap). It proposes a new solution to this problem.

### Problems the Paper Aims to Solve

1. **Modality Gap**: There are inherent differences between RGB images and thermal infrared images, which lead to modality-specific information contamination when the two modalities are fused directly. That is, information unique to one modality can contaminate the features of the other modality.
2. **Limitations of Existing Methods**: Most current RGB-T semantic segmentation algorithms use unidirectional enhancement or bidirectional aggregation when fusing RGB and thermal infrared features. These methods often overlook the modality gap between RGB and thermal infrared images, resulting in modality-specific information contamination in the fused features.

### Solution Overview

To address the above issues, the paper proposes a new architecture called the "Channel and Spatial Relation-Propagation Network" (CSRPNet). The core idea of CSRPNet is to first extract features containing modality-shared information through a "relation propagation" technique before fusing the features of the two modalities. Then, only these shared features are used for interactive multi-modal fusion, thereby avoiding modality-specific information contamination.

### Key Technical Points

- **Channel and Spatial Relation-Propagation Module**: This module first calculates the inter-channel and inter-pixel relation matrices between RGB and thermal infrared image features. It then captures modality-shared features through matrix operations and uses these features to enhance the input features, achieving interactive fusion.
- **Dual-path Cascaded Feature Refinement Module**: To fully utilize multi-layer fused features, the paper also designs a Dual-path Cascaded Feature Refinement (DCFR) module. This module generates two refined feature maps through two paths, used for boundary prediction and semantic prediction.

### Summary

In summary, the paper aims to address the modality gap issue in the RGB-T semantic segmentation task through CSRPNet, ensuring that the characteristics of each modality are preserved during the fusion of RGB and thermal infrared image features, thereby improving segmentation accuracy and robustness.
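The relation-propagation idea described under Key Technical Points can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the summary does not specify the exact matrix operations, so the softmax-normalized dot-product affinities, the residual sum, and every name below (`propagate_shared`, `softmax_rows`, and so on) are assumptions made purely for illustration. Features are stored as `C x N` matrices (channels by flattened pixels) using plain Python lists.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Turn each row into non-negative weights that sum to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def propagate_shared(target, source):
    """Enhance `target` (C x N) with `source` (C x N) features re-weighted by
    cross-modal relations.

    Assumption: dot-product affinities followed by a row-wise softmax stand in
    for the relation matrices; the paper's actual formulation may differ.
    """
    # Inter-channel relation (C x C): target channel i versus source channel j.
    channel_rel = softmax_rows(matmul(target, transpose(source)))
    channel_shared = matmul(channel_rel, source)             # C x N

    # Inter-pixel relation (N x N): target pixel i versus source pixel j.
    spatial_rel = softmax_rows(matmul(transpose(target), source))
    spatial_shared = matmul(source, transpose(spatial_rel))  # C x N

    # Residual aggregation: keep the target feature and add the propagated parts.
    return add(target, add(channel_shared, spatial_shared))

# Toy features with C = 2 channels and N = 3 pixels (made-up values).
rgb = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
thermal = [[0.0, 1.0, 1.0], [2.0, 0.0, 0.5]]

enhanced_rgb = propagate_shared(rgb, thermal)      # thermal -> RGB direction
enhanced_thermal = propagate_shared(thermal, rgb)  # RGB -> thermal direction
print(enhanced_rgb)
print(enhanced_thermal)
```

Calling the function once in each direction mirrors the interactive fusion described above: each modality is enhanced only by source features that have been re-weighted according to their relation to the target, rather than by the raw features of the other modality. In a real network these operations would run on batched tensors inside each encoder stage, with learned projections before the affinity computation.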

Channel and Spatial Relation-Propagation Network for RGB-Thermal Semantic Segmentation

Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation

RGB-T Semantic Segmentation with Location, Activation, and Sharpening

A Feature Divide-and-Conquer Network for RGB-T Semantic Segmentation

C4Net: Excavating Cross-modal Context- and Content-Complementarity for RGB-T Semantic Segmentation

Dual-branch deep cross-modal interaction network for semantic segmentation with thermal images

An RGB-D Fusion Based Semantic Segmentation Algorithm Based on Neighborhood Metric Relations

Region-adaptive and context-complementary cross modulation for RGB-T semantic segmentation

Mask-guided Modality Difference Reduction Network for RGB-T Semantic Segmentation

DCFNet: Dense Complementary Fusion for RGB-Thermal Urban Scene Perception

Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation

Multispectral Fusion Transformer Network for RGB-Thermal Urban Scene Semantic Segmentation

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Global-Local Propagation Network for RGB-D Semantic Segmentation

Mitigating Modality Discrepancies for RGB-T Semantic Segmentation

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

Specificity-preserving RGB-D Saliency Detection

Bimodal Feature Propagation and Fusion for Real-time Semantic Segmentation on RGB-D Images

ABMDRNet: Adaptive-weighted Bi-directional Modality Difference Reduction Network for RGB-T Semantic Segmentation