Abstract: Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
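The abstract contrasts quadratic-cost self-attention with a model that keeps a global receptive field at linear cost. A minimal sketch of that idea follows, assuming a scalar state space recurrence with fixed coefficients `a`, `b`, and `c`. These names and values are illustrative only. Mamba itself makes its parameters input-dependent and uses a larger state, so this is not the Sigma implementation.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar discrete state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    The loop visits each input once, so cost grows linearly with len(xs).
    The coefficients are fixed here (an assumption for illustration).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state carries a summary of all earlier inputs
        ys.append(c * h)
    return ys


def pairwise_interactions(length):
    """Number of token pairs that full self-attention scores for a sequence."""
    return length * length


# An impulse at position 0 still influences the last output, so every
# earlier position is visible through the state at constant cost per step.
impulse = [1.0] + [0.0] * 7
outputs = ssm_scan(impulse, a=0.5)
```

Doubling the sequence length doubles the work in `ssm_scan`, while `pairwise_interactions` grows fourfold, which is the trade-off the abstract refers to.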

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the challenges of multi-modal semantic segmentation, particularly under adverse conditions such as low light or overexposure, in order to improve the perception and scene understanding of AI agents. It proposes a method called Sigma, which builds on the Mamba model and fuses RGB images with other modalities (such as thermal imaging and depth) to make semantic segmentation more robust and reliable.

### Background and Motivation

1. **Limitations of Existing Methods**:
   - **Convolutional Neural Networks (CNNs)**: They scale well and have linear complexity, but their receptive field is limited by the size of the convolution kernel, which biases them toward local information.
   - **Vision Transformers (ViTs)**: They provide a global receptive field and dynamic weights, but self-attention has quadratic complexity with respect to input size, which lowers efficiency.
2. **Advantages of Multi-modal Information**:
   - Additional modalities (such as thermal imaging and depth) provide complementary information, enhancing the robustness and capability of the visual system.
   - However, aligning and fusing multi-modal information presents new challenges.

### Proposed Method

1. **Sigma Model**:
   - **Siamese Mamba Encoder**: Extracts features from the different modalities.
   - **Fusion Module**: Selects and fuses information from the different modalities through the Cross Mamba Block (CroMB) and the Concat Mamba Block (ConMB).
   - **Channel-aware Decoder**: Enhances the model's ability to capture information across both spatial and channel dimensions.
2. **Innovations**:
   - **First Application of State Space Models (SSMs)**: The first successful application of SSMs, specifically Mamba, to multi-modal semantic segmentation.
   - **Efficient Fusion Mechanism**: Achieves effective cross-modal information fusion through the Mamba model.
   - **Comprehensive Experimental Validation**: Demonstrates superior accuracy and efficiency on RGB-thermal and RGB-depth datasets.

### Experimental Results

1. **Quantitative Analysis**:
   - On the MFNet and PST900 datasets, Sigma significantly outperforms other methods while using fewer parameters and less computation.
   - On PST900 in particular, performance improved by over 2%.
2. **Qualitative Analysis**:
   - Visualized results show that Sigma produces more complete segmentations and more accurate classifications, especially for complex features (such as tactile paving and guardrails).
3. **Ablation Study**:
   - Removing individual modules (such as CroMB and ConMB) verified the effectiveness of each component. The results show that these modules are crucial to the overall performance improvement.

### Conclusion

Sigma addresses the challenges of multi-modal semantic segmentation by combining the Mamba model with new fusion mechanisms, and it demonstrates strong robustness and accuracy, particularly under adverse environmental conditions. This approach provides a new benchmark for future multi-modal learning research.
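The Siamese encoder and fusion step described in the summary can be illustrated with a toy sketch. It assumes the usual meaning of "Siamese", that is, one set of encoder weights applied to both modalities. The functions `encode`, `siamese_encode`, and `select_and_fuse` are hypothetical names, and a simple per-channel sigmoid gate stands in for the Mamba-based CroMB and ConMB blocks, whose internals are not described here.

```python
import math


def encode(pixel, weights):
    """Toy encoder: linear projection followed by ReLU on one pixel vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, pixel))) for row in weights]


def siamese_encode(rgb_pixel, x_pixel, shared_weights):
    """Apply the same encoder weights to the RGB and X-modality inputs."""
    return encode(rgb_pixel, shared_weights), encode(x_pixel, shared_weights)


def select_and_fuse(feat_rgb, feat_x):
    """Stand-in fusion (assumption): gate each channel toward the stronger response."""
    fused = []
    for r, x in zip(feat_rgb, feat_x):
        gate = 1.0 / (1.0 + math.exp(x - r))  # sigmoid(r - x), the RGB weight
        fused.append(gate * r + (1.0 - gate) * x)
    return fused


# Low-light case: the RGB input carries no signal, the thermal input does.
# Both inputs use three channels here purely for illustration.
shared_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
dark_rgb = [0.0, 0.0, 0.0]
thermal = [2.0, 1.0, 1.0]

feat_rgb, feat_x = siamese_encode(dark_rgb, thermal, shared_weights)
fused = select_and_fuse(feat_rgb, feat_x)
```

In this toy case the fused features stay close to the thermal features, which mirrors the motivation for adding an X-modality when RGB is unreliable.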

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation

RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation

RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation

MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification

Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

PPMamba: A Pyramid Pooling Local Auxiliary SSM-Based Model for Remote Sensing Image Semantic Segmentation

A Mamba-Diffusion Framework for Multimodal Remote Sensing Image Semantic Segmentation

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion

On Exploring Shape and Semantic Enhancements for RGB-X Semantic Segmentation

Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation

VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation

FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification

CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

Simple Scalable Multimodal Semantic Segmentation Model

PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery