Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, Yaqi Xie
2024-09-13
Abstract: Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the challenges of multi-modal semantic segmentation, particularly under adverse environmental conditions such as low light or overexposure, with the goal of improving the perception and scene understanding of AI agents. Specifically, it proposes a method called Sigma, which uses the Mamba model to make semantic segmentation more robust and reliable by fusing traditional RGB images with other modalities (such as thermal imaging and depth information).

### Background and Motivation

1. **Limitations of Existing Methods**:
   - **Convolutional Neural Networks (CNNs)**: They scale linearly with input size, but their receptive field is limited by the size of the convolution kernel, so they capture mainly local information.
   - **Vision Transformers (ViTs)**: They provide a global receptive field and dynamic weights, but the self-attention mechanism has quadratic complexity with respect to input size, which reduces efficiency.
2. **Advantages of Multi-modal Information**:
   - Additional modalities (such as thermal imaging and depth information) provide complementary information, making the visual system more robust and capable.
   - However, aligning and fusing multi-modal information presents new challenges.

### Proposed Method

1. **Sigma Model**:
   - **Siamese Mamba Encoder**: Extracts features from the different modalities.
   - **Fusion Module**: Selects and fuses information from the different modalities through a Cross Mamba Block (CroMB) and a Concat Mamba Block (ConMB).
   - **Channel-aware Decoder**: Enhances the model's ability to model both the spatial and channel dimensions.
2. **Innovations**:
   - **First Application of State Space Models (SSMs)**: The paper presents the first successful application of SSMs, specifically Mamba, to multi-modal semantic segmentation.
   - **Efficient Fusion Mechanism**: Achieves cross-modal information fusion through Mamba-based blocks.
   - **Comprehensive Experimental Validation**: Demonstrates superior accuracy and efficiency on RGB-thermal and RGB-depth datasets.

### Experimental Results

1. **Quantitative Analysis**:
   - On the MFNet and PST900 datasets, Sigma significantly outperforms other methods while using fewer parameters and less computation.
   - On the PST900 dataset in particular, performance improved by over 2%.
2. **Qualitative Analysis**:
   - Visualized results show that Sigma produces more complete segmentation and more accurate classification, especially for complex features (such as tactile paving and guardrails).
3. **Ablation Study**:
   - Removing individual modules (such as CroMB and ConMB) verified the contribution of each component. The results show that these modules are crucial to the overall performance improvement.

### Conclusion

Sigma addresses the challenges of multi-modal semantic segmentation by combining the Mamba model with new fusion mechanisms, demonstrating strong robustness and accuracy, particularly under adverse environmental conditions. This approach provides a new benchmark for future multi-modal learning research.
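
To make the complexity contrast under Background and Motivation concrete, the following is a minimal Python sketch and not the authors' implementation. It runs a fixed-parameter, diagonal state space recurrence over a 1-D sequence in a single pass, and compares the number of state updates with the number of query-key pairs that full self-attention would score. Mamba itself makes the recurrence parameters input-dependent and uses a hardware-aware scan, neither of which is modeled here. The names `ssm_scan`, `attention_pair_count`, and `scan_step_count`, and the parameters `a`, `b`, `c`, are hypothetical and chosen only for illustration.

```python
from typing import List


def ssm_scan(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run a diagonal linear state space recurrence over a 1-D sequence.

    Assumed (illustrative) recurrence, elementwise over the state dimension:
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for x_t in x:  # one pass over the sequence: cost grows as N * state_dim
        state = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


def attention_pair_count(n: int) -> int:
    """Number of query-key pairs full self-attention scores for n tokens."""
    return n * n


def scan_step_count(n: int, state_dim: int) -> int:
    """Number of per-channel state updates ssm_scan performs for n tokens."""
    return n * state_dim


if __name__ == "__main__":
    # An impulse input decays geometrically through the single state channel.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))  # [1.0, 0.5, 0.25]
    # Quadratic versus linear growth for a 1024-token sequence.
    print(attention_pair_count(1024), scan_step_count(1024, 16))  # 1048576 16384
```

Because each output depends on the running state, every position can still be influenced by all earlier positions, which is the sense in which a scan of this kind keeps a global receptive field along the scan direction while its cost grows linearly with sequence length.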