Abstract: Visual perception is a crucial component of autonomous driving systems. Traditional approaches to visual perception for autonomous driving often rely on single-modal methods, performing semantic segmentation on RGB images alone. A more effective strategy is to leverage multiple modalities, because the different sensors of an autonomous driving system provide diverse information, and the complementary features among modalities enhance the robustness of the semantic segmentation model. Contrary to the intuitive belief that more modalities lead to better accuracy, our research reveals that adding modalities to traditional semantic segmentation models can sometimes decrease precision. Inspired by the residual thinking concept, we propose a multimodal visual perception model that is capable of maintaining or even improving accuracy with the addition of any modality. Our approach is straightforward: it uses RGB as the main branch and employs the same feature extraction backbone for the other modal branches. The modals score module (MSM) evaluates channel and spatial scores of all modality features, measuring their importance for overall semantic segmentation. The modal branches then provide additional features to the RGB main branch through the features complementary module (FCM). Leveraging the residual thinking concept further enhances the feature extraction capabilities of all the branches. Through extensive experiments, we derived several conclusions. Integrating certain modalities into traditional semantic segmentation models tends to reduce segmentation accuracy. In contrast, our simple and scalable multimodal model maintains segmentation precision when accommodating any additional modality. Moreover, our approach surpasses some state-of-the-art multimodal semantic segmentation models. Ablation experiments on the proposed model confirm that the MSM, the FCM, and the incorporation of residual thinking each contribute significantly to its performance.

What problem does this paper attempt to address?

The paper addresses multimodal semantic segmentation in visual perception systems for autonomous driving. Traditional methods are usually single-modal and perform semantic segmentation on RGB images alone. Different sensors, however, supply complementary features that can make the segmentation model more robust. Although it is intuitive to expect more modalities to improve accuracy, the authors found that adding modalities to traditional semantic segmentation models can in some cases reduce it. To solve this problem, the authors, inspired by the concept of residual thinking, propose a multimodal visual perception model that maintains or even improves accuracy when any new modality is added. Specifically, the model uses RGB as the main branch and employs the same feature extraction backbone for the other modalities. The Modality Score Module (MSM) evaluates channel and spatial scores for all modality features to measure their importance to the overall segmentation. The Feature Complement Module (FCM) then supplements the RGB main branch with features from the additional modalities, and the residual design further strengthens feature extraction in every branch. Experimental results show that the model maintains segmentation accuracy after new modalities are added and surpasses some state-of-the-art multimodal semantic segmentation models. In summary, the paper tackles how to integrate information from multiple modalities effectively to improve semantic segmentation for autonomous driving.
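The scoring-then-complementing idea can be illustrated with a small sketch. This is not the authors' implementation: the text above does not specify how the MSM computes its scores or how the FCM merges features. The sketch assumes a channel score taken as the sigmoid of each channel's global average, a spatial score taken as the sigmoid of the per-pixel mean across channels, and a purely additive merge into the RGB branch. It also scores only the auxiliary modalities, whereas the paper scores all of them. The function names `channel_scores`, `spatial_scores`, and `fuse` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_scores(feat):
    """One score per channel (assumed form: sigmoid of the channel's global average)."""
    scores = []
    for channel in feat:
        values = [v for row in channel for v in row]
        scores.append(sigmoid(sum(values) / len(values)))
    return scores


def spatial_scores(feat):
    """One score per pixel (assumed form: sigmoid of the mean across channels)."""
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[c][y][x] for c in range(n_ch)) / n_ch) for x in range(width)]
        for y in range(height)
    ]


def fuse(rgb, others):
    """Residual-style fusion (assumed form).

    The RGB features pass through unchanged, and every auxiliary modality
    adds its own features weighted by channel score times spatial score.
    All feature maps are nested lists shaped [channel][row][column].
    """
    fused = [[row[:] for row in channel] for channel in rgb]  # identity path
    for feat in others:
        ch = channel_scores(feat)
        sp = spatial_scores(feat)
        for c, channel in enumerate(feat):
            for y, row in enumerate(channel):
                for x, value in enumerate(row):
                    fused[c][y][x] += ch[c] * sp[y][x] * value
    return fused


# Toy 1-channel, 2x2 feature maps standing in for backbone outputs.
rgb_feat = [[[1.0, 2.0], [3.0, 4.0]]]
depth_feat = [[[2.0, 2.0], [2.0, 2.0]]]
print(fuse(rgb_feat, [depth_feat]))
```

Because the RGB features travel along an identity path, a modality with all-zero features contributes nothing in this sketch, which mirrors the stated goal that adding a modality should not degrade the main branch. Any number of extra modalities can be passed in `others` without changing the function.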

Simple Scalable Multimodal Semantic Segmentation Model

A Scalable Real-time Semantic Segmentation Network for Autonomous Driving

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

In Defense Of Multi-Source Omni-Supervised Efficient Convnet For Robust Semantic Segmentation In Heterogeneous Unseen Domains

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

A Joint Object Detection and Semantic Segmentation Model with Cross-Attention and Inner-Attention Mechanisms

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation

Mitigating Modality Discrepancies for RGB-T Semantic Segmentation

Semantic segmentation of autonomous driving scenes based on multi-scale adaptive attention mechanism

Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

Semantic Reconstruction based on RGB Image and Sparse Depth

MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving

Object Segmentation by Mining Cross-Modal Semantics

Multi-Modal Prototypes for Open-World Semantic Segmentation

Segment Anything with Multiple Modalities

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation

A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images

MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation

A Cyclic Information–Interaction Model for Remote Sensing Image Segmentation