Abstract: Visual perception is a crucial component of autonomous driving systems. Traditional approaches to visual perception in autonomous driving often rely on a single modality, performing semantic segmentation on RGB images alone. A more effective strategy is to leverage multiple modalities: different sensors on the vehicle provide diverse information, and the complementary features among modalities enhance the robustness of the semantic segmentation model. Contrary to the intuitive belief that more modalities lead to better accuracy, our research reveals that adding modalities to traditional semantic segmentation models can sometimes decrease accuracy. Inspired by the concept of residual thinking, we propose a multimodal visual perception model that is capable of maintaining or even improving accuracy when any modality is added. Our approach is straightforward: RGB serves as the main branch, and the other modal branches use the same feature extraction backbone. The modals score module (MSM) evaluates channel and spatial scores of all modality features, measuring their importance for the overall semantic segmentation. The modal branches then provide additional features to the RGB main branch through the features complementary module (FCM). Applying residual thinking further enhances the feature extraction capabilities of all branches. Extensive experiments lead to several conclusions. Integrating certain modalities into traditional semantic segmentation models tends to reduce segmentation accuracy. In contrast, our simple and scalable multimodal model maintains segmentation accuracy when accommodating any additional modality. Moreover, our approach surpasses some state-of-the-art multimodal semantic segmentation models. Ablation experiments confirm that the MSM, the FCM, and residual thinking contribute significantly to the model's performance.
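
The abstract gives only a high-level description of the MSM and FCM, so the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes features of shape (C, H, W), uses a sigmoid over pooled activations as a stand-in for the learned channel and spatial scores, and adds the scored auxiliary features to the RGB features residually. The function names `modal_scores` and `fuse` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def modal_scores(feat):
    """Hypothetical stand-in for the MSM (an assumption, not the paper's design).

    Returns a channel score of shape (C, 1, 1) and a spatial score of
    shape (1, H, W), each in (0, 1).
    """
    channel = sigmoid(feat.mean(axis=(1, 2), keepdims=True))  # pool over H, W
    spatial = sigmoid(feat.mean(axis=0, keepdims=True))       # pool over C
    return channel, spatial


def fuse(rgb_feat, aux_feats):
    """Hypothetical stand-in for the FCM: residual addition to the RGB branch.

    Each auxiliary feature map is weighted by its scores and added to the
    RGB features, so the RGB signal itself is never overwritten.
    """
    fused = rgb_feat.copy()
    for feat in aux_feats:
        channel, spatial = modal_scores(feat)
        fused = fused + feat * channel * spatial
    return fused


rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4, 4))    # RGB main-branch features
depth = rng.standard_normal((8, 4, 4))  # an example auxiliary modality
silent = np.zeros((8, 4, 4))            # a modality that contributes nothing

out = fuse(rgb, [depth, silent])
rgb_only = fuse(rgb, [silent])
```

In this sketch, an auxiliary branch whose features are all zero leaves the RGB features unchanged, which is one simple way to read the claim that adding a modality should not degrade the main branch. In a trained model the scores would come from learned layers rather than fixed pooling.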

Simple Scalable Multimodal Semantic Segmentation Model
