Abstract: Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although they achieve state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. The emergence of visual foundation models (VFMs) presents both opportunities and challenges for improving the robustness and generalization of multi-modal 3D object detection in AD. We therefore propose RoboFusion, a robust framework that leverages VFMs such as SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM to AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at
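
As a rough illustration of the wavelet-denoising idea mentioned in the abstract, the following is a minimal sketch, not the RoboFusion implementation. It assumes a single-level Haar transform on a 1D signal (for example, one row of a feature map) with soft thresholding of the detail coefficients. The function names, the choice of the Haar wavelet, and the threshold value are all illustrative assumptions; the abstract does not specify them.

```python
import math


def haar_forward(signal):
    """Single-level Haar transform: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold` (assumed value)."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def denoise(signal, threshold):
    """Keep the low-frequency part, shrink the high-frequency (detail) part."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))


if __name__ == "__main__":
    noisy_row = [1.0, 1.1, 5.0, 4.9]
    print(denoise(noisy_row, threshold=0.2))
```

Small fluctuations inside each sample pair are treated as noise and removed, while the large step between the two pairs survives because it lives in the approximation coefficients. A practical pipeline would more likely use a 2D multi-level transform from a wavelet library.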

What problem does this paper attempt to address?

The paper addresses the insufficient robustness and generalization of existing multi-modal 3D object detectors for autonomous driving under complex, harsh real-world conditions. Although these detectors achieve state-of-the-art performance on clean datasets, they often overlook real-world conditions such as rain, snow, fog, and strong light. As a result, their performance degrades significantly when they encounter these unseen out-of-distribution (OOD) noise scenarios.

To address this challenge, the paper proposes the RoboFusion framework, which uses visual foundation models (VFMs) such as the Segment Anything Model (SAM) to improve the robustness and generalization of multi-modal 3D object detection in OOD noise scenarios. Specifically, RoboFusion has four components:

1. **Adapting SAM**: The paper first adapts the original SAM to the autonomous driving scenario; the adapted model is called SAM-AD.
2. **Feature upsampling**: To align SAM or SAM-AD with multi-modal methods, the AD-FPN module is introduced to upsample the image features.
3. **Wavelet decomposition denoising**: Wavelet decomposition is used to denoise depth-guided images, further reducing noise and weather interference.
4. **Self-attention mechanism**: Self-attention is used to adaptively reweight the fused features, enhancing informative features while suppressing excess noise.

Through these components, RoboFusion significantly reduces the influence of noise and improves the robustness and generalization of multi-modal 3D object detection in noisy environments. Experimental results show that RoboFusion achieves state-of-the-art performance in noisy scenarios on the KITTI-C and nuScenes-C benchmarks.
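
The self-attention reweighting of fused features described above can be illustrated with a minimal sketch. This is not the paper's fusion module: it assumes plain scaled dot-product self-attention with identity query/key/value projections (no learned weights), applied to a small list of hypothetical fused feature vectors. The function names and the toy inputs are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length feature vectors (assumed fused features).
    Returns (reweighted_tokens, attention_weights).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all tokens.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(dim)]
        )
    return outputs, weights


if __name__ == "__main__":
    fused = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy fused features
    reweighted, attn = self_attention(fused)
    print(attn[0])
```

In this toy input, the first two tokens agree with each other, so each attends more strongly to the other than to the third, dissimilar token. A trained module would additionally learn the projections that decide which features count as informative.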

RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM
