Abstract: LiDAR and camera fusion techniques are promising for 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, which is primarily attributed to erroneous semantic segmentation. To address this limitation, we propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), that fuses the semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, we employ 2D and 3D semantic segmentation methods to generate the parsing results for 2D images and 3D point clouds. The 2D semantic information is then re-projected into the 3D point clouds using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the subsequent 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module that aggregates deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing it with different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance compared with methods that use only point clouds and methods that use only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
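The re-projection and score-fusion steps described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes a single 3x4 projection matrix that already combines the LiDAR-to-camera extrinsics with the camera intrinsics, and it replaces the learned AAF score with one scalar, `gate_logit`, passed through a sigmoid. All function and parameter names are hypothetical.

```python
import math


def project_point(point_xyz, proj):
    """Project a 3D point to pixel coordinates with a 3x4 projection matrix.

    Assumption: `proj` maps LiDAR coordinates straight to the image plane.
    Returns (u, v), or None when the point lies behind the camera.
    """
    x, y, z = point_xyz
    hom = (x, y, z, 1.0)
    u_h, v_h, depth = (sum(p * h for p, h in zip(row, hom)) for row in proj)
    if depth <= 0:
        return None
    return u_h / depth, v_h / depth


def paint_point(point_xyz, proj, seg_scores_2d):
    """Fetch the 2D per-class scores at the pixel a 3D point projects to.

    `seg_scores_2d` is a height x width grid of per-class score lists.
    Returns None when the point falls outside the image.
    """
    uv = project_point(point_xyz, proj)
    if uv is None:
        return None
    col, row = math.floor(uv[0]), math.floor(uv[1])
    height, width = len(seg_scores_2d), len(seg_scores_2d[0])
    if not (0 <= row < height and 0 <= col < width):
        return None
    return seg_scores_2d[row][col]


def fuse_scores(scores_2d, scores_3d, gate_logit):
    """Blend 2D and 3D class scores with weight sigmoid(gate_logit).

    Hypothetical stand-in for a learned fusion score: in a real model the
    weight would be predicted per point by a network, not passed in.
    Points without a valid 2D projection keep their 3D scores unchanged.
    """
    if scores_2d is None:
        return list(scores_3d)
    weight = 1.0 / (1.0 + math.exp(-gate_logit))
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_2d, scores_3d)]
```

For example, with a pinhole matrix of focal length 100 and principal point (50, 50), the point (0, 0, 10) projects to pixel (50, 50), and `fuse_scores` with `gate_logit=0` averages the two score vectors. A vectorised version over whole point clouds would follow the same three steps.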

SGFNet: Segmentation Guided Fusion Network for 3D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

SGMFNet: a remote sensing image object detection network based on spatial global attention and multi-scale feature fusion

AMFF-Net: An Effective 3D Object Detector Based on Attention and Multi-Scale Feature Fusion

ℱ3-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images

RGB and LiDAR Fusion-based 3D Semantic Segmentation for Autonomous Driving

FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection

GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection

Multi-View Adaptive Fusion Network for 3D Object Detection

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

FGCN: Image-Fused Point Cloud Semantic Segmentation with Fusion Graph Convolutional Network

LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

Multi-Modal 3D Object Detection by Box Matching

Closing the Calibration Gap: A Real-Time Multi-Modal Fusion Framework for 3D Semantic Segmentation

FS-Net: LiDAR-Camera Fusion With Matched Scale for 3D Object Detection in Autonomous Driving

Channelwise and Spatially Guided Multimodal Feature Fusion Network for 3-D Object Detection in Autonomous Vehicles

APPFNet: Adaptive point-pixel fusion network for 3D semantic segmentation with neighbor feature aggregation

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Cascaded Cross-Modality Fusion Network for 3D Object Detection