Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Shaoqing Xu, Fang Li, Ziying Song, Jin Fang, Sifen Wang, Zhi-Xin Yang

2023-06-17

Abstract: LiDAR and camera fusion techniques are promising for 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, which is primarily attributed to erroneous semantic segmentation. To address this limitation, we propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), that fuses the semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, we employ 2D/3D semantic segmentation methods to generate the parsing results for 2D images and 3D point clouds. The 2D semantic information is then reprojected into the 3D point clouds using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then passed to the downstream 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module that aggregates deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing it with different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance compared with methods that use only point clouds and methods that use only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
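The reprojection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single combined 3x4 LiDAR-to-pixel projection matrix (the hypothetical `lidar_to_pixel`, standing in for the calibration parameters) and a per-pixel class-score map from some 2D segmentation network. The function name `paint_points` and the nested-list data layout are likewise assumptions chosen to keep the example dependency-free; a practical version would be vectorized.

```python
import math
from typing import List, Optional, Sequence


def paint_points(
    points: Sequence[Sequence[float]],
    seg_scores: Sequence[Sequence[Sequence[float]]],
    lidar_to_pixel: Sequence[Sequence[float]],
) -> List[Optional[List[float]]]:
    """Attach 2D semantic scores to 3D points (illustrative sketch).

    points:         (x, y, z) coordinates in the LiDAR frame.
    seg_scores:     H x W x C per-pixel class scores (assumed input).
    lidar_to_pixel: assumed 3x4 matrix mapping homogeneous LiDAR
                    coordinates to homogeneous pixel coordinates.
    Returns one score vector per point, or None when the point does not
    project into the image.
    """
    height = len(seg_scores)
    width = len(seg_scores[0])
    painted: List[Optional[List[float]]] = []
    for x, y, z in points:
        homogeneous = (x, y, z, 1.0)
        # Matrix-vector product: one output value per matrix row.
        u_h, v_h, depth = (
            sum(row[k] * homogeneous[k] for k in range(4))
            for row in lidar_to_pixel
        )
        if depth <= 0:
            painted.append(None)  # point lies behind the camera
            continue
        # Perspective division, then snap to the containing pixel.
        u = math.floor(u_h / depth)
        v = math.floor(v_h / depth)
        if 0 <= u < width and 0 <= v < height:
            painted.append(list(seg_scores[v][u]))
        else:
            painted.append(None)  # outside the camera field of view
    return painted
```

Because each point simply copies the scores of the pixel it lands on, a segmentation error near an object edge is carried straight into the point cloud, which is one way to picture the boundary-blurring effect the abstract mentions.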

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in multimodal 3D object detection:

1. **Boundary Blurring Effect**: When the semantic information of 2D images is projected onto 3D point clouds, the limited resolution of 2D feature maps often leads to boundary blurring, mainly because of incorrect semantic segmentation.
2. **Fusion of Different Modal Information**: Existing multimodal 3D object detection frameworks typically integrate the semantic knowledge of 2D images into 3D LiDAR point clouds to improve detection accuracy. However, effectively fusing 2D and 3D semantic information remains a challenge.
3. **Utilization of Multi-Scale Features**: Detecting objects of different sizes efficiently requires multi-scale receptive fields that capture both global contextual information and local spatial details.

To address these issues, the paper proposes a multimodal fusion framework, **Multi-Sem Fusion (MSF)**, which includes the following components:

- **2D/3D Semantic Parsing**: Semantic segmentation methods generate parsing results for the 2D images and the 3D point clouds, and the 2D semantic information is reprojected onto the 3D point clouds.
- **Adaptive Attention-based Fusion (AAF) Module**: An adaptive attention mechanism learns a fusion score for each point or voxel and uses it to fuse the 2D and 3D semantic information, thereby addressing the inconsistency between the 2D and 3D parsing results.
- **Deep Feature Fusion (DFF) Module**: A deep feature fusion module aggregates features from different levels, exploiting multi-scale receptive fields to improve detection of objects of different sizes.

The paper validates the framework on two large-scale public 3D object detection benchmarks (nuScenes and KITTI) and reports significant performance improvements, including state-of-the-art results on the nuScenes test benchmark.
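The idea of a learned per-point fusion score can be illustrated with a small sketch. This is not the AAF module from the paper, whose architecture is not given here: it assumes, purely for illustration, that the fusion score is a sigmoid of a linear function of the concatenated 2D and 3D class scores. The names `fuse_scores`, `weights`, and `bias` are hypothetical, and the fallback to the 3D scores when no 2D scores are available is also an assumption.

```python
import math
from typing import List, Optional, Sequence


def fuse_scores(
    scores_2d: Optional[Sequence[float]],
    scores_3d: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Blend 2D and 3D semantic scores for one point (illustrative sketch).

    scores_2d: class scores reprojected from the image, or None if the
               point is not visible in any camera (assumed convention).
    scores_3d: class scores from the 3D segmentation network.
    weights, bias: hypothetical learned parameters of a linear scorer
               over the concatenated [scores_2d, scores_3d] features.
    """
    if scores_2d is None:
        # Assumption: without image evidence, keep the 3D result.
        return list(scores_3d)
    features = list(scores_2d) + list(scores_3d)
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    # The sigmoid maps the logit to a fusion score in (0, 1).
    alpha = 1.0 / (1.0 + math.exp(-logit))
    # Convex combination: alpha trusts 2D, (1 - alpha) trusts 3D.
    return [alpha * s2 + (1.0 - alpha) * s3 for s2, s3 in zip(scores_2d, scores_3d)]
```

With all-zero weights and zero bias the score is 0.5 and the two sources are averaged; training would instead move the score toward whichever source is more reliable for a given point, for example toward the 3D result near object boundaries where the reprojected 2D labels tend to bleed.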

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection

A Generalized Multi-Modal Fusion Detection Framework

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

Multi-Modal and Multi-Scale Fusion 3D Object Detection of 4D Radar and LiDAR for Autonomous Driving

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

Dense Sequential Fusion: Point Cloud Enhancement Using Foreground Mask Guidance for Multimodal 3-D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection

EPAWFusion: multimodal fusion for 3D object detection based on enhanced points and adaptive weights

Sparse Dense Fusion for 3D Object Detection

EPMF: Efficient Perception-Aware Multi-Sensor Fusion for 3D Semantic Segmentation

A Multi-phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation with Weak Supervision

BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

Deep multi-scale and multi-modal fusion for 3D object detection

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images