Cascaded Cross-Modality Fusion Network for 3D Object Detection

Zhiyu Chen, Qiong Lin, Jing Sun, Yujian Feng, Shangdong Liu, Qiang Liu, Yimu Ji, He Xu

DOI: https://doi.org/10.3390/s20247243

IF: 3.9

2020-12-17

Sensors

Abstract: We focus on exploring LIDAR-RGB fusion-based 3D object detection in this paper. This task is still challenging in two aspects: (1) the difference in data formats and sensor positions contributes to the misalignment of reasoning between the semantic features of images and the geometric features of point clouds; (2) the optimization of traditional IoU is not equal to the regression loss of bounding boxes, resulting in biased back-propagation for non-overlapping cases. In this work, we propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion module (CMF) and a novel center 3D IoU loss to resolve these two issues. Our CMF module is developed to reinforce the discriminative representation of objects by reasoning about the relation between the corresponding LIDAR geometric capability and RGB semantic capability of the object from the two modalities. Specifically, CMF is added in a cascaded way between the RGB and LIDAR streams; it selects salient points and transmits multi-scale point cloud features to each stage of the RGB stream. Moreover, our center 3D IoU loss incorporates the distance between anchor centers to avoid oversimplified optimization for non-overlapping bounding boxes. Extensive experiments on the KITTI benchmark have demonstrated that our proposed approach performs better than the compared methods.

engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation

What problem does this paper attempt to address?

This paper addresses the challenges of cross-modal fusion of LIDAR and RGB data in 3D object detection. Specifically, the authors point out two main difficulties in this task:

1. **Differences in data formats and sensor positions lead to misaligned reasoning between the semantic features of images and the geometric features of point clouds**: Because LIDAR and RGB sensors produce data in different formats and are installed at different positions on the vehicle, it is difficult to align the semantic information extracted from the image with the geometric information extracted from the point cloud. This misalignment affects the model's ability to recognize and locate target objects.

2. **Traditional IoU optimization is not equivalent to the bounding box regression loss, resulting in biased back-propagation in non-overlapping cases**: In 3D object detection, when the predicted bounding box does not overlap with the ground-truth bounding box, the traditional IoU loss function cannot provide effective gradient updates, which biases model training. This situation is particularly serious in 3D detection because predicted and ground-truth boxes in 3D space are more likely to overlap only partially or not at all.

To address these challenges, the authors propose a Cascaded Cross-Modality Fusion Network (CCFNet), which includes a Cascaded Multi-scale Fusion module (CMF) and a new center 3D IoU loss function. Through these two key techniques, CCFNet aims to improve the alignment between data from the two modalities and to improve the model's learning in non-overlapping cases.
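The non-overlapping case described above can be illustrated with a small sketch. The exact form of the paper's center 3D IoU loss is not given here, so the code below is only an assumption: it uses axis-aligned boxes in a hypothetical `(cx, cy, cz, w, h, l)` layout and a DIoU-style penalty (squared center distance divided by the squared diagonal of the smallest enclosing box). The function names `aabb_iou_3d` and `center_distance_iou_loss` are illustrative, not the authors' API.

```python
def aabb_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, h, l)."""
    inter = 1.0
    for i in range(3):
        a_min, a_max = a[i] - a[i + 3] / 2, a[i] + a[i + 3] / 2
        b_min, b_max = b[i] - b[i + 3] / 2, b[i] + b[i + 3] / 2
        overlap = min(a_max, b_max) - max(a_min, b_min)
        if overlap <= 0:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= overlap
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def center_distance_iou_loss(pred, gt):
    """Assumed DIoU-style loss: 1 - IoU + d^2 / c^2.

    d is the distance between box centers; c is the diagonal of the
    smallest axis-aligned box enclosing both boxes.
    """
    iou = aabb_iou_3d(pred, gt)
    d2 = sum((pred[i] - gt[i]) ** 2 for i in range(3))
    c2 = 0.0
    for i in range(3):
        lo = min(pred[i] - pred[i + 3] / 2, gt[i] - gt[i + 3] / 2)
        hi = max(pred[i] + pred[i + 3] / 2, gt[i] + gt[i + 3] / 2)
        c2 += (hi - lo) ** 2
    return 1.0 - iou + d2 / c2


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
near = (3.0, 0.0, 0.0, 2.0, 2.0, 2.0)   # does not overlap gt, but close
far = (10.0, 0.0, 0.0, 2.0, 2.0, 2.0)   # does not overlap gt, far away

# Plain 1 - IoU cannot tell the two predictions apart: both give 1.0.
print(1.0 - aabb_iou_3d(near, gt), 1.0 - aabb_iou_3d(far, gt))
# The center-distance term ranks the nearer prediction as better.
print(center_distance_iou_loss(near, gt), center_distance_iou_loss(far, gt))
```

In this sketch both non-overlapping predictions have the same plain IoU loss, so that loss gives no signal about which one is closer to the target, while the added center term yields a smaller loss for the nearer box. Rotated boxes, which KITTI uses, would need a rotated-intersection routine instead of the axis-aligned overlap assumed here.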

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

EPNet++: Cascade Bi-Directional Fusion for Multi-Modal 3D Object Detection

CrossFusion: Interleaving Cross-modal Complementation for Noise-resistant 3D Object Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D Object Detection

BAFusion: Bidirectional Attention Fusion for 3D Object Detection Based on LiDAR and Camera

Cascaded Multi-3D-view Fusion for 3D-Oriented Object Detection

ImLiDAR: Cross-Sensor Dynamic Message Propagation Network for 3-D Object Detection

Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

CAF-RCNN: multimodal 3D object detection with cross-attention

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection

Multi-Modality Task Cascade for 3D Object Detection

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

Multi-View Adaptive Fusion Network for 3D Object Detection

PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection