Cascaded Cross-Modality Fusion Network for 3D Object Detection

Zhiyu Chen, Qiong Lin, Jing Sun, Yujian Feng, Shangdong Liu, Qiang Liu, Yimu Ji, He Xu
DOI: https://doi.org/10.3390/s20247243
IF: 3.9
2020-12-17
Sensors
Abstract: We focus on exploring the LIDAR-RGB fusion-based 3D object detection in this paper. This task is still challenging in two aspects: (1) the difference of data formats and sensor positions contributes to the misalignment of reasoning between the semantic features of images and the geometric features of point clouds. (2) The optimization of traditional IoU is not equal to the regression loss of bounding boxes, resulting in biased back-propagation for non-overlapping cases. In this work, we propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion module (CMF) and a novel center 3D IoU loss to resolve these two issues. Our CMF module is developed to reinforce the discriminative representation of objects by reasoning the relation of corresponding LIDAR geometric capability and RGB semantic capability of the object from two modalities. Specifically, CMF is added in a cascaded way between the RGB and LIDAR streams, which selects salient points and transmits multi-scale point cloud features to each stage of RGB streams. Moreover, our center 3D IoU loss incorporates the distance between anchor centers to avoid the oversimple optimization for non-overlapping bounding boxes. Extensive experiments on the KITTI benchmark have demonstrated that our proposed approach performs better than the compared methods.
Engineering, Electrical & Electronic; Chemistry, Analytical; Instruments & Instrumentation
What problem does this paper attempt to address?
This paper addresses the challenges of cross-modal fusion of LIDAR and RGB data for 3D object detection. The authors identify two main difficulties:

1. **Differences in data formats and sensor positions misalign the semantic features of images with the geometric features of point clouds**: Because LIDAR and RGB sensors produce data in different formats and are mounted at different positions on the vehicle, it is difficult to align the semantic information extracted from the image with the geometric information extracted from the point cloud. This misalignment degrades the model's ability to recognize and localize target objects.
2. **Optimizing traditional IoU is not equivalent to minimizing the bounding box regression loss, which biases back-propagation in non-overlapping cases**: When the predicted bounding box does not overlap the ground-truth bounding box, the IoU is zero no matter how far apart the two boxes are, so a traditional IoU loss cannot provide an effective gradient and training becomes biased. The problem is particularly serious in 3D detection, where boxes are more likely to overlap only partially or not at all.

To address these challenges, the authors propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion (CMF) module and a new center 3D IoU loss. With these two components, CCFNet aims to improve the alignment between the two modalities and to improve learning in non-overlapping cases.
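
The non-overlap issue can be illustrated with a small sketch. The text here does not give the paper's exact center 3D IoU formulation, so the code below is only an assumption-laden illustration: it uses axis-aligned boxes (yaw is ignored) and a DIoU-style penalty, the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. The names `Box3D`, `iou_3d`, and `center_iou_loss` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Axis-aligned 3D box: center (cx, cy, cz) and full sizes (dx, dy, dz).
    Assumption: rotation (yaw) is ignored to keep the sketch short."""
    cx: float
    cy: float
    cz: float
    dx: float
    dy: float
    dz: float


def _overlap_1d(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the overlap of two intervals given by center and size."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)


def _span_1d(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the smallest interval that encloses both intervals."""
    lo = min(c1 - s1 / 2, c2 - s2 / 2)
    hi = max(c1 + s1 / 2, c2 + s2 / 2)
    return hi - lo


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Plain 3D IoU; it is 0 for every pair of non-overlapping boxes."""
    inter = (_overlap_1d(a.cx, a.dx, b.cx, b.dx)
             * _overlap_1d(a.cy, a.dy, b.cy, b.dy)
             * _overlap_1d(a.cz, a.dz, b.cz, b.dz))
    union = a.dx * a.dy * a.dz + b.dx * b.dy * b.dz - inter
    return inter / union


def center_iou_loss(a: Box3D, b: Box3D) -> float:
    """1 - IoU plus a normalized center-distance penalty (DIoU-style,
    an assumed stand-in for the paper's center 3D IoU loss)."""
    d2 = (a.cx - b.cx) ** 2 + (a.cy - b.cy) ** 2 + (a.cz - b.cz) ** 2
    c2 = (_span_1d(a.cx, a.dx, b.cx, b.dx) ** 2
          + _span_1d(a.cy, a.dy, b.cy, b.dy) ** 2
          + _span_1d(a.cz, a.dz, b.cz, b.dz) ** 2)
    return 1.0 - iou_3d(a, b) + d2 / c2


if __name__ == "__main__":
    gt = Box3D(0, 0, 0, 1, 1, 1)
    near = Box3D(3, 0, 0, 1, 1, 1)  # no overlap, 3 units away
    far = Box3D(5, 0, 0, 1, 1, 1)   # no overlap, 5 units away
    # Plain IoU loss is 1.0 for both predictions, so it cannot rank them.
    print(1 - iou_3d(gt, near), 1 - iou_3d(gt, far))
    # The center-distance term makes the farther box cost more.
    print(center_iou_loss(gt, near), center_iou_loss(gt, far))
```

In this sketch both non-overlapping predictions have the same plain IoU loss, while the center-aware loss is larger for the farther box, which is the kind of signal the authors describe as missing from traditional IoU optimization.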