Multi-modal 3D object detection by 2D-guided precision anchor proposal and multi-layer fusion

Yi Wu, Xiaoyan Jiang, Zhijun Fang, Yongbin Gao, Hamido Fujita
DOI: https://doi.org/10.1016/j.asoc.2021.107405
IF: 8.7
2021-09-01
Applied Soft Computing
Abstract: 3D object detection, of which the goal is to obtain the 3D spatial structure information of the object, is a challenging topic in many visual perception systems, e.g., autonomous driving, augmented reality, and robot navigation. Most existing region proposal network (RPN) based 3D object detection methods generate anchors in the whole 3D searching space without using semantic information, which leads to the problem of inappropriate anchor size generation. To tackle the issue, we propose a 2D-guided precision anchor generation network (PAG-Net). Specifically speaking, we utilize a mature 2D detector to get 2D bounding boxes and category labels of objects as prior information. Then the 2D bounding boxes are projected into 3D frustum space for more precise and category-adaptive 3D anchors. Furthermore, current feature combination methods are early fusion, late fusion, and deep fusion, which only fuse features from high convolutional layers and ignore the data missing problem of point clouds. To obtain more efficient fusion of RGB images and point clouds features, we propose a multi-layer fusion model, which conducts nonlinear and iterative combinations of features from multiple convolutional layers and merges the global and local features effectively. We encode point cloud with the bird’s eye view (BEV) representation to solve the irregularity of point cloud. Experimental results show that our proposed approach improves the baseline by a large margin and outperforms most of the state-of-the-art methods on the KITTI object detection benchmark.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper focuses on several key issues in 3D object detection and proposes a multi-modal 3D object detection method. Specifically, it attempts to solve the following problems:

1. **Inappropriate anchor box size**: Existing methods based on a 3D region proposal network (RPN) usually ignore semantic information when generating anchor boxes, which results in inappropriate anchor box sizes. The effect is most pronounced for small and occluded objects and seriously degrades the final detection performance.
2. **Insufficient multi-modal feature fusion**: Current multi-modal 3D object detection methods usually fuse only the features of high-level convolutional layers and ignore the information in low-level layers. Useful information is lost, and detection performance suffers.
3. **Representation and processing of point cloud data**: Point cloud data is sparse and irregular. Traditional representations (such as projection onto regular 3D voxel grids) not only increase computational complexity but also lose important 3D pattern information. How to represent and process point cloud data efficiently remains an open problem.

### Solutions

To address the above problems, the paper proposes the following solutions:

1. **2D-guided precision anchor generation network (PAG-Net)**: A mature 2D detector provides 2D bounding boxes and class labels as prior information. The 2D bounding boxes are projected into 3D frustum space through the known camera projection matrix to generate more precise, class-adaptive 3D anchor boxes. This significantly improves the detection performance on small objects.
2. **Multi-layer feature fusion model**: The model combines the features of multiple convolutional layers in a nonlinear and iterative way, effectively merging global and local features. This makes the fused features more robust and more discriminative.
3. **Improved 3D bounding box encoding method**: The 3D bounding box is represented by four corner points and a height, which reduces regression redundancy. Physical constraints and semantic information are used to improve the accuracy of the detection results.

### Experimental verification

The paper evaluates the method on the KITTI object detection benchmark, focusing on the 3D and BEV detection tasks. The experimental results show that the proposed multi-modal 3D object detection method achieves significant performance improvements for cars, pedestrians, and cyclists at different difficulty levels, especially for small and occluded objects.
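The 2D-guided anchor generation described under Solutions can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a simple pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` standing in for the full camera projection matrix. The function names, the depth samples, and the per-class sizes in `ANCHOR_SIZES` are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-class anchor sizes (length, width, height) in metres.
# Illustrative values only, not taken from the paper.
ANCHOR_SIZES = {
    "Car": (3.9, 1.6, 1.5),
    "Pedestrian": (0.8, 0.6, 1.7),
    "Cyclist": (1.8, 0.6, 1.7),
}


@dataclass
class Anchor:
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    label: str


def back_project(u, v, depth, fx, fy, cx, cy):
    """Invert the pinhole projection u = fx*x/z + cx, v = fy*y/z + cy."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def frustum_corners(box, near, far, fx, fy, cx, cy):
    """Eight camera-frame corners of the frustum behind a 2D box (u1, v1, u2, v2)."""
    u1, v1, u2, v2 = box
    pixels = [(u1, v1), (u2, v1), (u2, v2), (u1, v2)]
    return [back_project(u, v, d, fx, fy, cx, cy)
            for d in (near, far) for (u, v) in pixels]


def frustum_anchors(box, label, depths, fx, fy, cx, cy):
    """Place class-sized anchors along the ray through the 2D box centre."""
    u1, v1, u2, v2 = box
    uc, vc = (u1 + u2) / 2, (v1 + v2) / 2
    length, width, height = ANCHOR_SIZES[label]
    anchors = []
    for d in depths:
        x, y, z = back_project(uc, vc, d, fx, fy, cx, cy)
        anchors.append(Anchor(x, y, z, length, width, height, label))
    return anchors
```

For example, `frustum_anchors((550, 150, 650, 210), "Car", [10.0, 20.0], 700.0, 700.0, 600.0, 180.0)` yields two car-sized anchors on the optical axis. The point of the sketch is that the class label selects the anchor size and the 2D box restricts where anchors are placed, instead of tiling the whole 3D search space.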
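The box encoding based on four corner points and a height can be sketched in a similar way. The summary does not give the exact layout, so the corner ordering, the choice of the BEV ground plane, and the function names `encode_box` and `decode_box` below are assumptions made for illustration.

```python
import math


def encode_box(cx, cy, length, width, height, yaw):
    """Encode a BEV-oriented box as [x1, y1, ..., x4, y4, height] (9 values)."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Corner offsets in the box frame: front-left, front-right, rear-right, rear-left.
    offsets = [(length / 2, width / 2), (length / 2, -width / 2),
               (-length / 2, -width / 2), (-length / 2, width / 2)]
    encoded = []
    for dx, dy in offsets:
        # Rotate each offset by yaw, then translate to the box centre.
        encoded.append(cx + dx * cos_y - dy * sin_y)
        encoded.append(cy + dx * sin_y + dy * cos_y)
    encoded.append(height)
    return encoded


def decode_box(encoded):
    """Recover (cx, cy, length, width, height, yaw) from the 9-value encoding."""
    xs, ys = encoded[0:8:2], encoded[1:8:2]
    cx, cy = sum(xs) / 4, sum(ys) / 4
    # Front-left minus rear-left points along the box heading.
    hx, hy = xs[0] - xs[3], ys[0] - ys[3]
    length = math.hypot(hx, hy)
    width = math.hypot(xs[0] - xs[1], ys[0] - ys[1])
    return cx, cy, length, width, encoded[8], math.atan2(hy, hx)
```

With `yaw = 0`, `encode_box(10.0, 2.0, 4.0, 2.0, 1.5, 0.0)` gives the corners (12, 3), (12, 1), (8, 1), (8, 3) followed by the height 1.5. Note that four freely regressed corners need not form a rectangle. How the paper applies its physical constraints to handle that case is not stated in the summary, so this sketch only round-trips well-formed boxes.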