Abstract: 3D object detection, whose goal is to recover the 3D spatial structure of objects, is a challenging topic in many visual perception systems, e.g., autonomous driving, augmented reality, and robot navigation. Most existing region proposal network (RPN) based 3D object detection methods generate anchors across the whole 3D search space without using semantic information, which leads to inappropriately sized anchors. To tackle this issue, we propose a 2D-guided precision anchor generation network (PAG-Net). Specifically, we use a mature 2D detector to obtain the 2D bounding boxes and category labels of objects as prior information. The 2D bounding boxes are then projected into 3D frustum space to produce more precise, category-adaptive 3D anchors. Furthermore, current feature combination methods (early fusion, late fusion, and deep fusion) fuse only features from high convolutional layers and ignore the missing-data problem of point clouds. To fuse RGB image and point cloud features more efficiently, we propose a multi-layer fusion model, which performs nonlinear, iterative combinations of features from multiple convolutional layers and effectively merges global and local features. We encode the point cloud with the bird’s eye view (BEV) representation to handle its irregularity. Experimental results show that our proposed approach improves the baseline by a large margin and outperforms most of the state-of-the-art methods on the KITTI object detection benchmark.
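The projection of a 2D box into a 3D frustum mentioned in the abstract can be illustrated with a small sketch. It assumes a simple pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a depth range `z_near` to `z_far`; these parameters, and the function names `box_to_frustum` and `in_frustum`, are hypothetical and are not taken from the paper, which works with the KITTI camera projection matrix.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def box_to_frustum(box2d: Box2D, fx: float, fy: float, cx: float, cy: float,
                   z_near: float, z_far: float) -> List[Point3D]:
    """Back-project the four corners of a 2D box at two depths.

    Returns 8 points in camera coordinates: 4 on the near plane,
    then 4 on the far plane. Assumes an ideal pinhole camera.
    """
    u_min, v_min, u_max, v_max = box2d
    corners_2d = [(u_min, v_min), (u_max, v_min), (u_max, v_max), (u_min, v_max)]
    frustum = []
    for z in (z_near, z_far):
        for u, v in corners_2d:
            # Invert u = fx * x / z + cx and v = fy * y / z + cy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            frustum.append((x, y, z))
    return frustum


def in_frustum(point: Point3D, box2d: Box2D, fx: float, fy: float,
               cx: float, cy: float) -> bool:
    """Check whether a 3D point projects inside the 2D box."""
    x, y, z = point
    if z <= 0:  # behind the camera
        return False
    u = fx * x / z + cx
    v = fy * y / z + cy
    u_min, v_min, u_max, v_max = box2d
    return u_min <= u <= u_max and v_min <= v <= v_max


if __name__ == "__main__":
    box = (100.0, 100.0, 200.0, 200.0)
    for corner in box_to_frustum(box, 100.0, 100.0, 150.0, 150.0, 1.0, 10.0):
        print(corner)
```

In a pipeline of this kind, a check like `in_frustum` could be used to keep only candidate anchor centers that fall inside the frustum of a detected 2D box, so that anchors are not spread over the whole 3D search space.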

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper focuses on several key issues in 3D object detection and proposes a multi-modal 3D object detection method. Specifically, it attempts to solve the following problems:

1. **Inappropriate anchor box size**:
   - Existing methods based on a 3D region proposal network (RPN) usually do not consider semantic information when generating anchor boxes, which results in inappropriate anchor box sizes. The effect is most pronounced for small and occluded objects and seriously degrades the final detection performance.
2. **Insufficient multi-modal feature fusion**:
   - Current multi-modal 3D object detection methods usually fuse only the features of high-level convolutional layers and ignore the information in low-level convolutional layers. Useful information is lost as a result, which hurts detection performance.
3. **Representation and processing of point cloud data**:
   - Point cloud data is sparse and irregular. Traditional representations (such as projection onto regular 3D voxel grids) both increase computational complexity and lose important 3D pattern information. How to represent and process point cloud data efficiently is an open problem.

### Solutions

To address the above problems, the paper proposes the following solutions:

1. **2D-guided precision anchor generation network (PAG-Net)**:
   - Use a mature 2D detector to obtain 2D bounding boxes and class labels as prior information, project the 2D bounding boxes into the 3D frustum space through the known camera projection matrix, and generate more precise, class-adapted 3D anchor boxes. This significantly improves the detection performance on small objects.
2. **Multi-layer feature fusion model**:
   - Propose a multi-layer feature fusion model that combines the features of multiple convolutional layers in a nonlinear and iterative way, effectively merging global and local features. This makes the fused features more robust and more discriminative.
3. **Improved 3D bounding box encoding method**:
   - Represent the 3D bounding box by encoding four corner points and a height, which reduces regression redundancy, and use physical constraints and semantic information to improve the accuracy of the detection results.

### Experimental verification

The paper conducts experimental verification on the KITTI object detection benchmark, focusing on the 3D and BEV detection tasks. The experimental results show that the proposed multi-modal 3D object detection method achieves significant performance improvements in the detection of cars, pedestrians, and cyclists at different difficulty levels, especially for small and occluded objects.
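The corner-plus-height box encoding listed under Solutions can be sketched as follows. This is only an illustration under stated assumptions: the box is described by a bird's eye view center, length, width, and yaw, and the function name `encode_box` and its argument order are hypothetical. The exact parameterization used in the paper is not given in this summary.

```python
import math
from typing import List, Tuple


def encode_box(cx: float, cy: float, length: float, width: float,
               yaw: float, height: float) -> Tuple[List[Tuple[float, float]], float]:
    """Encode a 3D box as four BEV corner points plus one height.

    (cx, cy) is the box center in the bird's eye view plane and yaw is the
    rotation about the vertical axis, in radians. This layout is an
    assumption made for illustration, not the paper's definition.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    # Corner offsets in the box's own frame, before rotation.
    half = [
        (length / 2, width / 2),
        (length / 2, -width / 2),
        (-length / 2, -width / 2),
        (-length / 2, width / 2),
    ]
    # Rotate each offset by yaw and translate to the box center.
    corners = [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]
    return corners, height


if __name__ == "__main__":
    corners, h = encode_box(0.0, 0.0, 4.0, 2.0, 0.0, 1.5)
    print(corners, h)
```

Under this assumed layout, a regression head would predict eight corner coordinates and one height. Because the four corners must form a rectangle, physical constraints of this kind can be checked on the prediction.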

Multi-modal 3D object detection by 2D-guided precision anchor proposal and multi-layer fusion
