Abstract: Background: 3D object detection from point clouds in road scenes has attracted much attention recently. Voxel-based methods voxelize the scene into regular grids, which can then be processed by convolutional feature-learning frameworks to learn semantic features. Point-based methods can extract the geometric features of each point because they preserve point coordinates. Combining the two is effective for 3D object detection. However, current methods use a voxel-based detection head with anchors for classification and localization. Although preset anchors cover the entire scene, this design is poorly suited to detection tasks with larger scenes and multiple object categories because of the limitation imposed by the voxel size. Additionally, the misalignment between the predicted confidence and the proposals during Region of Interest (RoI) selection hinders 3D object detection.
Methods: We investigate the combination of voxel-based and point-based methods for 3D object detection. We propose a voxel-to-point module that captures both semantic and geometric features. The voxel-to-point module helps detect small objects and avoids preset anchors in the inference stage. Moreover, we propose a confidence adjustment module with center-boundary-aware confidence attention to resolve the misalignment between the predicted confidence and the proposals during RoI selection.
Results: The proposed method achieves state-of-the-art results for 3D object detection on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) object detection dataset. As of September 19, 2021, our method ranked 1st in 3D and Bird's Eye View (BEV) detection of cyclists at the ‘easy’ difficulty level, and 2nd in 3D detection of cyclists at the ‘moderate’ level.
Conclusions: We propose an end-to-end two-stage 3D object detector with a voxel-to-point module and a confidence adjustment module.
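
To make the voxel-size limitation mentioned in the Background concrete, the following is a minimal, illustrative sketch of plain grid voxelization in pure Python. It is not the paper's voxel-to-point module or its detection head. The function names (`voxelize`, `voxel_centroids`), the sample points, and the voxel sizes are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group (x, y, z) points by the integer index of the voxel containing them.

    Hypothetical helper for illustration; not taken from the paper.
    """
    voxels = defaultdict(list)
    for p in points:
        # Integer grid index along each axis.
        idx = tuple(floor((c - o) / s) for c, o, s in zip(p, origin, voxel_size))
        voxels[idx].append(p)
    return dict(voxels)


def voxel_centroids(voxels):
    """Reduce each voxel to the mean of its points.

    After this step the individual point coordinates inside a voxel are lost.
    """
    return {
        idx: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for idx, pts in voxels.items()
    }


# Made-up sample points: two small clusters about one metre apart.
points = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.2, 0.1, 0.0), (1.3, 0.4, 0.2)]

coarse = voxelize(points, voxel_size=(2.0, 2.0, 2.0))  # both clusters share one voxel
fine = voxelize(points, voxel_size=(0.5, 0.5, 0.5))    # clusters fall into separate voxels

print(len(coarse), len(fine))
print(voxel_centroids(fine))
```

With the coarse grid, both clusters collapse into a single cell, so a voxel-level representation cannot distinguish them. The finer grid separates them, at the cost of more cells to process. Point-based features avoid this trade-off by keeping the raw coordinates.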

PVFE: Point-Voxel Feature Encoders for 3D Object Detection

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection

P2V-RCNN: Point to Voxel Feature Learning for 3D Object Detection From Point Clouds

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

PV-RCNN++: Semantical Point-Voxel Feature Interaction for 3D Object Detection

Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

F-PVNet: Frustum-Level 3-D Object Detection on Point–Voxel Feature Representation for Autonomous Driving

PVGNet: A Bottom-Up One-Stage 3D Object Detector with Integrated Multi-Level Features

DVFENet: Dual-branch voxel feature extraction network for 3D object detection

PointFPN: A Frustum-based Feature Pyramid Network for 3D Object Detection

AVFP-MVX: Multimodal VoxelNet with Attention Mechanism and Voxel Feature Pyramid

PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection

PVConvNet: Pixel-Voxel Sparse Convolution for multimodal 3D object detection

PV-EncoNet: Fast Object Detection Based on Colored Point Cloud

MSPV3D: Multi-Scale Point-Voxels 3D Object Detection Net

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection

3D Object Detection Combining Semantic and Geometric Features from Point Clouds

VP-Net: Voxels as Points for 3D Object Detection

Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection

PV-SSD: A Multi-Modal Point Cloud Feature Fusion Method for Projection Features and Variable Receptive Field Voxel Features