Abstract: In this work, we propose a novel two-stage framework for efficient 3D point cloud object detection. Instead of transforming point clouds into 2D bird's-eye-view projections, we parse the raw point cloud data directly in 3D space, yet achieve impressive efficiency and accuracy. To achieve this goal, we propose dynamic voxelization, a method that voxelizes points at local scale on the fly. By doing so, we preserve the point cloud geometry with 3D voxels, and therefore remove the dependence on expensive MLPs to learn from point coordinates. At the same time, we still follow the same processing pattern as point-wise methods (e.g., PointNet) and no longer suffer from the quantization issue of conventional convolutions. For further speed optimization, we propose a grid-based downsampling and voxelization method, and provide different CUDA implementations to accommodate the differing requirements of the training and inference phases. We highlight our efficiency on the KITTI 3D object detection dataset with 75 FPS and on the Waymo Open Dataset with 25 FPS inference speed, with satisfactory accuracy.
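
As a rough illustration of what voxelizing "at local scale" could look like, the NumPy sketch below averages the features of the points around one key point into a small k×k×k grid and then applies a dense 3D kernel to that grid. It is a simplified stand-in, not the paper's CUDA implementation: the function names `local_voxelize` and `apply_kernel`, the cube-shaped neighborhood, the mean pooling, and the `radius` and `k` parameters are all assumptions made for this example.

```python
import numpy as np


def local_voxelize(points, feats, center, radius=1.0, k=3):
    """Average the features of points near `center` into a k*k*k voxel grid."""
    rel = points - center                          # offsets relative to the key point
    inside = np.all(np.abs(rel) < radius, axis=1)  # cube of half-width `radius` (assumed)
    rel, f = rel[inside], feats[inside]
    # Map offsets in (-radius, radius) to integer voxel indices in [0, k).
    idx = np.floor((rel + radius) / (2.0 * radius) * k).astype(np.int64)
    idx = np.clip(idx, 0, k - 1)
    ix, iy, iz = idx[:, 0], idx[:, 1], idx[:, 2]
    grid = np.zeros((k, k, k, feats.shape[1]), dtype=np.float64)
    count = np.zeros((k, k, k), dtype=np.int64)
    np.add.at(grid, (ix, iy, iz), f)               # unbuffered sum handles repeated voxels
    np.add.at(count, (ix, iy, iz), 1)
    occupied = count > 0
    grid[occupied] /= count[occupied][:, None]     # mean feature per occupied voxel
    return grid, count


def apply_kernel(grid, weights):
    """Dense 3D kernel: (k,k,k,C_in) grid with (k,k,k,C_in,C_out) weights -> (C_out,)."""
    return np.einsum("xyzc,xyzcd->d", grid, weights)
```

Because the grid is centered on the key point rather than fixed in world coordinates, the geometry is encoded by which voxel a neighbor falls into, so a plain kernel can consume it without an MLP over raw coordinates.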

What problem does this paper attempt to address?

This paper addresses the efficiency and accuracy of 3D point cloud object detection. Specifically, the authors propose a new two-stage framework, DV-Det, which processes the raw point cloud data directly in 3D space instead of converting it into 2D bird's-eye-view projections, thereby achieving efficient 3D point cloud object detection.

### Main problems

1. **Limitations of existing methods**:
   - **Grid-based methods**: Although these methods can use 3D convolutional neural networks (CNNs) to extract features and predict bounding boxes, converting the point cloud into a regular grid leads to information loss, and because of the quantization problem, high-level CNN feature maps often lack precise regional feature representations.
   - **Point-based methods**: Methods such as PointNet and its variants learn point features directly from the original 3D LiDAR points and so avoid the quantization problem, but they rely on multi-layer perceptrons (MLPs), which are computationally expensive and result in slower inference.
2. **Requirement for real-time performance**:
   - Existing mainstream point cloud detection methods on the KITTI dataset usually reach only about 20 FPS (a few reach 45 FPS), which is not enough for the LiDAR sampling rate (20 Hz) in a full 360° field-of-view (FOV) scenario, especially when computing resources on edge devices are limited.

### Solutions proposed in the paper

To overcome the above problems, the authors propose the following innovations:

1. **Dynamic Voxelization**:
   - Voxelize the point cloud on the fly at a local scale, which preserves the point cloud's geometric structure, avoids the need for expensive MLPs, and at the same time avoids the quantization problem.
2. **Grid-based downsampling and hierarchical point convolution**:
   - Use a grid-based downsampling method to select key points efficiently, and construct 3D convolution kernels on the fly during forward propagation to ensure real-time performance.
3. **Location-Aware RoI Pooling**:
   - A lightweight pooling method that is 3 times faster and 4 times more memory-efficient than previous work, significantly improving the efficiency of RoI pooling.
4. **3D IoU Loss Function**:
   - An efficient 3D IoU loss calculation algorithm that is based entirely on the native operations of modern deep learning frameworks (such as TensorFlow and PyTorch), without the need to implement back-propagation manually.

Through these innovations, DV-Det achieves an inference speed of 75 FPS on the KITTI dataset and 25 FPS on the Waymo Open dataset while maintaining satisfactory accuracy.

### Summary

The main contribution of this paper is a new framework that combines the advantages of grid-based and point-based methods. Through techniques such as dynamic voxelization, efficient downsampling, and hierarchical convolution, it significantly improves the speed and accuracy of 3D point cloud object detection, meeting the requirements of real-time applications.
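
The grid-based downsampling idea mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's CUDA kernel: the function name `grid_downsample`, the `cell` size, and the rule of keeping the first point found in each occupied cell are choices made for this example, since the summary does not say which point per cell is retained.

```python
import numpy as np


def grid_downsample(points, cell=0.5):
    """Return indices of one point per occupied grid cell (first in input order)."""
    # Integer cell coordinates for every point; floor keeps negative
    # coordinates in the correct cell.
    cells = np.floor(points / cell).astype(np.int64)
    # For each distinct cell, np.unique reports the index of its first occurrence.
    _, first = np.unique(cells, axis=0, return_index=True)
    # Sort so the selected key points keep their original ordering.
    return np.sort(first)


pts = np.array([
    [0.10, 0.10, 0.10],
    [0.20, 0.30, 0.40],   # same cell as the first point
    [0.60, 0.10, 0.10],   # neighbouring cell along x
    [0.05, 0.05, 0.05],   # same cell as the first point
    [-0.10, 0.00, 0.00],  # cell on the negative side of x
])
key_idx = grid_downsample(pts, cell=0.5)
key_points = pts[key_idx]
```

Unlike farthest point sampling, this selection needs only a single pass of bucketing, which is one plausible reason a grid-based scheme suits real-time use.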

DV-Det: Efficient 3D Point Cloud Object Detection with Dynamic Voxelization

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

Real-Time Point Cloud Object Detection via Voxel-Point Geometry Abstraction

Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

P2V-RCNN: Point to Voxel Feature Learning for 3D Object Detection From Point Clouds

3D Point Cloud Object Detection Method Based on Multi-Scale Dynamic Sparse Voxelization

Accelerating Point-Voxel Representation of 3-D Object Detection for Automatic Driving

PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection

HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

F-PVNet: Frustum-Level 3-D Object Detection on Point–Voxel Feature Representation for Autonomous Driving

MSPV3D: Multi-Scale Point-Voxels 3D Object Detection Net

3D Object Detection Combining Semantic and Geometric Features from Point Clouds

Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

3DSSD: Point-based 3D Single Stage Object Detector

Improved Point-Voxel Region Convolutional Neural Network: 3D Object Detectors for Autonomous Driving

Dynamic Multitarget Detection Algorithm of Voxel Point Cloud Fusion Based on PointRCNN

VP-Net: Voxels as Points for 3D Object Detection

Efficient Point-Based Single Scale 3D Object Detection from Traffic Scenes

From Voxel to Point: IoU-guided 3D Object Detection for Point Cloud with Voxel-to-Point Decoder

FVNet: 3D Front-View Proposal Generation for Real-Time Object Detection from Point Clouds