DV-Det: Efficient 3D Point Cloud Object Detection with Dynamic Voxelization

Zhaoyu Su, Pin Siang Tan, Yu-Hsing Wang
DOI: https://doi.org/10.48550/arXiv.2107.12707
2021-07-27
Abstract: In this work, we propose a novel two-stage framework for efficient 3D point cloud object detection. Instead of transforming point clouds into 2D bird's-eye-view projections, we parse the raw point cloud data directly in 3D space, yet achieve impressive efficiency and accuracy. To achieve this goal, we propose dynamic voxelization, a method that voxelizes points at a local scale on the fly. By doing so, we preserve the point cloud geometry with 3D voxels and therefore remove the dependence on expensive MLPs to learn from point coordinates. On the other hand, we still follow the same processing pattern as point-wise methods (e.g., PointNet) and no longer suffer from the quantization issue of conventional convolutions. For further speed optimization, we propose a grid-based downsampling and voxelization method, and provide different CUDA implementations to accommodate the differing requirements of the training and inference phases. We highlight our efficiency on the KITTI 3D object detection dataset with 75 FPS and on the Waymo Open Dataset with 25 FPS inference speed, with satisfactory accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the efficiency and accuracy of 3D point cloud object detection. Specifically, the authors propose DV-Det, a new two-stage framework that processes the raw point cloud data directly in 3D space instead of converting it into 2D bird's-eye-view projections, thereby achieving efficient 3D point cloud object detection.

### Main problems

1. **Limitations of existing methods**:
   - **Grid-based methods**: These methods can use 3D convolutional neural networks (CNNs) to extract features and predict bounding boxes, but converting the point cloud into a regular grid loses information, and because of the quantization problem, high-level CNN feature maps often lack precise regional feature representations.
   - **Point-based methods**: Methods such as PointNet and its variants learn point features directly from the original 3D LiDAR points and so avoid the quantization problem, but they rely on multi-layer perceptrons (MLPs), which are computationally expensive and result in slower inference.
2. **Requirement for real-time performance**:
   - Mainstream point cloud detection methods usually reach only about 20 FPS on the KITTI dataset (a few reach 45 FPS), which is not enough to keep up with the LiDAR sampling rate (20 Hz) in a full 360° field-of-view (FOV) scenario, especially when computing resources on edge devices are limited.

### Solutions proposed in the paper

To overcome the above problems, the authors propose the following innovations:

1. **Dynamic Voxelization**:
   - Voxelize the point cloud on the fly at a local scale, which preserves the point cloud geometry, avoids the need for expensive MLPs, and at the same time avoids the quantization problem.
2. **Grid-based down-sampling and hierarchical point convolution**:
   - Propose a grid-based down-sampling method to select key points efficiently, and construct 3D convolution kernels on the fly during forward propagation to ensure real-time performance.
3. **Location-Aware RoI Pooling**:
   - A lightweight pooling method that is 3 times faster and 4 times more memory-efficient than previous work, significantly improving the efficiency of RoI pooling.
4. **3D IoU Loss Function**:
   - Develop an efficient 3D IoU loss calculation algorithm based entirely on the native operations of modern deep-learning frameworks (such as TensorFlow and PyTorch), without the need to implement back-propagation manually.

Through these innovations, DV-Det achieves an inference speed of 75 FPS on the KITTI dataset and 25 FPS on the Waymo Open dataset while maintaining satisfactory accuracy.

### Summary

The main contribution of this paper is a new framework that combines the advantages of grid-based and point-based methods. Through techniques such as dynamic voxelization, efficient down-sampling, and hierarchical convolution, it significantly improves the speed and accuracy of 3D point cloud object detection, meeting the requirements of real-time applications.
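To make the grid-based down-sampling and local (dynamic) voxelization ideas above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' CUDA implementation: the helper names `grid_downsample` and `local_voxelize`, the "keep the first point per cell" rule, the cube-shaped neighborhood, and the `k = 3` kernel resolution are all hypothetical choices made for this example.

```python
import numpy as np


def grid_downsample(points, cell_size):
    """Keep one point per occupied grid cell (hypothetical helper).

    Assumption: the first point encountered in each cell is kept; the
    paper's CUDA kernels may select key points differently.
    """
    points = np.asarray(points, dtype=np.float64)
    # Integer cell index of every point along x, y, z.
    cells = np.floor(points / cell_size).astype(np.int64)
    # Unique rows; return_index gives the first point that falls in each cell.
    _, first_idx = np.unique(cells, axis=0, return_index=True)
    keep = np.sort(first_idx)
    return points[keep], keep


def local_voxelize(points, center, radius, k=3):
    """Assign the neighbors of `center` to a k x k x k local voxel grid.

    Assumption: the neighborhood is an axis-aligned cube of half-width
    `radius` centered on the key point.
    """
    points = np.asarray(points, dtype=np.float64)
    offsets = points - np.asarray(center, dtype=np.float64)
    # Points whose offset lies strictly inside the cube around the center.
    inside = np.all(np.abs(offsets) < radius, axis=1)
    voxel_size = 2.0 * radius / k
    # Shift offsets to [0, 2 * radius) and quantize to voxel indices.
    ijk = np.floor((offsets[inside] + radius) / voxel_size).astype(np.int64)
    ijk = np.clip(ijk, 0, k - 1)  # guard against floating-point edge cases
    flat = ijk[:, 0] * k * k + ijk[:, 1] * k + ijk[:, 2]
    return np.flatnonzero(inside), flat


pts = np.array([
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [1.5, 0.1, 0.1],
    [1.6, 0.3, 0.2],
    [-0.4, 0.0, 0.0],
])
key_points, kept = grid_downsample(pts, cell_size=1.0)
neighbor_idx, voxel_idx = local_voxelize(pts, center=[0.0, 0.0, 0.0], radius=0.5)
```

In this sketch, voxel indices are computed relative to each key point rather than on a fixed global grid, which is the property the summary attributes to dynamic voxelization: the local geometry is preserved in a small 3D kernel without quantizing the whole scene up front.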