Deep Hough Voting for 3D Object Detection in Point Clouds

Charles R. Qi, Or Litany, Kaiming He, Leonidas J. Guibas
2019-08-23
Abstract: Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data and as generic as possible. However, due to the sparse nature of the data -- samples from 2D manifolds in 3D space -- we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to propose a 3D object detection method that works directly on point cloud data, without relying on 2D images or 2D detection results. Specifically, the paper addresses the following issues:

1. **Directly processing point cloud data**: Most existing 3D object detection methods are shaped by 2D detection frameworks. They typically convert irregular point clouds into regular 3D voxel grids or bird's-eye view images, or rely on detections in 2D images to propose 3D boxes. These approaches either lose information and fail to exploit the sparsity of the point cloud, or are computationally expensive.
2. **Challenge of point cloud sparsity**: Point cloud data is inherently sparse because the points are sampled from object surfaces. The center of a 3D object can therefore be far from any surface point, which makes it hard to predict 3D bounding box parameters directly from scene points.

To address these issues, the authors propose VoteNet, an end-to-end 3D object detection network that combines deep point set networks with a Hough voting mechanism. The key innovations of VoteNet include:

- **Voting mechanism**: The network generates new "vote" points that lie near the object center. Votes cast from different parts of an object, even parts that are far apart, form a cluster near the center, so their information can be aggregated and bounding box proposals can be generated from the vote clusters.
- **End-to-end optimization**: Unlike traditional Hough voting, VoteNet is a fully differentiable architecture that can be trained end-to-end via backpropagation.
- **Directly processing raw point clouds**: VoteNet operates directly on raw point cloud data, which avoids the information loss associated with converting point clouds into regular structures and takes advantage of their sparsity.
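The intuition behind the voting mechanism can be illustrated with a small, self-contained sketch. It is not the authors' implementation: in VoteNet the per-point offsets are regressed by a learned network, whereas here they are simulated as the true offset plus Gaussian noise. The sphere-shaped "object", the function names, and the noise level are all assumptions made for illustration.

```python
import math
import random


def sample_sphere_surface(center, radius, n, rng):
    """Sample n points on a sphere surface (a stand-in for scanned surface points)."""
    points = []
    for _ in range(n):
        # Normalizing a Gaussian vector gives a uniformly random direction.
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in d))
        points.append(tuple(c + radius * x / norm for c, x in zip(center, d)))
    return points


def cast_votes(seeds, center, noise_std, rng):
    """Each seed casts one vote: its own position plus an offset toward the center.

    Assumption: the offset is the true offset plus noise, standing in for
    the output of a learned regressor.
    """
    votes = []
    for s in seeds:
        offset = [c - x + rng.gauss(0.0, noise_std) for c, x in zip(center, s)]
        votes.append(tuple(x + o for x, o in zip(s, offset)))
    return votes


def mean_distance(points, target):
    """Average Euclidean distance from each point to the target."""
    return sum(math.dist(p, target) for p in points) / len(points)


rng = random.Random(0)
center = (1.0, 2.0, 0.5)  # hypothetical object center, located in empty space
seeds = sample_sphere_surface(center, radius=1.0, n=200, rng=rng)
votes = cast_votes(seeds, center, noise_std=0.05, rng=rng)

# Aggregating the vote cluster (here: a plain mean) estimates the center.
centroid = tuple(sum(v[i] for v in votes) / len(votes) for i in range(3))

seed_dist = mean_distance(seeds, center)  # one radius: no seed is near the center
vote_dist = mean_distance(votes, center)  # small: votes land close to the center

print(f"mean seed-to-center distance: {seed_dist:.3f}")
print(f"mean vote-to-center distance: {vote_dist:.3f}")
print(f"centroid error: {math.dist(centroid, center):.4f}")
```

Every surface point sits a full radius away from the center, yet the votes concentrate around it, so a simple aggregation of the cluster recovers the center far more accurately than any single surface point could. The real network replaces the plain mean with learned feature aggregation over each vote cluster.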
Experimental results show that VoteNet achieves state-of-the-art performance on two challenging 3D object detection datasets, SUN RGB-D and ScanNet. It significantly outperforms previous methods, which typically combine color images with geometric information, while itself using only geometric information. The paper also provides a detailed analysis of how much the voting mechanism contributes to 3D detection performance, particularly for objects whose center is far from any surface (e.g., tables and bathtubs).