Abstract: Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data and as generic as possible. However, due to the sparse nature of the data -- samples from 2D manifolds in 3D space -- we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.

What problem does this paper attempt to address?

The main goal of this paper is to propose a 3D object detection method that works directly on point cloud data, without relying on 2D images or 2D detection results. Specifically, the paper addresses the following issues:

1. **Directly processing point cloud data**: Most existing 3D object detection methods are influenced by 2D detection frameworks. They typically convert irregular point clouds into regular 3D voxel grids or bird's-eye view images, or rely on detections in 2D images to propose 3D boxes. These methods either fail to exploit the sparsity of the point cloud or are computationally expensive.
2. **Challenge of point cloud sparsity**: Point clouds sample only object surfaces, so the center of a 3D object can be far from any surface point. This makes it difficult to predict 3D bounding box parameters directly from scene points in a single step.

To address these issues, the authors propose VoteNet, an end-to-end 3D object detection network that combines deep point set networks with Hough voting. The key innovations of VoteNet include:

- **Voting mechanism**: The network generates "votes", new points that lie close to object centers. Votes aggregate information from different parts of an object, even when those parts are far apart. The votes form clusters near object centers, from which bounding box proposals are generated.
- **End-to-end optimization**: Unlike traditional Hough voting, VoteNet is a fully differentiable architecture that can be trained end-to-end via backpropagation.
- **Directly processing raw point clouds**: VoteNet operates directly on raw point cloud data, which avoids the information loss associated with converting point clouds into regular structures and takes advantage of their sparsity.
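The voting idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the learned offset network is replaced by the true offset plus Gaussian noise, and the names `farthest_point_sampling` and `cluster_votes`, the spherical toy objects, and the values of `radius` and the noise scale are all assumptions chosen for illustration.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k indices that are mutually far apart, starting at index 0."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def cluster_votes(votes, num_clusters, radius):
    """Group votes that fall within `radius` of each sampled cluster center."""
    centers = votes[farthest_point_sampling(votes, num_clusters)]
    d = np.linalg.norm(votes[None, :, :] - centers[:, None, :], axis=2)
    return [votes[row <= radius] for row in d]

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0, 0.5], [3.0, 0.0, 0.5]])

# Seed points lie on the surface of each toy object (a sphere of radius 0.5),
# so no seed is closer than 0.5 to its object's center.
seeds, offsets = [], []
for c in true_centers:
    dirs = rng.normal(size=(200, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    s = c + 0.5 * dirs
    seeds.append(s)
    # Stand-in for the learned offset prediction: true offset plus noise.
    offsets.append((c - s) + rng.normal(scale=0.05, size=s.shape))
seeds = np.concatenate(seeds)
offsets = np.concatenate(offsets)

# Each seed casts one vote: its own position shifted by the predicted offset.
votes = seeds + offsets

# Votes pile up near object centers; each cluster yields one center proposal.
clusters = cluster_votes(votes, num_clusters=2, radius=0.3)
proposal_centers = np.array([c.mean(axis=0) for c in clusters])
print(proposal_centers.round(2))
```

In this toy setup the seeds never come closer than 0.5 to an object center, while the votes concentrate around the centers, which is what makes the subsequent grouping step straightforward.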
Experimental results show that VoteNet achieves state-of-the-art performance on two challenging 3D object detection datasets (SUN RGB-D and ScanNet). It outperforms previous methods, which typically combine color images with geometric information, while using only geometric information itself. The paper also analyzes the contribution of the voting mechanism to detection performance, which matters most for objects whose center is far from their surface (e.g., tables and bathtubs).
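To make the "center far from the surface" point concrete, the sketch below samples points on the faces of a hollow box and measures how far the box center is from the nearest sampled point. The box dimensions, the helper name `sample_box_surface`, and the simple (not area-weighted) face sampling are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def sample_box_surface(size, n, rng):
    """Sample n points on the faces of an axis-aligned box centered at the origin.

    Faces are chosen uniformly per axis rather than by area, which is enough
    for this illustration.
    """
    half = np.asarray(size, dtype=float) / 2.0
    pts = rng.uniform(-half, half, size=(n, 3))
    axis = rng.integers(0, 3, size=n)        # axis whose coordinate is clamped
    sign = rng.choice([-1.0, 1.0], size=n)   # which of the two opposite faces
    pts[np.arange(n), axis] = sign * half[axis]
    return pts

rng = np.random.default_rng(0)
box_size = (1.6, 0.8, 0.5)                   # hypothetical bathtub-like extent in meters
surface = sample_box_surface(box_size, 2000, rng)

center = np.zeros(3)
gap = np.linalg.norm(surface - center, axis=1).min()
print(f"nearest surface point is {gap:.3f} m from the center")
```

No sampled point can be closer to the center than half the smallest box dimension (0.25 m here), so a detector that regresses the center from a single surface point always has to bridge at least that gap.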

Deep Hough Voting for 3D Object Detection in Point Clouds

S-VoteNet: Deep Hough Voting with Spherical Proposal for 3D Object Detection

Enhanced Vote Network for 3D Object Detection in Point Clouds

3D Object Detection from Point Cloud via Voting Step Diffusion

Back-tracing Representative Points for Voting-based 3D Object Detection in Point Clouds

A Multi-Level Semantic Fusion VoteNet for 3D Object Detection on Point Clouds

An End-to-End Deep Learning Network for 3D Object Detection From RGB-D Data Based on Hough Voting

Time-Sensitive 3D Single Target Tracking Based on Deep Hough Optimized Voting

VENet: Voting Enhancement Network for 3D Object Detection

Efficient Indoor 3D Object Detection in Point Clouds Using the Kinect Sensor

3DPVNet: Patch-level 3D Hough Voting Network for 6D Pose Estimation

Optimized CNNs for Rapid 3D Point Cloud Object Recognition

Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

P2V-RCNN: Point to Voxel Feature Learning for 3D Object Detection From Point Clouds

Refined Voting and Scene Feature Fusion for 3D Object Detection in Point Clouds

A Hierarchical Graph Network for 3D Object Detection on Point Clouds

HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Multi-feature Fusion VoteNet for 3D Object Detection

Vote-Based 3D Object Detection with Context Modeling and SOB-3DNMS

REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud

CP-VoteNet: Contrastive Prototypical VoteNet for Few-Shot Point Cloud Object Detection