VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, Jiaya Jia
2023-03-21
Abstract: 3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainframe detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing 3D object detection and tracking methods: they usually rely on hand-crafted proxies (such as anchors or center points), which add complexity and computational cost when applied to 3D data. Specifically, conventional 3D object detectors convert sparse voxel features into dense feature maps and then process them with dense prediction heads. This increases the computational cost and also produces redundant predictions, which require post-processing steps such as non-maximum suppression (NMS) to remove duplicate detections.

To solve these problems, the paper proposes VoxelNeXt, a fully sparse, voxel-based framework for 3D object detection and tracking. The core idea of VoxelNeXt is to predict 3D objects directly from sparse voxel features, without relying on hand-crafted proxies. This simplifies the detection pipeline, improves efficiency, and performs well on multiple benchmarks. The specific improvements are:

1. **Fully Sparse Convolutional Network**: VoxelNeXt uses a fully sparse convolutional network to predict 3D objects directly from sparse voxel features, avoiding the sparse-to-dense conversion.
2. **Efficient Down-sampling Layers**: Additional down-sampling layers enlarge the receptive field, so the network can better capture the features of large objects.
3. **Sparse Max-Pooling**: During inference, sparse max-pooling selects the voxels whose values are local maxima, which avoids the NMS post-processing step and further improves efficiency.
4. **Sparse Height Compression**: The 3D voxel features are compressed into 2D sparse feature maps, reducing the amount of computation.
5. **Spatial Voxel Pruning**: Irrelevant background voxels are gradually removed, reducing unnecessary computation.
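Sparse height compression and sparse max-pooling can be illustrated with a small, pure-Python sketch. This is not the authors' implementation. It assumes that each voxel carries a single scalar value standing in for both its feature and its predicted score, that height compression sums the values of voxels sharing the same (x, y) position, and that a 3x3 window is used for pooling. The function names `sparse_height_compression` and `sparse_max_pool` are hypothetical.

```python
from collections import defaultdict


def sparse_height_compression(coords, values):
    """Collapse (x, y, z) voxels onto the ground plane.

    Assumption for this sketch: values of voxels that share the same
    (x, y) position are summed. Only occupied positions are stored.
    """
    bev = defaultdict(float)
    for (x, y, _z), v in zip(coords, values):
        bev[(x, y)] += v
    return dict(bev)


def sparse_max_pool(scores, kernel=3):
    """Keep occupied positions whose score is a local maximum.

    Only occupied neighbours inside the kernel x kernel window are
    compared; empty positions are never materialised. Ties are kept.
    """
    r = kernel // 2
    keep = {}
    for (x, y), s in scores.items():
        neighbours = (
            scores.get((x + dx, y + dy))
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
        )
        if all(n is None or n <= s for n in neighbours):
            keep[(x, y)] = s
    return keep


# Toy input: two voxels stacked at (0, 0), one next to them, one far away.
coords = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (5, 5, 2)]
values = [0.25, 0.5, 0.5, 1.0]

bev = sparse_height_compression(coords, values)
peaks = sparse_max_pool(bev)
print(bev)    # {(0, 0): 0.75, (1, 0): 0.5, (5, 5): 1.0}
print(peaks)  # {(0, 0): 0.75, (5, 5): 1.0}
```

In this toy input, the voxel at (1, 0) is suppressed because its neighbour at (0, 0) has a higher value, which is the role NMS would otherwise play. A real detector would operate on per-class score tensors produced by a sparse convolution library rather than on Python dictionaries.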
With these improvements, VoxelNeXt is validated on the nuScenes, Waymo, and Argoverse2 benchmarks for 3D object detection and tracking, and the abstract reports that it outperforms all existing LiDAR methods on the nuScenes tracking test benchmark.