Abstract: 3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainframe detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark.

What problem does this paper attempt to address?

This paper addresses a limitation of existing 3D object detection and tracking methods: they usually rely on hand-designed proxies (such as anchors or center points), which add complexity and computational overhead when applied to sparse 3D data. Specifically, conventional 3D object detectors convert sparse voxel features into dense feature maps and then process them with dense prediction heads. This increases the computational cost and also produces redundant predictions, which then require post-processing steps such as non-maximum suppression (NMS) to remove duplicate detections. To address these problems, the paper proposes VoxelNeXt, a fully sparse voxel-based framework for 3D object detection and tracking. Its core idea is to predict 3D objects directly from sparse voxel features, without relying on hand-designed proxies. This simplifies the detection pipeline, avoids the cost of densification, and, according to the paper, performs well on multiple benchmarks. The specific improvements include:

1. **Fully Sparse Convolutional Network**: VoxelNeXt uses a fully sparse convolutional network to predict 3D objects directly from sparse voxel features, avoiding the sparse-to-dense conversion.
2. **Efficient Down-sampling Layers**: Additional down-sampling layers enlarge the receptive field, so the network can better capture the features of large objects.
3. **Sparse Max-Pooling**: During inference, sparse max-pooling selects the voxels whose scores are local maxima, which removes the need for NMS post-processing and further improves efficiency.
4. **Sparse Height Compression**: The 3D voxel features are compressed into 2D sparse feature maps, reducing the amount of computation.
5. **Spatial Voxel Pruning**: Irrelevant background voxels are gradually removed, reducing unnecessary computation.
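The height compression and pruning steps in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the sparse tensor is modeled as a dictionary from integer voxel coordinates to feature lists, features that share an (x, y) cell are assumed to be summed, and pruning is assumed to rank voxels by mean absolute feature value. The function names and the `keep_ratio` parameter are hypothetical.

```python
import math


def compress_height(voxels):
    """Collapse sparse 3D voxels {(x, y, z): features} onto the 2D (x, y) plane.

    Assumption: features that fall into the same (x, y) cell are summed
    element-wise. Empty cells are never materialized, so the output stays sparse.
    """
    plane = {}
    for (x, y, _z), feat in voxels.items():
        if (x, y) in plane:
            plane[(x, y)] = [a + b for a, b in zip(plane[(x, y)], feat)]
        else:
            plane[(x, y)] = list(feat)
    return plane


def prune_voxels(voxels, keep_ratio=0.5):
    """Keep only the voxels with the largest mean absolute feature value.

    Assumption: feature magnitude serves as the importance score, and
    keep_ratio is a hypothetical parameter chosen for illustration.
    """
    def magnitude(item):
        feat = item[1]
        return sum(abs(v) for v in feat) / len(feat)

    ranked = sorted(voxels.items(), key=magnitude, reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return dict(ranked[:n_keep])


voxels = {
    (0, 0, 0): [1.0, 2.0],
    (0, 0, 1): [0.5, 0.5],   # same (x, y) column as the voxel above
    (1, 0, 0): [0.1, -0.1],  # weak response, likely background
    (2, 3, 1): [3.0, -4.0],
}

print(compress_height(voxels))    # three occupied 2D cells remain
print(sorted(prune_voxels(voxels)))  # the two strongest voxels survive
```

Because both functions iterate only over occupied voxels, their cost scales with the number of non-empty voxels rather than with the size of the full grid, which is the point of keeping the pipeline sparse.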
According to the abstract, experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate these improvements, and the model outperforms all existing LiDAR methods on the nuScenes tracking test benchmark.
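To make the NMS-free selection concrete, the following minimal sketch shows how sparse max-pooling can pick local-maximum voxels. It is a simplified illustration, not the paper's code: scores live in a dictionary keyed by 2D voxel coordinates, and the 3x3 window and the `threshold` value are assumptions chosen for the example.

```python
def sparse_max_pool_peaks(scores, kernel=3, threshold=0.1):
    """Return occupied voxels whose score is the maximum within their window.

    scores maps (x, y) voxel coordinates to a confidence score. Only occupied
    voxels are visited, and missing neighbors are treated as empty, so the
    dense grid is never built. kernel and threshold are illustrative values.
    """
    radius = kernel // 2
    peaks = []
    for (x, y), score in scores.items():
        if score < threshold:
            continue  # discard low-confidence voxels early
        window_max = max(
            scores.get((x + dx, y + dy), float("-inf"))
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
        )
        # The window includes the voxel itself, so equality means "local maximum".
        if score >= window_max:
            peaks.append(((x, y), score))
    return sorted(peaks, key=lambda p: p[1], reverse=True)


scores = {
    (5, 5): 0.9,    # strongest response of the first object
    (5, 6): 0.6,    # neighbor of (5, 5), suppressed
    (6, 5): 0.4,    # neighbor of (5, 5), suppressed
    (20, 3): 0.7,   # isolated second object
    (21, 3): 0.05,  # below the threshold
}

print(sparse_max_pool_peaks(scores))
```

In this sketch each object is represented by its single strongest voxel, so overlapping duplicates are filtered by the pooling step itself and no box-overlap comparison is needed. Note that tied scores within one window would both be kept.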

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
