Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection

Guowen Zhang, Lue Fan, Chenhang He, Zhen Lei, Zhaoxiang Zhang, Lei Zhang
2024-06-19
Abstract: Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences will inevitably sacrifice the voxel spatial proximity. Such an issue is hard to be addressed by enlarging the group size with existing serialization-based methods due to the quadratic complexity of Transformers with feature sizes. Inspired by the recent advances of state space models (SSMs), we present a Voxel SSM, termed as Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs encourages our group-free design, alleviating the loss of spatial proximity of voxels. To further enhance the spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field in the 1D serialization curve, as well as more complete local regions in 3D space. Moreover, we implicitly apply window partition under the group-free framework by positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on Waymo Open Dataset and nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The main problem this paper addresses is that, in point-cloud-based 3D object detection, serializing 3D voxels into 1D sequences inevitably sacrifices the spatial proximity of voxels. Specifically:

1. **Limitations of Existing Methods**:
   - Serialization-based methods (window partitioning, Z-order sorting, Hilbert sorting, etc.) are effective, but voxels that are neighbors in 3D can end up far apart in the 1D sequence.
   - Because Transformers have quadratic complexity, enlarging the group size does not solve this problem effectively; it mainly increases the computational cost.
2. **Proposed New Method**:
   - The paper introduces Voxel Mamba, a method based on state space models (SSMs) that adopts a group-free strategy and serializes the entire voxel space into a single sequence.
   - Because SSMs have linear complexity, Voxel Mamba can process the whole sequence at once and avoids the loss of spatial proximity caused by grouping in earlier methods.
3. **Improvement Measures**:
   - A Dual-scale SSM Block (DSB) establishes a hierarchical structure, which enlarges the effective receptive field along the 1D serialization curve and yields more complete local regions in 3D space.
   - Implicit Window Partition (IWP) further enhances spatial proximity by encoding voxel positions through positional encoding, without explicit window partitioning.
4. **Experimental Results**:
   - Experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba achieves higher accuracy than existing methods and has significant advantages in computational efficiency.

In summary, this paper targets the loss of voxel spatial proximity in existing serialization-based 3D object detection methods and proposes Voxel Mamba as a more efficient and accurate alternative.
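To make the spatial-proximity issue concrete, the following is a minimal sketch of serializing voxels along a Z-order (Morton) curve, one of the serialization schemes mentioned above. It is an illustration only, not the paper's implementation: the names `morton_key`, `serialize`, and `sequence_gap`, the dense 4x4x4 grid, and the choice of Z-order rather than another curve are all assumptions made for this example.

```python
from itertools import product


def morton_key(x: int, y: int, z: int, bits: int = 4) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)      # x takes bit positions 0, 3, 6, ...
        key |= ((y >> i) & 1) << (3 * i + 1)  # y takes bit positions 1, 4, 7, ...
        key |= ((z >> i) & 1) << (3 * i + 2)  # z takes bit positions 2, 5, 8, ...
    return key


def serialize(voxels, bits: int = 4):
    """Sort voxel coordinates into a single 1D sequence by Z-order key."""
    return sorted(voxels, key=lambda v: morton_key(*v, bits=bits))


# Hypothetical dense 4x4x4 voxel grid (real LiDAR voxels are sparse).
grid = list(product(range(4), repeat=3))
sequence = serialize(grid, bits=2)
position = {voxel: index for index, voxel in enumerate(sequence)}


def sequence_gap(a, b) -> int:
    """Distance between two voxels measured along the 1D sequence."""
    return abs(position[a] - position[b])


# Both pairs below are direct neighbors in 3D (they differ by 1 along x),
# yet their distance along the serialized sequence is very different.
near_pair = sequence_gap((0, 0, 0), (1, 0, 0))
far_pair = sequence_gap((1, 0, 0), (2, 0, 0))
print(near_pair, far_pair)
```

In this toy grid the second pair straddles a 2x2x2 block boundary of the curve, so the two voxels land several positions apart even though they touch in 3D. If the sequence were then cut into fixed-size groups, such neighbors could fall into different groups, which is the effect that a single group-free sequence is meant to mitigate.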