OctFormer: Octree-based Transformers for 3D Point Clouds

Peng-Shuai Wang
DOI: https://doi.org/10.1145/3592131
2023-05-08
Abstract: We propose octree-based transformers, named OctFormer, for 3D point cloud learning. OctFormer not only serves as a general and effective backbone for 3D point cloud segmentation and object detection but also has linear complexity and scales to large point clouds. The key challenge in applying transformers to point clouds is reducing the quadratic, and thus overwhelming, computational complexity of attention. To combat this issue, several works divide point clouds into non-overlapping windows and constrain attention to each local window. However, the number of points in each window varies greatly, which impedes efficient execution on GPUs. Observing that attention is robust to the shapes of local windows, we propose a novel octree attention, which leverages the sorted shuffled keys of octrees to partition point clouds into local windows containing a fixed number of points while permitting the shapes of the windows to change freely. We also introduce dilated octree attention to expand the receptive field further. Our octree attention can be implemented in 10 lines of code with open-source libraries and runs 17 times faster than other point cloud attentions when the number of points exceeds 200k. Built upon the octree attention, OctFormer can be easily scaled up and achieves state-of-the-art performance on a series of 3D segmentation and detection benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud transformers in both efficiency and effectiveness. Notably, on the challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs by 7.3 in mIoU. Our code and trained models are available at https://wang-ps.github.io/octformer.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The paper proposes OctFormer, a transformer backbone for 3D point clouds. The problem it targets is that the cost of attention grows quadratically with the number of points, which makes standard transformers impractical for large point clouds. Existing window-attention methods reduce this cost by restricting attention to non-overlapping local windows, but the number of points per window varies widely, so the computation runs inefficiently on GPUs. OctFormer instead sorts the points by the shuffled keys of an octree and groups them into windows that each contain the same number of points, while the window shapes are allowed to vary. This keeps the overall complexity linear in the number of points, and the abstract states that the attention can be implemented in 10 lines of code with open-source libraries and runs 17 times faster than other point cloud attentions when the point count exceeds 200k. The paper also introduces dilated octree attention to enlarge the receptive field. According to the reported experiments, OctFormer achieves state-of-the-art results on a series of 3D segmentation and detection benchmarks, surpassing both sparse-voxel-based CNNs and earlier point cloud transformers; on ScanNet200 it outperforms sparse-voxel-based CNNs by 7.3 mIoU. OctFormer therefore serves as a general backbone for point cloud learning that also scales to large point clouds.
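
The partitioning idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the bit order of the key, the zero padding, and the simplified single-head attention without learned projections are all assumptions made for illustration. The sketch sorts points by a z-order (shuffled) key, pads them to a multiple of the window size, reshapes them into windows of exactly K points, and optionally interleaves points with a dilation factor before attending within each window.

```python
import numpy as np

def morton_key(ijk, depth):
    """Interleave the bits of integer voxel coordinates into a z-order key.
    Assumed bit order: x is the most significant bit of each triple."""
    ijk = np.asarray(ijk, dtype=np.int64)
    key = np.zeros(ijk.shape[0], dtype=np.int64)
    for b in range(depth):
        for axis in range(3):
            bit = (ijk[:, axis] >> b) & 1
            key |= bit << (3 * b + 2 - axis)
    return key

def partition_windows(feats, keys, K, dilation=1):
    """Sort features by key and split them into windows of exactly K points.
    With dilation D > 1, each window takes every D-th point from a run of
    K * D consecutive sorted points. Returns (windows, mask, order)."""
    n, c = feats.shape
    order = np.argsort(keys, kind="stable")
    pad = (-n) % (K * dilation)          # zero padding (an assumption)
    x = np.concatenate([feats[order], np.zeros((pad, c), dtype=feats.dtype)])
    mask = np.concatenate([np.ones(n, dtype=bool), np.zeros(pad, dtype=bool)])
    # (groups, K, D, C) -> (groups, D, K, C) -> (groups * D, K, C)
    xw = x.reshape(-1, K, dilation, c).transpose(0, 2, 1, 3).reshape(-1, K, c)
    mw = mask.reshape(-1, K, dilation).transpose(0, 2, 1).reshape(-1, K)
    return xw, mw, order

def window_attention(xw, mw):
    """Simplified self-attention inside each window (Q = K = V = features).
    Padded positions are masked out as keys."""
    c = xw.shape[-1]
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(c)      # (W, K, K)
    scores = np.where(mw[:, None, :], scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ xw

# Usage: 1000 random points on a 64^3 grid (depth 6), 8 feature channels.
rng = np.random.default_rng(0)
coords = rng.integers(0, 64, size=(1000, 3))
features = rng.standard_normal((1000, 8))
keys = morton_key(coords, depth=6)
windows, mask, order = partition_windows(features, keys, K=32, dilation=4)
out = window_attention(windows, mask)   # same shape as `windows`
```

Because every window has the same shape, the whole computation is a single batched matrix product of cost proportional to N * K rather than N * N, which is the source of the linear scaling the summary refers to. A full implementation would add learned query, key, and value projections, multiple heads, and positional encoding, none of which are shown here.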