PVT: Point-Voxel Transformer for 3D Deep Learning

Cheng Zhang, Haocheng Wan, Shengqiang Liu, Xinyi Shen, Zizhao Wu
2021-01-01
Abstract: In this paper, we present an efficient, high-performance neural architecture for 3D deep learning, termed Point-Voxel Transformer (PVT), which deeply integrates voxel-based and point-based self-attention to learn more discriminative features from 3D data. Specifically, we perform multi-head self-attention (MSA) on voxels to obtain an efficient learning pattern and coarse-grained local features, while performing self-attention on points to provide finer-grained information about the global context. In addition, to reduce the cost of MSA computation, we design a cyclic shifted boxing scheme that limits MSA to non-overlapping local boxes while preserving cross-box connections. Evaluated on a classification benchmark, PVT not only achieves state-of-the-art accuracy of 94.0% (no voting) but also outperforms previous Transformer-based models with a 7× measured speedup on average. On part and semantic segmentation, our model also obtains strong performance (86.5% and 68.2% mIoU, respectively). For the 3D object detection task, we replace the primitives in Frustum PointNet with PVT layers and achieve an improvement of 8.6% AP.
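The boxing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (partition_boxes, cyclic_shift, box_self_attention), the grid resolution, the box size, and the shift amount are all hypothetical, and the attention uses a single head with identity projections for brevity. It also omits any masking of tokens that wrap around the grid boundary after the shift, which a full implementation would likely need.

```python
import numpy as np

def partition_boxes(grid, box):
    """Split an (R, R, R, C) voxel grid into non-overlapping box^3 blocks.

    Returns an array of shape (num_boxes, box**3, C).
    """
    R, _, _, C = grid.shape
    assert R % box == 0, "resolution must be divisible by the box size"
    n = R // box
    x = grid.reshape(n, box, n, box, n, box, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the three box indices first
    return x.reshape(n ** 3, box ** 3, C)

def cyclic_shift(grid, shift):
    """Roll the grid along all three spatial axes so that the next
    partition straddles the previous box boundaries."""
    return np.roll(grid, shift=(-shift, -shift, -shift), axis=(0, 1, 2))

def box_self_attention(tokens):
    """Scaled dot-product self-attention computed independently per box.

    Assumption: one head, identity Q/K/V projections (illustration only).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 8, 16))  # hypothetical 8^3 grid, 16 channels

# Pass 1: attention inside regular, non-overlapping 4^3 boxes.
boxes = partition_boxes(grid, box=4)
out_regular = box_self_attention(boxes)

# Pass 2: shift by half a box, then partition again, so voxels that were
# in different boxes during pass 1 can now attend to each other.
shifted_boxes = partition_boxes(cyclic_shift(grid, 2), box=4)
out_shifted = box_self_attention(shifted_boxes)
```

Restricting attention to boxes of 64 voxels keeps each attention matrix at 64 × 64 instead of 512 × 512 for the whole grid in this toy setting, which is the cost saving the abstract refers to; alternating regular and shifted partitions is what restores connections across box boundaries.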