A Point Transformer Accelerator with Fine-Grained Pipelines and Distribution-Aware Dynamic FPS

Yaoxiu Lian, Xinhao Yang, Ke Hong, Yu Wang, Guohao Dai, Ningyi Xu
DOI: https://doi.org/10.1109/iccad57390.2023.10323766
2023-01-01
Abstract: Point-based neural networks have recently been applied to a wide range of 3D point cloud scenarios, and among them transformer-based networks achieve state-of-the-art accuracy. However, three challenges remain: (1) the data dependency between the transition down and feature extraction processes hinders parallel execution in networks such as Point Transformer; (2) the farthest point sampling (FPS) operator incurs redundant memory access and computational overhead during the transition down process; and (3) intermediate results are repeatedly fetched from memory and recomputed between the FPS and kNN operators in the transition down process. As a result, typical networks such as Point Transformer process on average 17.80 frames per second on NVIDIA Jetson Orin, which falls short of the requirement for real-time perception (~30 frames per second). In this paper, we propose PTrAcc, a Point Transformer accelerator with fine-grained pipelines and distribution-aware dynamic FPS. At the computation graph level, we find that narrowing the receptive field in Point Transformer causes little accuracy loss, so PTrAcc removes the MaxPool and attention-kNN layers together with their attached data dependencies to enable fine-grained pipelines, accelerating inference by 1.05×. At the operator level, since the distribution of accessed points varies across FPS iterations, PTrAcc introduces distribution-aware dynamic FPS, which uses that distribution to reduce redundant memory access and computational overhead and speeds up the FPS operations by 1.35×. At the architecture level, since the transition down process (FPS, kNN) accounts for 71.77% of the total inference time, PTrAcc adopts a fused FPS-kNN architecture that reduces repeated memory access and distance calculation for intermediate results, accelerating the process by up to 2.15×.
Extensive experimental results show that PTrAcc achieves up to 1.63× and 2.38× end-to-end speedup over the state-of-the-art accelerators MARS [1] and PointAcc [2], respectively, on various point cloud neural networks.
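For readers unfamiliar with the two operators in the transition down process, the following is a minimal sketch of textbook farthest point sampling followed by a brute-force kNN query. It is not PTrAcc's distribution-aware dynamic FPS or its fused FPS-kNN architecture, whose details the abstract does not give. The function names (`sq_dist`, `farthest_point_sampling`, `knn`) and the toy point set are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points: Sequence[Point], m: int, start: int = 0) -> List[int]:
    """Baseline FPS: pick m point indices, each farthest from those already picked.

    Every iteration scans all n points, which is the kind of repeated
    memory access and distance computation the abstract refers to.
    """
    n = len(points)
    if not 0 < m <= n:
        raise ValueError("m must be in 1..len(points)")
    # min_d[i] holds the squared distance from point i to its nearest selected point.
    min_d = [float("inf")] * n
    selected = [start]
    for _ in range(m - 1):
        last = points[selected[-1]]
        for i in range(n):
            d = sq_dist(points[i], last)
            if d < min_d[i]:
                min_d[i] = d
        # The next sample is the point farthest from the current selected set.
        selected.append(max(range(n), key=min_d.__getitem__))
    return selected


def knn(points: Sequence[Point], query: Point, k: int) -> List[int]:
    """Brute-force kNN: indices of the k points closest to the query."""
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], query))
    return order[:k]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
    centers = farthest_point_sampling(cloud, m=3)
    print("sampled indices:", centers)
    for c in centers:
        print("neighbors of", c, "->", knn(cloud, cloud[c], k=2))
```

In this sketch both operators evaluate distances between the sampled centers and the points of the cloud independently, so the same distances are computed twice. That duplication is one plausible reading of the repeated memory access and distance calculation that the abstract says a fused FPS-kNN design reduces.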