ParallelNN: A Parallel Octree-based Nearest Neighbor Search Accelerator for 3D Point Clouds
Faquan Chen, Rendong Ying, Jianwei Xue, Fei Wen, Peilin Liu
DOI: https://doi.org/10.1109/hpca56546.2023.10070940
2023-01-01
Abstract: As Light Detection And Ranging (LiDAR) becomes an essential component in robotic navigation and autonomous driving, real-time processing of high-throughput 3D point clouds is widely required. This work considers the point cloud k-Nearest Neighbor (kNN) search, which is an important 3D processing kernel. Although fine-grained parallelism optimization of internal processing, e.g., using multiple workers, has demonstrated high efficiency, previous accelerators with DDR external memory are fundamentally limited by the external bandwidth bottleneck. To break this bottleneck, this work proposes a highly parallel architecture, namely ParallelNN, for highly efficient kNN search processing of high-throughput point clouds. First, we optimize the multichannel cache based on High Bandwidth Memory (HBM) and on-chip memory to provide large external bandwidth. Then, a novel parallel depth-first octree construction algorithm is proposed and mapped onto multiple construction branches with trace-coded construction queues, which can regularize random accesses and perform multi-branch octree construction efficiently. Furthermore, in the search stage, we present algorithm-architecture co-optimization, including parallel keyframe-based scheduling and multi-branch flexible search engines, to provide conflict-free access and maximum reuse opportunities for reference points, which achieves more than 27.0× speedup compared with baseline architectures. We prototype ParallelNN on a Virtex HBM FPGA and perform extensive benchmarking on the KITTI dataset. The results demonstrate that ParallelNN achieves up to 107.7× and 12.1× speedup over CPU and GPU implementations, respectively, while also being more energy efficient, outperforming CPU and GPU implementations by 73.6× and 31.1×, respectively. In addition, with the proposed algorithm-architecture co-optimization, ParallelNN achieves 11.4× speedup over the state-of-the-art architecture. Moreover, ParallelNN is configurable and can be easily generalized to similar octree-based applications.