GauSPU: 3D Gaussian Splatting Processor for Real-Time SLAM Systems

Lizhou Wu, Haozhe Zhu, Siqi He, Jiapei Zheng, Chixiao Chen, Xiaoyang Zeng
DOI: https://doi.org/10.1109/micro61859.2024.00114
2024-01-01
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising technique in 3D vision and robotics. Its capacity for rapid rendering and high-fidelity reconstruction makes it an attractive candidate for integration into Simultaneous Localization and Mapping (SLAM) systems. However, existing 3DGS-based SLAM systems still suffer from inadequate tracking throughput, owing to the extensive recursion in volume rendering and the irregular memory access in gradient backpropagation. To address these challenges, this paper proposes GauSPU, an algorithm-hardware co-designed accelerator for real-time 3DGS-based SLAM. On the algorithm side, we present a sparse-tile-sampling (STS) method for efficient pose tracking. STS focuses on informative image regions and discards the rest, reducing the computational workload while maintaining accuracy. At the hardware level, we make two contributions. First, we design a sparsity-adaptive ray recursion unit (SA-RRU) that accelerates volume rendering by exploiting irregular spatial sparsity. The SA-RRU introduces a sub-tile-wise execution pattern and a Morton-based thread allocation scheme to improve sparsity utilization, and a sparsity-aware task dispatcher provides efficient fine-grained task scheduling. Second, we propose a memory-access-relaxed backpropagation engine (MAR-BE) for efficient gradient aggregation. It comprises a gradient buffer unit (GBU) for coalescing partial gradients and a pose backward unit (PBU) for pipeline-fused backpropagation, which together eliminate the costly atomic operations. Extensive experiments demonstrate that, by integrating GauSPU with a GPU, the system achieves a throughput of 33.6 FPS for real-time pose tracking in 3DGS-SLAM, with a $63.9\times$ improvement in energy efficiency over the RTX 3090 baseline.
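The "recursion in volume rendering" named in the abstract can be illustrated with the standard front-to-back alpha compositing loop used by 3DGS rasterizers. The following is a minimal sketch for a single ray, not the GauSPU hardware design: the function name `composite_ray`, the thresholds `alpha_min` and `t_min`, and the list-based data layout are all assumptions chosen for illustration.

```python
def composite_ray(colors, alphas, t_min=1e-4, alpha_min=1.0 / 255.0):
    """Front-to-back alpha compositing of depth-sorted Gaussians along one ray.

    colors: list of (r, g, b) tuples, nearest Gaussian first.
    alphas: per-Gaussian opacity at this pixel, each in [0, 1].
    t_min, alpha_min: illustrative thresholds (assumed values).
    Returns (pixel_rgb, remaining_transmittance, contributing_steps).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    steps = 0            # number of Gaussians that actually contributed

    for color, alpha in zip(colors, alphas):
        if alpha < alpha_min:
            # Negligible contribution: skipped, so work per ray is irregular.
            continue
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        # Each step depends on the transmittance left by the previous step.
        transmittance *= 1.0 - alpha
        steps += 1
        if transmittance < t_min:
            # Ray is effectively opaque: terminate early.
            break

    return pixel, transmittance, steps


if __name__ == "__main__":
    rgb, t, n = composite_ray([(1, 0, 0), (0, 1, 0)], [0.5, 0.5])
    print(rgb, t, n)
```

Because each update of `transmittance` depends on the previous one, the loop is inherently serial along a ray, and the number of contributing steps differs from ray to ray through skipping and early termination. This is one plausible reading of the irregular sparsity that the SA-RRU is described as exploiting.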