DF-GAS: a Distributed FPGA-as-a-Service Architecture Towards Billion-Scale Graph-based Approximate Nearest Neighbor Search.

Shulin Zeng, Zhenhua Zhu, Jun Liu, Haoyu Zhang, Guohao Dai, Zixuan Zhou, Shuangchen Li, Xuefei Ning, Yuan Xie, Huazhong Yang, Yu Wang
DOI: https://doi.org/10.1145/3613424.3614292
2023-01-01
Abstract: Embedding retrieval is a crucial task for recommendation systems. Graph-based approximate nearest neighbor search (GANNS) is the most commonly used method for retrieval, and achieves the best performance on billion-scale datasets. Unfortunately, existing CPU- and GPU-based GANNS systems struggle to optimize throughput under latency constraints on billion-scale datasets, because local memory bandwidth is underutilized (5-45%) and remote data access is expensive (∼85% of the total latency). In this paper, we first introduce a practically ideal GANNS architecture for billion-scale datasets, which facilitates a detailed analysis of the challenges and characteristics of distributed GANNS systems. Then, at the architecture level, we propose DF-GAS, a Distributed FPGA-as-a-Service (FPaaS) architecture for accelerating billion-scale Graph-based Approximate nearest neighbor Search. DF-GAS uses a feature-packing memory access engine to increase local memory bandwidth by 36-42%, and a data prefetching and delayed processing scheme to reduce remote data access overhead by 76.2%. At the system level, we exploit a “full-graph + sub-graph” hybrid parallel search scheme on a distributed FPaaS system. It achieves million-level queries per second with sub-millisecond latency on billion-scale GANNS for the first time. Extensive evaluations on million-scale and billion-scale datasets show that DF-GAS achieves, on average, 55.4×, 32.2×, 5.4×, and 4.4× better latency-bounded throughput than CPUs, GPUs, and two state-of-the-art ANNS architectures, ANNA [23] and Vstore [27], respectively.
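For readers unfamiliar with GANNS, the following is a minimal sketch of the generic best-first search that graph-based ANN indexes typically run in software. It is not the DF-GAS hardware design, and it does not model feature packing, prefetching, or distributed execution. The names `graph_search`, `l2`, `ef`, and the toy line graph are hypothetical and chosen only for illustration.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def graph_search(vectors, neighbors, query, entry, k=1, ef=8):
    """Best-first search on a proximity graph (illustrative, not DF-GAS).

    vectors:   list of feature vectors, indexed by vertex id
    neighbors: adjacency list, neighbors[v] is a list of vertex ids
    ef:        size of the result pool kept during the search (assumed knob)
    Returns the ids of the k closest vertices found, nearest first.
    """
    visited = {entry}
    d0 = l2(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded vertex first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top

    while candidates:
        d, v = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for u in neighbors[v]:
            if u in visited:
                continue
            visited.add(u)
            # Each distance computation reads one feature vector from memory;
            # this access pattern is what makes the search memory bound.
            du = l2(vectors[u], query)
            if len(results) < ef or du < -results[0][0]:
                heapq.heappush(candidates, (du, u))
                heapq.heappush(results, (-du, u))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result

    ranked = sorted((-neg_d, i) for neg_d, i in results)
    return [i for _, i in ranked[:k]]


# Toy example: ten points on a line, each linked to its immediate neighbors.
toy_vectors = [(float(i), 0.0) for i in range(10)]
toy_neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]
nearest = graph_search(toy_vectors, toy_neighbors, (6.2, 0.0), entry=0, k=3)
```

Each hop fetches the feature vectors of a vertex's neighbors, so the work is dominated by irregular memory reads rather than arithmetic. When the graph is partitioned across nodes, some of those reads become remote, which is the overhead the abstract attributes most of the latency to.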