FAERY: An FPGA-accelerated Embedding-based Retrieval System

Chaoliang Zeng, Layong Luo, Qingsong Ning, Yaodong Han, Yuhang Jiang, Ding Tang, Zilong Wang, Kai Chen, Chuanxiong Guo
2022-01-01
Abstract: Embedding-based retrieval (EBR) is widely used in recommendation systems to retrieve thousands of relevant candidates from a large corpus of millions of items or more. A good EBR system needs to achieve both high throughput and low latency: high throughput usually means cost savings, and low latency improves user experience. Unfortunately, the performance of existing CPU- and GPU-based EBR systems is far from optimal due to their inherent architectural limitations. In this paper, we first study how an ideal yet practical EBR system works, and then design FAERY, an FPGA-accelerated EBR system that achieves the optimal performance of this practically ideal EBR system. FAERY has three key components: high-bandwidth memory (HBM) for memory-bandwidth-intensive corpus scanning, a data-parallel approach for similarity calculation, and a pipeline-based approach for K-selection. To further reduce hardware resource usage, FAERY introduces a filter that drops non-Top-K items early. Experiments show that even a degraded FAERY with the same memory bandwidth as the GPU achieves 1.21×-12.27× lower latency and up to 4.29× higher throughput under a latency target of 10 ms than GPU-based EBR.
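The three stages named in the abstract (corpus scanning, similarity calculation, and K-selection with an early filter) can be illustrated with a minimal software sketch. This is not FAERY's hardware design: it is a sequential Python analogue that assumes inner product as the similarity measure, and the names `inner_product` and `top_k_scan` are hypothetical, chosen only for illustration.

```python
import heapq
import random


def inner_product(query, item):
    """Similarity between a query embedding and an item embedding.

    Inner product is an assumption here; the abstract does not name the metric.
    """
    return sum(q * x for q, x in zip(query, item))


def top_k_scan(query, corpus, k):
    """Brute-force scan: score every item and keep the K best in a min-heap.

    The heap root is the current K-th best score. Once the heap is full, any
    item scoring at or below the root cannot enter the Top-K, so it is dropped
    without modifying the heap (a software stand-in for an early filter).
    """
    heap = []  # min-heap of (score, item_id)
    for item_id, item in enumerate(corpus):
        score = inner_product(query, item)
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item_id))
    # Return results best-first.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_items, k = 8, 1000, 5
    corpus = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_items)]
    query = [rng.uniform(-1, 1) for _ in range(dim)]
    for score, item_id in top_k_scan(query, corpus, k):
        print(item_id, round(score, 4))
```

In this sketch most items fail the threshold comparison once the heap has filled, so only a small fraction of the corpus ever reaches the selection structure. That is the intuition behind filtering non-Top-K items early, though the actual filter in FAERY may work differently.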