FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems

Rui Ma, Evangelos Georganas, Alexander Heinecke, Sergey Gribok, Andrew Boutros, Eriko Nurvitadhi
DOI: https://doi.org/10.1109/lca.2022.3189207
IF: 2.3
2022-09-04
IEEE Computer Architecture Letters
Abstract: Training state-of-the-art artificial intelligence (AI) models requires scaling to many compute nodes and relies heavily on collective communication operations, such as all-reduce, to exchange the weight gradients between nodes. The overhead of these operations can bottleneck training performance as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead. Then, we propose a new smart network interface card (NIC) for distributed AI training using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize bandwidth utilization via data compression. The AI smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases the overall node-to-node communication efficiency. We build a prototype 6-node AI training system and show that our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6×, with an estimated 2.5× performance improvement at 32 nodes.
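For readers unfamiliar with the all-reduce operation mentioned in the abstract, the following is a minimal software sketch of one common variant, ring all-reduce, simulated in a single Python process. It is an illustration only: the abstract does not state which all-reduce algorithm the smart NIC implements, and the function name `ring_all_reduce` and the list-of-lists gradient layout are assumptions made for this example.

```python
def ring_all_reduce(grads):
    """Simulate a ring all-reduce (sum) over per-node gradient lists.

    Illustrative sketch only; not the paper's FPGA implementation.
    grads[i] is the gradient vector held by node i. All vectors must
    have the same length, divisible by the number of nodes.
    """
    n = len(grads)
    length = len(grads[0])
    if any(len(g) != length for g in grads) or length % n != 0:
        raise ValueError("vectors must share a length divisible by node count")
    size = length // n
    bufs = [list(g) for g in grads]  # work on copies, leave inputs untouched

    def chunk(c):
        # Index range of chunk c within a node's buffer.
        return slice(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps, node i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages so all sends in a step are simultaneous.
        msgs = [((i - step) % n, bufs[i][chunk((i - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(msgs):
            dst = (i + 1) % n
            sl = chunk(c)
            bufs[dst][sl] = [a + b for a, b in zip(bufs[dst][sl], data)]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, bufs[i][chunk((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(msgs):
            bufs[(i + 1) % n][chunk(c)] = data

    return bufs


# Example: 6 simulated nodes, each holding a 12-element gradient.
grads = [[node + j for j in range(12)] for node in range(6)]
reduced = ring_all_reduce(grads)
expected = [sum(col) for col in zip(*grads)]
```

In this scheme each node sends 2(n-1) chunks of size 1/n of the vector, so per-node traffic stays close to twice the gradient size regardless of node count, while the number of sequential communication steps grows with n.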
computer science, hardware & architecture