Alleviating All-to-All Communication for Deep Learning Recommendation Model Inference

Songjun Huang, Yihong Li, Liangkun Chen, Xiaoxi Zhang, Shuo Liu, Jingpu Duan, Wenfei Wu, Xu Chen
DOI: https://doi.org/10.1145/3677333.3678267
2024-01-01
Abstract: Massive deep learning recommendation models (DLRMs) require large-scale multi-node systems for distributed training and inference, and therefore suffer from an all-to-all communication bottleneck. We propose EmbedSwitch, an architecture that offloads the caching of embedding table vectors to a programmable switch, overcoming this bottleneck and serving embedding vector requests at switch-level response latency.
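The caching idea in the abstract can be illustrated with a minimal host-side sketch. This is not the EmbedSwitch implementation: the class `SwitchCacheSketch`, the function `remote_embedding`, and the LRU eviction policy are all assumptions made for illustration, and a real programmable switch would hold its cache in data-plane memory rather than in a Python dictionary. The sketch only shows how requests for frequently used embedding vectors could be answered from a cache on the path, so that only misses travel to the node holding the table shard.

```python
from collections import OrderedDict


class SwitchCacheSketch:
    """Toy stand-in for a switch-resident cache of embedding vectors.

    Hypothetical: the name, interface, and LRU policy are illustrative
    assumptions, not details taken from the paper.
    """

    def __init__(self, capacity, fetch_from_server):
        self.capacity = capacity
        self.fetch_from_server = fetch_from_server  # called only on a miss
        self.entries = OrderedDict()  # (table_id, row_id) -> vector
        self.hits = 0
        self.misses = 0

    def lookup(self, table_id, row_id):
        key = (table_id, row_id)
        if key in self.entries:
            # Hit: answer from the cache and mark the entry as recently used.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        # Miss: forward the request to the node that owns the table shard.
        self.misses += 1
        vector = self.fetch_from_server(table_id, row_id)
        self.entries[key] = vector
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return vector


def remote_embedding(table_id, row_id, dim=4):
    # Placeholder for a remote node holding the embedding table shard.
    return [float(table_id * 1000 + row_id + i) for i in range(dim)]


cache = SwitchCacheSketch(capacity=2, fetch_from_server=remote_embedding)
for row in [7, 7, 8, 7, 9, 8]:
    cache.lookup(0, row)
print(cache.hits, cache.misses)
```

In this toy run, repeated requests for the same row are served from the cache, while each miss costs one call to the remote node. The hit ratio depends on the access skew and cache capacity, both of which are arbitrary here.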