High-Speed Data Communication with Advanced Networks in Large Language Model Training

Liuyao Dai, Hao Qi, Weicong Chen, Xiaoyi Lu
DOI: https://doi.org/10.1109/mm.2024.3360081
IF: 2.8212
2024-01-01
IEEE Micro
Abstract: Large Language Models (LLMs) such as GPT, BERT, and T5 are pivotal in natural language processing, and the performance of their distributed training is influenced by high-speed interconnects. This study characterizes LLM training performance across several interconnects and communication protocols (TCP/IP, IPoIB, and RDMA) under both data and model parallelism. RDMA-100Gbps outperforms IPoIB-100Gbps and TCP/IP-10Gbps, with average gains of 2.5x and 4.8x, respectively, in data parallelism and 1.1x and 1.2x in model parallelism. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared with up to 20 Gbps for IPoIB and up to 9 Gbps for TCP/IP. Larger models demand more communication bandwidth: AllReduce consumes up to 91% of training time in data parallelism, and forward receive and Back-Embedding AllReduce take up to 90% in model parallelism. A larger-scale experiment confirms that communication dominates each iteration. These findings underscore the significance of communication in distributed LLM training and point to opportunities for optimization.
computer science, software engineering, hardware & architecture