LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
2024-09-02
Abstract: The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. The growing model size demands LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training. However, existing model-sharded data-parallel systems fail to efficiently utilize GPUs on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between collective operations and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to the network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to that of model-sharded data parallelism but with a centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the difficulty of efficiently training large language models (LLMs) on distributed GPU clusters. Specifically, existing model-sharded data-parallel systems cannot fully utilize GPU resources on commodity GPU clusters, mainly for two reasons:

1. **Interference between collective communication and GPU kernels**: Current systems rely on GPU cores to perform two costly collective operations, `all_gather` and `reduce_scatter`. GPU resources become fully occupied, so subsequent collective communication kernels are blocked and cannot fully overlap with GPU computation kernels.
2. **High CPU optimizer overhead**: Existing systems shard optimizer states and parameters across GPUs and use the CPU for optimizer updates. GPUs therefore sit idle during optimizer execution, and accessing optimizer states from SSDs also takes a significant amount of time.

To alleviate these issues, the authors propose LuWu, a novel in-network optimizer that aims to improve the training efficiency of large-scale models in the following ways:

- **Offloading optimizer states and parameters from GPU worker nodes to an in-network optimizer node**, thereby reducing the overhead on each GPU worker node.
- **Offloading collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization**, to eliminate interference between collective communication and GPU kernels.
- **Introducing many-to-one collective primitives (such as `push` and `pull`)**, which use in-network aggregation to minimize collective communication traffic.

Experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B-scale model on an 8-worker cluster.
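The `push`/`pull` pattern described above can be illustrated with a small single-process sketch. It is not LuWu's implementation: the class name `InNetworkOptimizer`, its method signatures, and the use of SGD with momentum as the update rule are all assumptions made for brevity, and the aggregation that the paper places on SmartNICs and SmartSwitches is replaced here by a plain elementwise average in Python.

```python
from typing import List


class InNetworkOptimizer:
    """Hypothetical stand-in for a central optimizer node.

    It holds the full parameters and optimizer state, so workers keep neither.
    The update rule (SGD with momentum) is an assumption for illustration.
    """

    def __init__(self, params: List[float], lr: float = 0.1, momentum: float = 0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.velocity = [0.0] * len(self.params)  # optimizer state lives here only

    def push(self, worker_grads: List[List[float]]) -> None:
        """Many-to-one: aggregate one gradient vector per worker, then update."""
        n = len(worker_grads)
        # Stand-in for in-network aggregation: elementwise average across workers.
        aggregated = [
            sum(grad[i] for grad in worker_grads) / n
            for i in range(len(self.params))
        ]
        for i, g in enumerate(aggregated):
            self.velocity[i] = self.momentum * self.velocity[i] + g
            self.params[i] -= self.lr * self.velocity[i]

    def pull(self, num_workers: int) -> List[List[float]]:
        """One-to-many: each worker receives a copy of the updated parameters."""
        return [list(self.params) for _ in range(num_workers)]


if __name__ == "__main__":
    optimizer = InNetworkOptimizer([1.0, 2.0], lr=0.5, momentum=0.0)
    # Two workers each push a locally computed gradient.
    optimizer.push([[1.0, 2.0], [3.0, 4.0]])
    # Both workers pull the same updated parameters for the next step.
    print(optimizer.pull(num_workers=2))
```

In this toy version each worker sends one gradient vector and receives one parameter vector per step, while the optimizer state never leaves the central node, which is the property the summary attributes to the centralized design.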