LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang

2024-09-02

Abstract: Recent progress in large language models (LLMs) has brought tremendous application prospects to the world. Growing model sizes demand LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training. However, existing model-sharded data-parallel systems fail to utilize GPUs efficiently on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between collective operations and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to the network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to that of model-sharded data parallelism, but with centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. Experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

The paper addresses the challenge of efficiently training large language models (LLMs) on distributed GPU clusters. Specifically, existing model-sharded data-parallel systems cannot fully utilize GPU resources on commodity GPU clusters, mainly for two reasons:

1. **Interference between collective communication and GPU kernels**: Current systems rely on GPU cores to perform two costly collective operations (`all_gather` and `reduce_scatter`). These occupy GPU resources, so subsequent collective communication kernels are blocked and cannot fully overlap with GPU computation kernels.
2. **High CPU optimizer overhead**: Existing systems shard optimizer states and parameters across GPUs and use the CPU for optimizer updates. GPUs therefore sit idle during optimizer execution, and accessing optimizer states from SSDs also takes a significant amount of time.

To alleviate these issues, the authors propose LuWu, a novel in-network optimizer that aims to improve the training efficiency of large-scale models in three ways:

- **Offloading optimizer states and parameters from GPU worker nodes to an in-network optimizer node**, thereby reducing the overhead on each GPU worker node.
- **Offloading collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization**, to eliminate interference between collective communication and GPU kernels.
- **Introducing many-to-one collective primitives (such as `push` and `pull`)**, which use in-network aggregation to minimize collective communication traffic.

Experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B-scale model.
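The many-to-one `push`/`pull` pattern with a centralized optimizer can be illustrated with a small, single-process sketch. This is only a toy model of the communication pattern, not LuWu's implementation: the class name `InNetworkOptimizer`, the helper `run_iteration`, and the use of plain SGD on averaged gradients are all assumptions made for illustration. The summary does not say which optimizer is used, and the SmartNIC, SmartSwitch, and out-of-core storage are not modeled here.

```python
from typing import List


class InNetworkOptimizer:
    """Toy stand-in for a centralized optimizer node (hypothetical name and API)."""

    def __init__(self, params: List[float], num_workers: int, lr: float = 0.1):
        self.params = list(params)          # parameters live on the optimizer node
        self.num_workers = num_workers
        self.lr = lr
        self._grad_sum = [0.0] * len(params)
        self._pushed = 0

    def push(self, grads: List[float]) -> None:
        """Many-to-one: accumulate one worker's gradients into a running sum."""
        if len(grads) != len(self.params):
            raise ValueError("gradient length does not match parameter length")
        for i, g in enumerate(grads):
            self._grad_sum[i] += g
        self._pushed += 1
        # Once every worker has pushed, run the optimizer step centrally.
        if self._pushed == self.num_workers:
            self._step()

    def _step(self) -> None:
        # Assumption: plain SGD on the averaged gradient, chosen for brevity.
        for i in range(len(self.params)):
            avg = self._grad_sum[i] / self.num_workers
            self.params[i] -= self.lr * avg
        self._grad_sum = [0.0] * len(self.params)
        self._pushed = 0

    def pull(self) -> List[float]:
        """One-to-many: hand a copy of the current parameters to a worker."""
        return list(self.params)


def run_iteration(opt: InNetworkOptimizer,
                  worker_grads: List[List[float]]) -> List[List[float]]:
    """One training iteration: all workers push, then all workers pull."""
    for grads in worker_grads:
        opt.push(grads)
    return [opt.pull() for _ in worker_grads]


if __name__ == "__main__":
    opt = InNetworkOptimizer([1.0, 2.0], num_workers=2, lr=0.5)
    print(run_iteration(opt, [[1.0, 0.0], [3.0, 2.0]]))
```

In this sketch the workers hold no optimizer state at all: they only send gradients and receive updated parameters, which mirrors the idea of moving both the optimizer and the aggregation off the GPU workers.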
