ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon

2024-06-03

Abstract: Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data parallel settings induces a communication overhead that increases with the number of distributed workers, which can impede the efficiency gains of parallelization. To address this challenge, optimization algorithms that reduce inter-worker communication have emerged, such as local optimization methods used in Federated Learning. While effective in minimizing communication overhead, these methods incur significant memory costs, hindering scalability: in addition to extra momentum variables, if communications are only allowed between multiple local optimization steps, then the optimizer's states cannot be sharded among workers. In response, we propose $\textbf{AC}$cumulate while $\textbf{CO}$mmunicate ($\texttt{ACCO}$), a memory-efficient optimization algorithm tailored for distributed training of LLMs. $\texttt{ACCO}$ allows optimizer states to be sharded across workers, overlaps gradient computations and communications to conceal communication costs, and accommodates heterogeneous hardware. Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications, eliminating the need for warmup steps and aligning with the training dynamics of standard distributed optimization while converging faster in terms of wall-clock time. We demonstrate the effectiveness of $\texttt{ACCO}$ on several LLM training and fine-tuning tasks.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the communication overhead encountered during distributed training of large language models (LLMs). Specifically:

1. **Communication Overhead**: In a data-parallel setup, the communication required to synchronize gradients grows with the number of distributed worker nodes, which reduces the efficiency of parallelization.
2. **Memory Cost**: Existing methods for reducing communication overhead, such as local optimization methods, are effective but incur significant memory overhead, hindering scalability. These methods require additional momentum variables, and the optimizer's state cannot be sharded across multiple worker nodes.

To tackle these issues, the authors propose a memory-efficient optimization algorithm called ACCO (Accumulate while you Communicate). The main features of ACCO include:

- **Sharded Optimizer State**: Allows the optimizer state to be sharded across worker nodes, reducing memory requirements.
- **Overlap of Computation and Communication**: By overlapping gradient computation and communication, it hides communication costs and improves training efficiency.
- **Heterogeneous Hardware Adaptability**: Capable of adapting to worker nodes with different performance levels, maximizing GPU utilization.

Through these improvements, ACCO can effectively hide communication costs and accelerate the distributed training of large language models without increasing memory overhead.
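To make the two effects named in this summary concrete, here is a toy cost model in Python. It is not the ACCO algorithm and is not taken from the paper; it only illustrates why overlapping communication with computation can hide communication time, and why sharding optimizer state reduces per-worker memory. The names `StepCost`, `sequential_step_time`, `overlapped_step_time`, and `optimizer_state_bytes_per_worker` are hypothetical, the timings are made-up inputs, and the default of 8 bytes per parameter assumes an Adam-style optimizer holding two fp32 values per parameter.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Per-step timings in seconds (hypothetical inputs, not measurements)."""
    compute_s: float  # time to compute gradients for one step
    comm_s: float     # time to synchronize gradients for one step


def sequential_step_time(cost: StepCost) -> float:
    """Communication starts only after computation finishes."""
    return cost.compute_s + cost.comm_s


def overlapped_step_time(cost: StepCost) -> float:
    """Idealized overlap: both run concurrently, so the slower one dominates."""
    return max(cost.compute_s, cost.comm_s)


def optimizer_state_bytes_per_worker(
    n_params: int,
    n_workers: int,
    sharded: bool,
    bytes_per_param: int = 8,  # assumption: two fp32 values per parameter
) -> float:
    """Optimizer-state memory held by each worker, in bytes."""
    total = n_params * bytes_per_param
    # Sharding splits the state evenly; otherwise every worker keeps a full copy.
    return total / n_workers if sharded else float(total)


if __name__ == "__main__":
    cost = StepCost(compute_s=1.0, comm_s=0.5)
    print("sequential:", sequential_step_time(cost))
    print("overlapped:", overlapped_step_time(cost))
    print("sharded bytes/worker:",
          optimizer_state_bytes_per_worker(1_000_000, 8, sharded=True))
```

In this idealized model, communication is fully hidden whenever it takes no longer than computation. Real systems rarely reach perfect overlap, so the overlapped figure should be read as a lower bound on step time.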

DiLoCo: Distributed Low-Communication Training of Language Models

Communication Efficient Distributed Training with Distributed Lion

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning

CELLM: An Efficient Communication in Large Language Models Training for Federated Learning

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

Efficient and Economic Large Language Model Inference with Attention Offloading

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Straggler-aware Distributed Learning: Communication Computation Latency Trade-off

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

AB-Training: A Communication-Efficient Approach for Distributed Low-Rank Learning

Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression

Communication-and-Computation Efficient Split Federated Learning: Gradient Aggregation and Resource Management

High Performance LDA Through Collective Model Communication Optimization

Bridging the Gap Between Memory and Communication Efficiency on Distributed Deep Learning Systems

Mini-batch Coresets for Memory-efficient Training of Large Language Models