Abstract: The growth of Large Language Models (LLMs) has necessitated large-scale distributed training. Highly optimized frameworks, however, still suffer significant losses in Model FLOPs Utilization (MFU), often staying below 50%, due to large communication volumes. Meanwhile, our comprehensive profiling shows that the computation- and communication-intensive operators overlap well. This paper introduces DHelix, a novel micro-structure, inspired by the structure of DNA, that dramatically improves the efficiency of LLM training. Central to DHelix's design is Strand Interleaving (SI), which views the continuous stream of training micro-batches through a GPU as two strands. DHelix juxtaposes the forward and backward passes of the two strands and systematically optimizes an SI plan that co-schedules operators from the opposite strands, enabled by operator-level overlap profiling results and a dynamic-programming-based search algorithm. DHelix also lets the two strands share model states and space for activation data, effectively accommodating two micro-batches with under 3% extra memory space. DHelix integrates seamlessly with all forms of existing data/model parallelism, the most challenging being pipeline parallelism, thanks to its unique model folding design that results in a W-shaped pipeline. We evaluate DHelix training with the popular Llama and GPT dense models, plus the Phi Mixture-of-Experts (MoE) model, across 3 GPU clusters (A40, A800, and H100). Results show that it achieves 12-40% (up to 58% MFU) and 2-29% (up to 71% MFU) improvement on the 64-A40 and 64-A800 clusters, respectively, significantly outperforming state-of-the-art methods. On the H100 cluster, though the faster network reduces DHelix's profit margin, it makes cross-node tensor parallelism promising, a practice currently prohibitive due to communication costs.
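To make the idea of a dynamic-programming search over co-scheduled operators concrete, here is a minimal sketch. It is an illustration under stated assumptions, not DHelix's actual algorithm: the function name `plan_interleaving`, the two operator kinds, and all timings are hypothetical. It assumes each strand is an ordered list of operators, that `solo` gives each operator's standalone time, and that `paired` holds profiled times for running one operator from each strand together.

```python
# Toy sketch (NOT the paper's algorithm): alignment-style DP that decides,
# step by step, whether to run the next operator of strand A alone, the next
# operator of strand B alone, or co-schedule one operator from each strand.

def plan_interleaving(strand_a, strand_b, solo, paired):
    """Return (total_time, plan) minimizing total time under the toy model.

    strand_a, strand_b: ordered lists of operator kinds (hypothetical labels)
    solo:   dict kind -> time when the operator runs alone
    paired: dict (kind_a, kind_b) -> profiled time when co-scheduled;
            pairs missing from the table are assumed not to overlap at all
    """
    n, m = len(strand_a), len(strand_b)
    inf = float("inf")
    # best[i][j]: min time to finish the first i ops of A and first j ops of B
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    choice = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0

    def relax(i, j, cost, tag):
        if cost < best[i][j]:
            best[i][j] = cost
            choice[i][j] = tag

    for i in range(n + 1):
        for j in range(m + 1):
            cur = best[i][j]
            if cur == inf:
                continue
            if i < n:  # run A's next operator alone
                relax(i + 1, j, cur + solo[strand_a[i]], "a")
            if j < m:  # run B's next operator alone
                relax(i, j + 1, cur + solo[strand_b[j]], "b")
            if i < n and j < m:  # co-schedule one operator from each strand
                a, b = strand_a[i], strand_b[j]
                pair_cost = paired.get((a, b), solo[a] + solo[b])
                relax(i + 1, j + 1, cur + pair_cost, "ab")

    # Backtrack from (n, m) to recover the chosen schedule.
    plan, i, j = [], n, m
    while (i, j) != (0, 0):
        tag = choice[i][j]
        if tag == "a":
            plan.append((strand_a[i - 1], None))
            i -= 1
        elif tag == "b":
            plan.append((None, strand_b[j - 1]))
            j -= 1
        else:
            plan.append((strand_a[i - 1], strand_b[j - 1]))
            i, j = i - 1, j - 1
    plan.reverse()
    return best[n][m], plan


# Hypothetical timings: a compute op overlaps well with a communication op.
solo = {"compute": 4, "comm": 3}
paired = {("compute", "comm"): 5, ("comm", "compute"): 5}
strand_a = ["compute", "comm", "compute"]   # e.g. one strand's forward ops
strand_b = ["comm", "compute", "comm"]      # e.g. the other strand's backward ops

total, plan = plan_interleaving(strand_a, strand_b, solo, paired)
print(total, plan)
```

With these made-up numbers, running the six operators back to back takes 21 time units, while the plan found pairs each compute operator with a communication operator from the other strand and finishes in 15. A real search would also have to respect data dependencies, memory limits, and measured overlap behavior, none of which this toy model captures.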

OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

DiLoCo: Distributed Low-Communication Training of Language Models

DiPaCo: Distributed Path Composition

LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

Distributed Deep Learning in Open Collaborations

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

Coded Parallelism for Distributed Deep Learning

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Communication Patterns in Distributed Deep Learning

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining

A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization

INTELLECT-1 Technical Report

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way