Abstract: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which impose significant overhead. MPI libraries co-designed with GPU-based compression libraries have been shown to reduce message size significantly and make better use of interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5\% increase in TFLOPS per GPU and a 23.6\% increase in samples per second for GPT-NeoX-20B training. However, such a strategy ignores the differences in sparsity among the messages communicated in each parallel dimension, thus introducing more errors and degrading training loss. Therefore, we applied hybrid compression settings to each parallel dimension and adjusted the compression intensity accordingly. Because gradients have a low-rank structure (<a class="link-https" data-arxiv-id="2301.02654" href="https://arxiv.org/abs/2301.02654">arXiv:2301.02654</a>), we apply aggressive compression to them when performing the DP All-reduce. We adopt milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3\% increase in TFLOPS per GPU and a 12.7\% increase in samples per second while reaching baseline loss convergence.
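The idea of choosing a compression intensity per parallel dimension can be illustrated with a minimal sketch. It is not the compressor or the MPI integration used in this work: it assumes plain uniform quantization on the CPU as a stand-in for a GPU-based lossy compressor, and the `BITS_PER_DIM` table, its values, and all function names are hypothetical. It only shows how fewer bits per value (aggressive, for DP gradients) trade a larger reconstruction error for a smaller message than more bits per value (milder, for TP and PP traffic).

```python
from typing import List, Tuple

# Hypothetical bits-per-value for each parallel dimension (illustrative only,
# not values from the paper): fewer bits means more aggressive compression.
BITS_PER_DIM = {"dp": 4, "tp": 12, "pp": 12}


def quantize(values: List[float], bits: int) -> Tuple[float, float, List[int]]:
    """Uniformly quantize values to `bits` bits over their own min/max range."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Fall back to a unit step when all values are identical (avoids 0 division).
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return lo, step, codes


def dequantize(lo: float, step: float, codes: List[int]) -> List[float]:
    """Reconstruct approximate values from the quantization codes."""
    return [lo + c * step for c in codes]


def roundtrip(values: List[float], dim: str) -> List[float]:
    """Compress and decompress a message using the setting for `dim`."""
    return dequantize(*quantize(values, BITS_PER_DIM[dim]))


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two messages."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Synthetic message standing in for a gradient or activation buffer.
    message = [0.001 * i * i - 0.5 for i in range(64)]
    for dim in ("dp", "tp", "pp"):
        err = max_abs_error(message, roundtrip(message, dim))
        print(f"{dim}: {BITS_PER_DIM[dim]} bits/value, max abs error {err:.6f}")
```

With uniform quantization the reconstruction error is bounded by half the quantization step, so the 4-bit setting has a much looser error bound than the 12-bit one. This mirrors the trade-off the abstract describes between aggressive and milder compression.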

Accelerating Large Language Model Training with In-Package Optical Links for Scale-Out Systems

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Efficient Large-Scale Language Model Training on GPU Clusters

High-Speed Data Communication with Advanced Networks in Large Language Model Training

Optical training of large-scale Transformers and deep neural networks with direct feedback alignment

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Data-parallel distributed training of very large models beyond GPU capacity

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Accelerating Neural Networks for Large Language Models and Graph Processing with Silicon Photonics

Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment

Fast and scalable all-optical network architecture for distributed deep learning

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Optimizing Distributed Training on Frontier for Large Language Models