Abstract: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which impose significant overhead. MPI libraries co-designed with GPU-based compression libraries have been shown to reduce message size significantly and make better use of interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, such a strategy ignores the sparsity discrepancy among the messages communicated in each parallelism dimension, thus introducing more error and degrading training loss. Therefore, we incorporated hybrid compression settings for each parallel dimension and adjusted the compression intensity accordingly. Because gradients have a low-rank structure (<a class="link-https" data-arxiv-id="2301.02654" href="https://arxiv.org/abs/2301.02654">arXiv:2301.02654</a>), we apply aggressive compression to them when performing DP All-reduce. We adopt milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.
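The hybrid idea in the abstract, a different compression intensity per parallel dimension, can be illustrated with a small sketch. This is not the authors' implementation: plain uniform quantization stands in for the GPU compression library, the bit widths are illustrative guesses, 32-bit source values are assumed, and the names `CompressionSetting`, `HYBRID_SETTINGS`, `compress_for`, and `size_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionSetting:
    bits: int  # bits kept per value; fewer bits means more aggressive compression


# Hypothetical per-dimension settings (bit widths are assumptions, not from the paper).
HYBRID_SETTINGS = {
    "dp": CompressionSetting(bits=8),   # gradients in DP all-reduce: aggressive
    "tp": CompressionSetting(bits=16),  # activations and parameters: milder
    "pp": CompressionSetting(bits=16),  # activations between pipeline stages: milder
}


def quantize(values, bits):
    """Lossy round trip: snap each float to one of 2**bits - 1 uniform steps and back."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant message: nothing to lose
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def compress_for(dimension, values):
    """Apply the compression intensity configured for one parallel dimension."""
    return quantize(values, HYBRID_SETTINGS[dimension].bits)


def size_ratio(dimension, source_bits=32):
    """Message size reduction factor, assuming source_bits per uncompressed value."""
    return source_bits / HYBRID_SETTINGS[dimension].bits


def max_error(original, restored):
    """Largest absolute element-wise difference between two messages."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Usage: the same message loses more precision on the DP path than on the TP path.
message = [i / 997 - 0.5 for i in range(997)]
for dim in ("dp", "tp", "pp"):
    restored = compress_for(dim, message)
    print(dim, size_ratio(dim), max_error(message, restored))
```

The design point is the trade-off stated in the abstract: the DP path accepts a larger reconstruction error in exchange for a smaller message, while the TP and PP paths give up some bandwidth savings to keep errors small.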
