Abstract: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, and these routines impose significant overhead. MPI libraries co-designed with GPU-based compression libraries have been shown to reduce message size significantly and to make better use of interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, such a strategy ignores the differences in sparsity among the messages communicated in each parallelism dimension, and it therefore introduces more error and degrades training loss. We therefore applied hybrid compression settings to each parallel dimension and adjusted the compression intensity accordingly. Because gradients have a low-rank structure (arXiv:2301.02654), we apply aggressive compression to them when performing DP All-reduce. We adopt milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.
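
The trade-off behind the hybrid scheme can be illustrated with a minimal sketch. It is not the compression library or MPI integration used in this work. It substitutes plain uniform quantization for the GPU compressor, and the names `BITS_PER_VALUE`, `lossy_roundtrip`, `max_abs_error`, and `compression_ratio`, the bit widths, and the 32-bit original size are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical per-dimension settings (not taken from the paper):
# fewer bits means more aggressive compression and larger error.
BITS_PER_VALUE: Dict[str, int] = {
    "dp": 8,   # gradients in DP All-reduce: aggressive
    "tp": 16,  # TP traffic: milder, to preserve precision
    "pp": 16,  # PP traffic: milder, to preserve precision
}


def lossy_roundtrip(values: List[float], bits: int) -> List[float]:
    """Uniformly quantize values to 2**bits levels, then reconstruct them.

    This stands in for a compress/decompress pair around a collective.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant message: nothing is lost
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest element-wise error introduced by the lossy round trip."""
    restored = lossy_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


def compression_ratio(bits: int, original_bits: int = 32) -> float:
    """Message-size reduction, assuming original_bits per uncompressed value."""
    return original_bits / bits


if __name__ == "__main__":
    message = [i / 1000 - 0.5 for i in range(1001)]  # toy payload
    for dim, bits in BITS_PER_VALUE.items():
        print(dim, compression_ratio(bits), max_abs_error(message, bits))
```

In this toy setting the DP entry sends a smaller message at the cost of a larger reconstruction error, while the TP and PP entries keep the error small at a lower compression ratio. This is the same direction of trade-off the abstract describes for each parallel dimension.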

Accelerating Large Language Model Training with Hybrid GPU-based Compression
