Abstract: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which impose significant overhead. MPI libraries co-designed with GPU-based compression libraries have been shown to reduce message size significantly and make better use of interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, such a strategy ignores the differences in sparsity among the messages communicated in each parallelism dimension, thus introducing more error and degrading training loss. Therefore, we applied hybrid compression settings to each parallel dimension and adjusted the compression intensity accordingly. Because gradients have a low-rank structure (arXiv:2301.02654), we apply aggressive compression to them when performing the DP All-reduce. We adopt milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.
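The idea of assigning a different compression intensity to each parallel dimension can be illustrated with a small sketch. This is not the compression library or MPI integration used in the work. It substitutes plain uniform quantization as a stand-in lossy compressor, and the names `HYBRID_BITS`, `quantize`, `roundtrip_error`, and `compression_ratio`, the bit widths, and the 32-bit source precision are all hypothetical choices made for illustration.

```python
import random

# Hypothetical settings: bits kept per value in each parallel dimension.
# Fewer bits means more aggressive compression (assumed values, not from the paper).
HYBRID_BITS = {"dp": 8, "tp": 16, "pp": 16}


def quantize(values, bits):
    """Stand-in lossy compressor: uniform quantization to `bits` bits per value."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant message, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate values from integer codes."""
    return [lo + c * scale for c in codes]


def roundtrip_error(values, dim):
    """Maximum absolute error after compressing and decompressing for `dim`."""
    codes, lo, scale = quantize(values, HYBRID_BITS[dim])
    restored = dequantize(codes, lo, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


def compression_ratio(dim, source_bits=32):
    """Message size reduction, assuming `source_bits` per uncompressed value."""
    return source_bits / HYBRID_BITS[dim]


if __name__ == "__main__":
    rng = random.Random(0)
    message = [rng.gauss(0.0, 1.0) for _ in range(1024)]
    for dim in ("dp", "tp", "pp"):
        print(dim, compression_ratio(dim), roundtrip_error(message, dim))
```

In this toy setting the DP message shrinks more but carries a larger reconstruction error than the TP and PP messages, which mirrors the trade-off the abstract describes: aggressive compression where the data tolerates it, milder compression where precision matters.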

T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

TensorTEE: Unifying Heterogeneous TEE Granularity for Efficient Secure Collaborative Tensor Computing

On Optimizing the Communication of Model Parallelism

Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology

RTP: Rethinking Tensor Parallelism with Memory Deduplication

ISO: Overlap of Computation and Communication within Sequence For LLM Inference

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Collaborative Inference for Large Models with Task Offloading and Early Exiting

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Communication Compression for Tensor Parallel LLM Inference

Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

Accelerating Large Language Model Training with In-Package Optical Links for Scale-Out Systems

Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads

Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference