Abstract: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which impose significant overhead. When co-designed with GPU-based compression libraries, MPI libraries have been shown to reduce message sizes significantly and make better use of interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, such a strategy ignores the differences in sparsity among the messages communicated along each parallelism dimension, thus introducing more error and degrading training loss. Therefore, we applied hybrid compression settings to each parallel dimension and adjusted the compression intensity accordingly. Because gradients exhibit low-rank structure (arXiv:2301.02654, https://arxiv.org/abs/2301.02654), we apply aggressive compression to them when performing DP All-reduce. We adopt milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.
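
The hybrid idea described in the abstract, aggressive compression for DP gradients and milder compression for TP and PP traffic, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: uniform quantization stands in for the GPU-based compression libraries used in the work, and the names `BITS_PER_DIMENSION`, `quantize`, `compress_for`, and `max_error`, along with the bit widths, are hypothetical rather than the settings reported by the authors.

```python
from typing import Dict, List

# Hypothetical per-dimension settings (not from the paper):
# fewer bits means more aggressive, lossier compression.
BITS_PER_DIMENSION: Dict[str, int] = {
    "dp": 4,   # gradients in DP all-reduce: aggressive
    "tp": 12,  # activations and parameters in TP: milder
    "pp": 12,  # activations in PP: milder
}


def quantize(values: List[float], bits: int) -> List[float]:
    """Lossy round trip: snap each value to one of 2**bits uniform levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant message, nothing to quantize
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]


def compress_for(dimension: str, values: List[float]) -> List[float]:
    """Apply the compression intensity configured for one parallel dimension."""
    return quantize(values, BITS_PER_DIMENSION[dimension])


def max_error(original: List[float], restored: List[float]) -> float:
    """Largest absolute difference introduced by the lossy round trip."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Stand-in message; real messages would be gradient or activation buffers.
message = [i / 100 for i in range(-50, 51)]
dp_error = max_error(message, compress_for("dp", message))
tp_error = max_error(message, compress_for("tp", message))
print(f"DP max error: {dp_error:.6f}, TP max error: {tp_error:.6f}")
```

In this toy setting the DP path introduces a larger error than the TP path, which mirrors the trade-off in the abstract: the aggressive setting is reserved for the messages assumed to tolerate it best.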

Communication Compression for Tensor Parallel LLM Inference

Towards Low-bit Communication for Tensor Parallel LLM Inference

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

Accelerating Large Language Model Training with Hybrid GPU-based Compression

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

TP-Aware Dequantization

A Speed Odyssey for Deployable Quantization of LLMs

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Activation Sparsity Opportunities for Compressing General Large Language Models

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

On the Compressibility of Quantized Large Language Models

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization