Abstract: Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify common issues in previous gradient compression systems and evaluation methodologies. These include excessive computational overheads; incompatibility with all-reduce; and insufficient evaluation methods, such as not using an end-to-end metric or using a 32-bit baseline instead of the stronger 16-bit baseline. We revisit common compression approaches (sparsification, quantization, and low-rank decomposition) and demonstrate how considering the above issues can lead to minor but strategic design changes, resulting in notably better performance. Our goal is to raise awareness of the need for design and evaluation standards that naturally translate to the end-to-end utility of gradient compression.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the main bottleneck of gradient aggregation in large-scale distributed machine learning training systems. Specifically, while gradient compression can reduce the amount of communication data, it often fails to accelerate training in practice while maintaining model accuracy. The paper focuses on the following two aspects:

1. **Design Issues**:
   - **High Computational Overhead**: Many gradient compression schemes incur high computational overhead during the process, especially when using inefficient memory access patterns on GPUs.
   - **Incompatibility with All-Reduce**: Many compression schemes are not compatible with All-Reduce collective communication operations, leading to low communication efficiency.
2. **Evaluation Issues**:
   - **Insufficient Evaluation Methods**: Existing gradient compression research often focuses only on compression ratio and throughput, neglecting the decline in model accuracy. These studies often use full precision (FP32) as a baseline, whereas half precision (FP16) is actually a stronger baseline.
   - **Lack of End-to-End Performance Evaluation**: Most studies do not use end-to-end performance metrics (such as Time to Accuracy, TTA) but rely on single performance metrics, which may lead to an incomplete understanding of system performance.

### Main Contributions

1. **Identifying Design Issues**: The paper identifies common design issues in gradient compression systems that limit training speed or affect model accuracy.
2. **Improving Evaluation Methods**: The paper points out the shortcomings of existing evaluation methods and proposes improved evaluation standards, particularly using TTA as the main end-to-end performance metric.
3. **Case Studies**: Through case studies, the paper demonstrates the design and evaluation issues of three main gradient compression methods (sparsification, quantization, and low-rank decomposition) and proposes optimization techniques to improve system practicality.

### Case Studies

The paper conducts detailed case studies on three common gradient compression methods (sparsification, quantization, and low-rank decomposition) and proposes some optimization techniques:

1. **TopK Sparsification**:
   - **Background and Issues**: TopK sparsification selects the top K coordinates of the gradient for transmission but has high computational overhead and is incompatible with All-Reduce.
   - **Improvements**: The TopK Chunked (TopKC) method is proposed, which reduces computational overhead by dividing the gradient into small chunks and is compatible with All-Reduce. Experimental results show that TopKC outperforms traditional TopK methods in terms of throughput and accuracy.
2. **THC Quantization**:
   - **Background and Issues**: THC quantization improves quantization accuracy through Random Hadamard Transform (RHT), but RHT has high computational overhead and is incompatible with All-Reduce.
   - **Improvements**: Partial Rotation and Saturation techniques are proposed to reduce the number of RHT iterations and increase the number of communication bits to address computational overhead and overflow issues. Experimental results show that these improvements significantly enhance system performance.
3. **PowerSGD Low-Rank Decomposition**:
   - **Background and Issues**: PowerSGD compresses gradients through low-rank decomposition but has high computational overhead and is incompatible with All-Reduce.
   - **Improvements**: The paper proposes optimization techniques for PowerSGD, but specific details are not elaborated in the abstract.

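
The chunked TopK idea described above can be illustrated with a small single-process sketch. This is one plausible reading of "divide the gradient into chunks and stay All-Reduce compatible", not the paper's actual implementation. The helper names (`chunk_scores`, `all_reduce_sum`, `chunked_topk_aggregate`), the use of squared L2 norm as the chunk score, and the two-round structure are all assumptions made for illustration. The point it tries to show is that when every worker selects the same chunks, the payloads have identical shape and can be summed element-wise, which is what an all-reduce needs.

```python
from typing import List, Tuple


def chunk_scores(grad: List[float], chunk_size: int) -> List[float]:
    """Squared L2 norm of each fixed-size chunk (assumed scoring rule)."""
    return [
        sum(x * x for x in grad[i:i + chunk_size])
        for i in range(0, len(grad), chunk_size)
    ]


def all_reduce_sum(vectors: List[List[float]]) -> List[float]:
    """Stand-in for an all-reduce: element-wise sum across workers."""
    return [sum(vals) for vals in zip(*vectors)]


def chunked_topk_aggregate(
    worker_grads: List[List[float]], chunk_size: int, chunks_kept: int
) -> Tuple[List[float], List[int]]:
    """Aggregate only the highest-scoring chunks, agreed on by all workers."""
    # Round 1: sum per-chunk scores so every worker sees the same ranking.
    scores = all_reduce_sum([chunk_scores(g, chunk_size) for g in worker_grads])
    top = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    top = sorted(top[:chunks_kept])

    # Round 2: each worker packs the same chunks, so payloads align
    # coordinate by coordinate and can be summed directly.
    payloads = [
        [x for c in top for x in g[c * chunk_size:(c + 1) * chunk_size]]
        for g in worker_grads
    ]
    reduced = all_reduce_sum(payloads)

    # Scatter the reduced values back; unselected coordinates stay zero.
    n = len(worker_grads[0])
    out = [0.0] * n
    pos = 0
    for c in top:
        for i in range(c * chunk_size, min((c + 1) * chunk_size, n)):
            out[i] = reduced[pos]
            pos += 1
    return out, top


# Toy gradients for two simulated workers (made-up values).
g0 = [0.1, -0.1, 3.0, 1.0, 0.0, 0.2, -2.0, 2.0]
g1 = [0.2, 0.0, 2.0, -1.0, 0.1, 0.1, 1.0, 3.0]
aggregated, kept = chunked_topk_aggregate([g0, g1], chunk_size=2, chunks_kept=2)
```

In this toy run, chunks 1 and 3 are kept and the rest of the aggregated gradient is zero. Contiguous chunks also mean contiguous memory reads, which is the kind of access pattern the summary suggests matters on GPUs; a real system would run both rounds through a collective library rather than a Python loop.
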
### Conclusion

By identifying and addressing design and evaluation issues in gradient compression systems, the paper proposes a series of optimization techniques aimed at improving the practicality and performance of gradient compression in large-scale distributed machine learning training. These improvements not only enhance system throughput but also ensure model accuracy.
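
To make the throughput-versus-accuracy distinction concrete, here is a minimal sketch of how a time-to-accuracy (TTA) measurement might be computed from a training log. The function name `time_to_accuracy`, the log format, and all numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import List, Optional, Tuple


def time_to_accuracy(
    history: List[Tuple[float, float]], target: float
) -> Optional[float]:
    """Return the first elapsed time at which accuracy reaches the target.

    `history` holds (elapsed_seconds, validation_accuracy) pairs in time
    order. Returns None if the run never reaches the target.
    """
    for elapsed, accuracy in history:
        if accuracy >= target:
            return elapsed
    return None


# Made-up logs: the compressed run evaluates more often per unit time
# (higher throughput) but plateaus below the target accuracy.
fp16_baseline = [(100.0, 0.60), (200.0, 0.72), (300.0, 0.76)]
compressed = [(60.0, 0.55), (120.0, 0.68), (180.0, 0.71), (240.0, 0.74)]

baseline_tta = time_to_accuracy(fp16_baseline, target=0.75)
compressed_tta = time_to_accuracy(compressed, target=0.75)
```

With these illustrative logs the baseline reaches the 0.75 target at 300 seconds, while the compressed run never does and returns `None`, even though each of its steps is faster. Reporting only throughput or compression ratio would hide that difference.
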

Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression
