Distributed learning with compressed gradient differences

K. Mishchenko, E. Gorbunov, M. Takáč, P. Richtárik
DOI: https://doi.org/10.1080/10556788.2024.2358790
2024-09-29
Optimization Methods and Software
Abstract: Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression of updates were recently proposed using sparsification or quantization. However, none of the prior methods are able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new method, DIANA, which resolves this issue via compression of gradient differences. We provide theory in the strongly convex and nonconvex settings that shows improved convergence rates, and use it to obtain the first convergence rate for the previously proposed method TernGrad. Finally, we provide theory to support non-smooth regularizers.
Operations Research & Management Science; Mathematics, Applied; Computer Science, Software Engineering
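
The idea stated in the abstract, compressing gradient differences rather than the gradients themselves, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy objective f_i(x) = 0.5 * ||x - b_i||^2 on each worker, an unbiased ternary quantizer scaled by the max-norm, and hand-picked step sizes `gamma` and `alpha`. The names `ternary_quantize`, `run`, `targets`, and `use_memory` are hypothetical. Each worker keeps an estimate `h[i]` of its own gradient and transmits only the quantized difference between the fresh gradient and that estimate.

```python
import random


def ternary_quantize(v, rng):
    """Unbiased ternary quantization: each entry becomes 0 or +/- max|v|."""
    scale = max(abs(t) for t in v)
    if scale == 0.0:
        return [0.0] * len(v)
    # Entry j survives with probability |v_j| / scale, so E[Q(v)] = v.
    return [scale * (1 if t > 0 else -1) if rng.random() < abs(t) / scale else 0.0
            for t in v]


def run(targets, steps=1000, gamma=0.2, alpha=0.2, use_memory=True, seed=0):
    """Minimize (1/n) * sum_i 0.5 * ||x - b_i||^2 with compressed messages.

    use_memory=True  : quantize the difference g_i - h_i (the idea in the abstract).
    use_memory=False : h_i stays at zero, so the gradient g_i is quantized directly.
    gamma and alpha are illustrative choices, not values taken from the paper.
    """
    rng = random.Random(seed)
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    h = [[0.0] * d for _ in range(n)]  # per-worker gradient estimates
    for _ in range(steps):
        g_hat = [0.0] * d  # aggregated gradient estimate
        for i, b in enumerate(targets):
            grad = [x[j] - b[j] for j in range(d)]         # local gradient of f_i
            delta = [grad[j] - h[i][j] for j in range(d)]  # gradient difference
            q = ternary_quantize(delta, rng)               # the only vector sent
            for j in range(d):
                g_hat[j] += (h[i][j] + q[j]) / n
                if use_memory:
                    h[i][j] += alpha * q[j]                # learn the local gradient
        x = [x[j] - gamma * g_hat[j] for j in range(d)]
    return x


# Hypothetical data: three workers with different local minimizers b_i.
targets = [[1.0, -2.0, 0.5, 3.0], [4.0, 0.0, -1.5, 1.0], [-2.0, 5.0, 2.5, -1.0]]
optimum = [sum(col) / len(targets) for col in zip(*targets)]


def error(x):
    return max(abs(a - b) for a, b in zip(x, optimum))


err_diana = error(run(targets, use_memory=True))
err_direct = error(run(targets, use_memory=False))
print("with gradient differences:", err_diana)
print("with direct quantization: ", err_direct)
```

In this toy problem the local gradients do not vanish at the optimum, so quantizing them directly keeps injecting noise at a constant step size. The differences `g_i - h_i` shrink as `h[i]` approaches the local gradient, and the quantization noise shrinks with them.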