Abstract: In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently at distributed compute nodes to update the local models, which are then aggregated intermittently. Although existing convergence analyses suggest that, with heterogeneous data, FedAvg degrades quickly as the number of local steps increases, it has been shown to work quite well in practice, especially in the distributed training of large language models. In this work we try to explain this good performance from the viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In the overparameterized regime, gradient descent at each compute node leads the local model in a specific direction. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as the centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show that it converges to the centralized model in direction for linear classification. We empirically verify our theoretical findings on linear models and also conduct experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.
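The regression claim can be illustrated with a small numerical sketch. This is not the authors' code; it rests on several assumptions introduced here for illustration: squared loss, consistent (noiseless) labels, full-row-rank local data, and zero initialization. Under those assumptions, running gradient descent to convergence at one node from the current global model is replaced by its closed-form limit, the Euclidean projection of the global model onto that node's solution set. The names `local_gd_limit` and `local_gd`, the problem sizes, and the number of rounds are hypothetical.

```python
import numpy as np

def local_gd_limit(w, X_pinv, X_i, y_i):
    # Assumed closed form for "many local steps": GD on 0.5*||X_i v - y_i||^2
    # started at w stays in w + rowspace(X_i) and converges to the projection
    # of w onto the local solution set {v : X_i v = y_i}.
    return w + X_pinv @ (y_i - X_i @ w)

def local_gd(Xs, ys, rounds=2000):
    # Each round: every node runs local GD to convergence from the global
    # model, then the server averages the resulting local models.
    d = Xs[0].shape[1]
    pinvs = [np.linalg.pinv(X) for X in Xs]  # precomputed once per node
    w = np.zeros(d)                          # zero initialization (assumption)
    for _ in range(rounds):
        local_models = [
            local_gd_limit(w, P, X, y) for P, X, y in zip(pinvs, Xs, ys)
        ]
        w = np.mean(local_models, axis=0)
    return w

# Synthetic overparameterized problem: 4 nodes, 5 samples each, 50 features.
rng = np.random.default_rng(0)
d, n_per_node, nodes = 50, 5, 4
w_true = rng.normal(size=d)
Xs = [rng.normal(size=(n_per_node, d)) for _ in range(nodes)]
ys = [X @ w_true for X in Xs]

w_global = local_gd(Xs, ys)

# Centralized reference: minimum-norm interpolator on the pooled data,
# which is the solution GD from zero reaches on the full dataset.
w_central = np.linalg.pinv(np.vstack(Xs)) @ np.concatenate(ys)

gap = np.linalg.norm(w_global - w_central)
print(f"distance between aggregated and centralized models: {gap:.2e}")
```

In this toy setting the aggregated model and the pooled minimum-norm solution should agree up to numerical precision, even though each node sees a different dataset. The classification case described in the abstract concerns convergence in direction and is not covered by this sketch.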

Would decentralization hurt generalization?

Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

Improved Stability and Generalization Guarantees of the Decentralized SGD Algorithm

Topology-aware Generalization of Decentralized SGD

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm

Stability-Based Generalization Analysis of the Asynchronous Decentralized SGD

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

Why (and When) does Local SGD Generalize Better than SGD?

Asynchronous Stochastic Gradient Descent over Decentralized Datasets

Generalization Error Matters in Decentralized Learning Under Byzantine Attacks

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Distributed Gradient Descent with Many Local Steps in Overparameterized Models

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Decentralized Federated Learning with Unreliable Communications

Tackling Data Heterogeneity: A New Unified Framework for Decentralized SGD with Sample-induced Topology

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning

Stability and Generalization for Distributed SGDA

DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm

Adjacent Leader Decentralized Stochastic Gradient Descent