Abstract: Local SGD, a cornerstone algorithm in federated learning, is widely used in training deep neural networks and has been shown to have strong empirical performance. A theoretical understanding of such performance on nonconvex loss landscapes is currently lacking. Analysis of the global convergence of SGD is challenging, as the noise depends on the model parameters. Indeed, many works narrow their focus to GD and rely on injecting noise to enable convergence to a local or global optimum. When expanding the focus to local SGD, existing analyses in the nonconvex case can only guarantee finding stationary points, or they assume the neural network is overparameterized so as to guarantee convergence to the global minimum through neural tangent kernel analysis. In this work, we provide the first global convergence analysis of vanilla local SGD for two-layer neural networks without overparameterization and without injecting noise, when the input data is Gaussian. The main technical ingredients of our proof are a self-correction mechanism and a new exact recursive characterization of the direction of global model parameters. The self-correction mechanism guarantees the algorithm reaches a good region even if the initialization is in a bad region. A good (bad) region is one in which a gradient descent update moves the model closer to (away from) the optimal solution. The main difficulty in establishing a self-correction mechanism is to cope with the gradient dependency between the two layers. To address this challenge, we divide the landscape of the objective into several regions to carefully control the interference between the two layers during the correction process. As a result, we show that local SGD can correct the two layers and enter the good region in polynomial time.
After that, we establish a new exact recursive characterization of the direction of global parameters, which is the key to showing convergence to the global minimum with linear speedup in the number of machines and reduced communication rounds. Experiments on synthetic data confirm the theoretical results.
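
For readers unfamiliar with the algorithm, the following is a minimal sketch of a generic local SGD loop on a two-layer ReLU network with Gaussian inputs. It is not the authors' implementation: the abstract does not specify the architecture details, loss, step sizes, or data model, so the squared loss, the teacher network that generates labels, the function names (`forward`, `sgd_step`, `local_sgd`), and all hyperparameter values are assumptions made purely for illustration.

```python
import random

def forward(w, a, x):
    """Two-layer ReLU network f(x) = sum_j a[j] * relu(<w[j], x>).
    Returns the output and the hidden pre-activations."""
    pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    out = sum(aj * max(p, 0.0) for aj, p in zip(a, pre))
    return out, pre

def sgd_step(w, a, x, y, lr):
    """One SGD step on the (assumed) squared loss 0.5 * (f(x) - y)^2, in place."""
    out, pre = forward(w, a, x)
    err = out - y
    for j, p in enumerate(pre):
        if p > 0.0:  # the ReLU passes gradient only where it is active
            grad_a = err * p
            grad_w = [err * a[j] * xi for xi in x]  # uses a[j] before it is updated
            a[j] -= lr * grad_a
            w[j] = [wi - lr * g for wi, g in zip(w[j], grad_w)]

def local_sgd(w, a, teacher, rounds, machines, local_steps, lr, rng):
    """Each machine runs `local_steps` SGD steps from the current global
    model on fresh Gaussian samples; the server then averages the models."""
    d, hidden = len(w[0]), len(w)
    for _round in range(rounds):
        local_models = []
        for _machine in range(machines):
            wm, am = [row[:] for row in w], a[:]  # copy of the global model
            for _step in range(local_steps):
                x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Gaussian input
                y = forward(teacher[0], teacher[1], x)[0]    # hypothetical teacher label
                sgd_step(wm, am, x, y, lr)
            local_models.append((wm, am))
        # Communication round: average parameters across machines.
        w = [[sum(m[0][j][i] for m in local_models) / machines
              for i in range(d)] for j in range(hidden)]
        a = [sum(m[1][j] for m in local_models) / machines
             for j in range(hidden)]
    return w, a

# Illustrative usage with arbitrary sizes and step size (all assumed values).
rng = random.Random(0)
d, hidden = 4, 3
teacher = ([[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(hidden)],
           [1.0] * hidden)
w0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(hidden)]
a0 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
w, a = local_sgd(w0, a0, teacher, rounds=20, machines=4,
                 local_steps=5, lr=0.01, rng=rng)
```

The only communication happens in the averaging step, once per round, which is what lets local SGD trade extra local computation for fewer communication rounds. No noise is injected beyond the sampling of the data itself, and the hidden width here is small and fixed rather than overparameterized; whether this particular toy setup converges is not something the sketch guarantees.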

A global convergence theory for deep ReLU implicit networks via over-parameterization

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Convergence of Deep Neural Networks with General Activation Functions and Pooling

Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks

Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

On the Convergence of Deep Networks with Sample Quadratic Overparameterization

Convergence Analysis for Over-Parameterized Deep Learning

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Nonparametric regression using over-parameterized shallow ReLU neural networks

An Improved Analysis of Training Over-parameterized Deep Neural Networks

A Recipe for Global Convergence Guarantee in Deep Neural Networks

Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks

Global Convergence Analysis of Local SGD for Two-layer Neural Network Without Overparameterization

Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Implicit Regularization in ReLU Networks with the Square Loss

Rethinking Gauss-Newton for learning over-parameterized models