Abstract: Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization, and in particular the critical role of small random initialization, are not fully understood. In this paper, we take a step towards demystifying this role by proving that small random initialization followed by a few iterations of gradient descent behaves akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well. Concretely, we focus on the problem of reconstructing a low-rank matrix from a few measurements via a natural nonconvex formulation. In this setting, we show that the trajectory of the gradient descent iterations from small random initialization can be approximately decomposed into three phases: (I) a spectral or alignment phase, where we show that the iterates have an implicit spectral bias akin to spectral initialization, so that at the end of this phase the column space of the iterates and that of the underlying low-rank matrix are sufficiently aligned; (II) a saddle avoidance/refinement phase, where we show that the trajectory of the gradient iterates moves away from certain degenerate saddle points; and (III) a local refinement phase, where we show that after avoiding the saddles the iterates converge quickly to the underlying low-rank matrix. Underlying our analysis are insights for the analysis of overparameterized nonconvex optimization schemes that may have implications for computational problems beyond low-rank reconstruction.
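
The spectral phase described in the abstract can be illustrated numerically. The following is a minimal sketch and not the paper's exact setup. It assumes a symmetric factorization X = U Uᵀ, symmetric Gaussian measurement matrices, and the loss f(U) = (1/4m) Σᵢ (⟨Aᵢ, U Uᵀ⟩ − yᵢ)². The sizes `n`, `r`, `k`, `m`, the initialization scale `alpha`, the step size `mu`, and the iteration counts are illustrative choices. The sketch compares the early gradient descent iterates with a power-method-like iteration on the matrix M = (1/m) Σᵢ yᵢ Aᵢ, then reports the final reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k, m = 20, 2, 5, 1000   # dimension, true rank, factor width (k > r), measurements
alpha, mu = 1e-6, 0.1         # initialization scale and step size (illustrative values)

# Ground truth X* = Q Q^T with orthonormal Q, so both nonzero eigenvalues equal 1.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
X_star = Q @ Q.T

# Symmetric Gaussian measurement matrices A_i, flattened into the rows of A_flat.
G = rng.standard_normal((m, n, n))
A_flat = ((G + G.transpose(0, 2, 1)) / 2).reshape(m, n * n)
y = A_flat @ X_star.ravel()   # y_i = <A_i, X*>

# Spectral surrogate M = (1/m) * sum_i y_i A_i; its expectation is X*.
M = (A_flat.T @ y).reshape(n, n) / m

def grad(U):
    """Gradient of f(U) = (1/4m) * sum_i (<A_i, U U^T> - y_i)^2 for symmetric A_i."""
    residual = A_flat @ (U @ U.T).ravel() - y
    return (A_flat.T @ residual).reshape(n, n) @ U / m

U0 = alpha * rng.standard_normal((n, k))   # small random initialization
U, U_power = U0.copy(), U0.copy()
t_early, t_total = 50, 1000
for t in range(t_total):
    U = U - mu * grad(U)                          # gradient descent step
    if t < t_early:
        U_power = U_power + mu * M @ U_power      # step of (I + mu * M), power-method-like
    if t + 1 == t_early:
        # Relative gap between gradient descent and the linearized iteration.
        early_gap = np.linalg.norm(U - U_power) / np.linalg.norm(U_power)

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
print("relative gap to power iteration after", t_early, "steps:", early_gap)
print("final relative reconstruction error:", rel_err)
```

While U Uᵀ is negligible compared with X*, the gradient is approximately −M U, so each step multiplies the iterate by (I + mu·M). This is why the early iterates track a power iteration in this sketch. A larger `alpha` shortens this phase.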

Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction
