Abstract: Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize relatively well to unseen data. Recent work explains this by investigating the implicit bias of optimization algorithms. A notable advance is the work of Lyu & Li (2019), which proves that gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Besides first-order optimization algorithms such as GD, adaptive algorithms such as AdaGrad, RMSProp, and Adam are popular owing to their fast training. Meanwhile, numerous works have provided empirical evidence that adaptive methods may suffer from poor generalization performance. However, a theoretical explanation for the generalization of adaptive optimization algorithms is still lacking. In this paper, we study the implicit bias of adaptive optimization algorithms on homogeneous neural networks. In particular, we study the convergent direction of the parameters when these algorithms optimize the logistic loss. We prove that the convergent direction of Adam and RMSProp is the same as that of GD, while for AdaGrad, the convergent direction depends on the adaptive conditioner. Technically, we provide a unified framework to analyze the convergent direction of adaptive optimization algorithms by constructing a novel and nontrivial adaptive gradient flow and surrogate margin. The theoretical findings explain the better generalization of the exponential moving average strategy adopted by RMSProp and Adam. To the best of our knowledge, this is the first work to study the convergent direction of adaptive optimization algorithms on non-linear deep neural …
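The contrast the abstract draws between AdaGrad and the exponential moving average methods can be illustrated with a minimal sketch. It is not the paper's construction: the function names, the synthetic gradient sequence, and the placement of `eps` inside the square root are all assumptions made for illustration. The sketch compares the per-coordinate conditioner (the factor that rescales the gradient step) under a running sum of squared gradients, as in AdaGrad, and under an exponential moving average, as in RMSProp, when gradients shrink toward zero the way they do late in training on the logistic loss.

```python
import math

def adagrad_conditioner(grads, eps=1e-8):
    """Per-coordinate scaling 1/sqrt(eps + running SUM of squared gradients)."""
    acc = [0.0] * len(grads[0])
    for g in grads:
        acc = [a + gi * gi for a, gi in zip(acc, g)]
    return [1.0 / math.sqrt(eps + a) for a in acc]

def rmsprop_conditioner(grads, beta=0.9, eps=1e-8):
    """Per-coordinate scaling 1/sqrt(eps + exponential moving AVERAGE of squared gradients)."""
    acc = [0.0] * len(grads[0])
    for g in grads:
        acc = [beta * a + (1.0 - beta) * gi * gi for a, gi in zip(acc, g)]
    return [1.0 / math.sqrt(eps + a) for a in acc]

# Hypothetical gradient history: both coordinates decay geometrically,
# but the first coordinate starts ten times larger than the second.
grads = [[10.0 * 0.5 ** t, 1.0 * 0.5 ** t] for t in range(500)]

h_adagrad = adagrad_conditioner(grads)
h_rmsprop = rmsprop_conditioner(grads)

# Ratio of the two coordinates' scalings: 1.0 means an isotropic conditioner.
adagrad_ratio = h_adagrad[1] / h_adagrad[0]
rmsprop_ratio = h_rmsprop[1] / h_rmsprop[0]

print("AdaGrad conditioner ratio:", adagrad_ratio)
print("RMSProp conditioner ratio:", rmsprop_ratio)
```

In this toy setting the running sum never forgets the early gradients, so the AdaGrad scaling stays anisotropic, whereas the moving average decays with the gradients and the RMSProp scaling becomes nearly identical across coordinates. This is offered only as intuition for why the convergent direction of AdaGrad can depend on the conditioner while that of RMSProp and Adam matches GD; the paper's actual argument goes through an adaptive gradient flow and a surrogate margin.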

The Implicit Bias of AdaGrad on Separable Data

The Implicit Bias of Adam on Separable Data

The Implicit Bias of Gradient Descent on Separable Multiclass Data

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks

The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

AdaGrad under Anisotropic Smoothness

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training

On the Implicit Bias of Linear Equivariant Steerable Networks

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability

The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing

On the Convergence of AdaGrad(Norm) on $\mathbb{R}^{d}$: Beyond Convexity, Non-Asymptotic Rate and Acceleration

The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks

On the Implicit Bias of Adam

A Unified Approach to Controlling Implicit Regularization via Mirror Descent

Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition