Abstract: An algorithm is said to be adaptive to a certain parameter of the problem if it does not need a priori knowledge of that parameter yet performs competitively with algorithms that know it. This dissertation presents our work on adaptive algorithms in the following scenarios: 1. In the stochastic optimization setting, we only receive stochastic gradients, and the level of noise in evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing them. 2. In training deep neural networks, the gradient magnitudes across coordinates can spread over a very wide range unless normalization techniques, like BatchNorm, are employed. In such situations, algorithms that do not address this disparity in gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and demonstrated their benefits in empirical experiments. 3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet this condition does not capture the properties of some deep learning objective functions, including those involving Long Short-Term Memory networks and Transformers. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this condition, we show that a generalized SignSGD algorithm can theoretically match the best-known convergence rates obtained by SGD with gradient clipping without needing explicit clipping at all, and that it can empirically match the performance of Adam and beat other methods. Moreover, it can also be made to adapt automatically to the unknown relaxed smoothness.
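
To make the notion of a scale-free update in scenario 2 concrete, here is a minimal sketch, assuming a plain sign-of-momentum rule on a toy quadratic. It is not the dissertation's generalized SignSGD. The function names, the curvature values, and the hyperparameters are all hypothetical and chosen only for illustration. The point it shows is that the iterates do not change when every gradient is multiplied by a constant, and that coordinates with very different gradient magnitudes move at the same pace.

```python
import numpy as np

def sign_momentum_step(x, m, grad, lr=0.01, beta=0.9):
    """One sign-of-momentum update (illustrative assumption, not the author's exact method)."""
    m = beta * m + (1.0 - beta) * grad   # exponential moving average of gradients
    x = x - lr * np.sign(m)              # only the sign of the average is used
    return x, m

def run(scale, steps=50):
    # Toy objective: 0.5 * scale * sum(curvature * x**2), with per-coordinate
    # curvatures spanning six orders of magnitude (hypothetical values).
    curvature = np.array([1e-3, 1.0, 1e3])
    x = np.ones(3)
    m = np.zeros(3)
    for _ in range(steps):
        grad = scale * curvature * x
        x, m = sign_momentum_step(x, m, grad)
    return x

x_small = run(scale=1.0)
x_large = run(scale=1e6)   # same problem with every gradient a million times larger
print(x_small, x_large)
```

Because the update depends on the gradients only through their signs, both runs follow the same trajectory, whereas a plain gradient step with a fixed learning rate would need retuning for each scale.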

Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

Tensor Programs II: Neural Tangent Kernel for Any Architecture

Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics

Feature Learning in Infinite-Width Neural Networks

Old Optimizer, New Norm: An Anthology

The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Continuous-Time Analysis of Adaptive Optimization and Normalization

Adaptive Gradient Methods at the Edge of Stability

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

On the infinite width limit of neural networks with a standard parameterization

Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

Adaptive Strategies in Non-convex Optimization

On Exact Computation with an Infinitely Wide Neural Net

Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

Generalizing Adam to Manifolds for Efficiently Training Transformers

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Evolution of Neural Tangent Kernels under Benign and Adversarial Training

A Generalizable Approach to Learning Optimizers

Convergence rates for the Adam optimizer

Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees