Abstract: It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer, fails to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and PyTorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it for several neural network learning problems, particularly in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems, the proposed learning-rate-adaptive variant of the Adam optimizer reduces the value of the objective function faster than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates, which we develop in this work.
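The idea of adjusting the learning rate from empirical estimates of the objective function can be illustrated on a one-dimensional quadratic problem. The following is a minimal sketch under stated assumptions, not the algorithm proposed in the paper: the objective E[0.5 (theta - X)^2] with X ~ N(mu, 1), the halving rule, and all names and parameters (`adaptive_sgd`, `shrink`, `check_every`, `n_samples`) are hypothetical choices made for illustration only.

```python
import random


def noisy_grad(theta, mu, rng):
    # Stochastic gradient of 0.5 * (theta - x)^2 for one sample x ~ N(mu, 1).
    return theta - rng.gauss(mu, 1.0)


def estimate_loss(theta, mu, rng, n_samples=256):
    # Monte Carlo estimate of the objective E[0.5 * (theta - X)^2].
    total = sum(0.5 * (theta - rng.gauss(mu, 1.0)) ** 2 for _ in range(n_samples))
    return total / n_samples


def adaptive_sgd(theta0, mu, lr0=0.5, shrink=0.5, check_every=200,
                 n_steps=10000, seed=0):
    # Illustrative rule (an assumption, not the paper's method): every
    # `check_every` steps, estimate the objective on fresh samples and
    # multiply the learning rate by `shrink` if the estimate did not
    # improve on the best estimate seen so far.
    rng = random.Random(seed)
    theta, lr = theta0, lr0
    best = estimate_loss(theta, mu, rng)
    lrs = [lr]
    for step in range(1, n_steps + 1):
        theta -= lr * noisy_grad(theta, mu, rng)
        if step % check_every == 0:
            current = estimate_loss(theta, mu, rng)
            if current >= best:
                lr *= shrink  # no measurable progress: reduce the step size
            else:
                best = current
            lrs.append(lr)
    return theta, lrs


if __name__ == "__main__":
    theta, lrs = adaptive_sgd(theta0=5.0, mu=1.0)
    print("final theta:", theta)
    print("final learning rate:", lrs[-1])
```

With a constant learning rate the iterates would keep fluctuating around the minimizer `mu`; in this sketch the learning rate only ever decreases, so the fluctuations shrink over time. The shrink factor and check interval here are arbitrary and would need tuning in practice.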

Locally Optimal Descent for Dynamic Stepsize Scheduling

Stochastic Steepest-Descent Optimization Of Multiple-Objective Mobile Sensor Coverage

Composing Optimized Stepsize Schedules for Gradient Descent

Optimal Linear Decay Learning Rate Schedules and Further Refinements

Acceleration by Stepsize Hedging I: Multi-Step Descent and the Silver Stepsize Schedule

Accelerated Gradient Descent by Concatenation of Stepsize Schedules

The Road Less Scheduled

Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

Dynamic Learning Rate Decay for Stochastic Variational Inference

Statistical Adaptive Stochastic Gradient Methods

An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

Probabilistic learning rate scheduler with provable convergence

Random Function Descent

Adaptive step size rules for stochastic optimization in large-scale learning

Accelerating Proximal Gradient Descent via Silver Stepsizes

DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule

Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization

Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems

Automatic, dynamic, and nearly optimal learning rate specification via local quadratic approximation

Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds