Abstract: Optimization is an integral part of modern deep learning. Recently, the concept of learned optimizers has emerged as a way to accelerate this optimization process by replacing traditional, hand-crafted algorithms with meta-learned functions. Despite the initial promising results of these methods, issues with stability and generalization still remain, limiting their practical use. Moreover, their inner workings and behavior under different conditions are not yet fully understood, making it difficult to come up with improvements. For this reason, our work examines their optimization trajectories from the perspective of network architecture symmetries and parameter update distributions. Furthermore, by contrasting the learned optimizers with their manually designed counterparts, we identify several key insights that demonstrate how each approach can benefit from the strengths of the other.

What problem does this paper attempt to address?

This paper studies the training dynamics of learned optimizers, which aim to accelerate optimization in deep learning by replacing manually designed optimization algorithms with meta-learned ones. Although these methods show early promise, they still suffer from stability and generalization issues that limit their practical use. The paper analyzes the optimization trajectories of learned optimizers from the perspective of network architecture symmetries and parameter update distributions, and compares them with traditional hand-designed optimizers to identify how each approach can benefit from the strengths of the other.

The first finding concerns symmetry. Early in training, learned optimizers break the gradient constraints induced by network architecture symmetries to a greater extent than hand-designed optimizers such as Adam or SGD. The paper treats this deviation as a key factor in optimization performance: regularizing against it severely damages performance, which indicates that freedom in the parameter updates matters for learned optimizers. The second finding concerns noise and covariance in the parameter updates. The random noise in the updates of the learned optimizer is light-tailed but varies greatly between samples, which suggests that while the learned optimizer reduces noise, its parameter updates differ more from one sample to another.

The paper first introduces the basic concepts of learned optimizers, in particular L2O and the recently proposed Lion optimizer. It then presents the theoretical analysis, covering symmetries, gradient geometry, stochastic gradient noise, and update covariance. The experimental section and discussion follow, and the paper closes by connecting its results to previous research, showing similarities to the Lion optimizer and potential advantages over it. In summary, the paper addresses the problem of understanding the training dynamics of learned optimizers and of using that understanding to improve existing optimization algorithms, especially in terms of stability and generalization.
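The symmetry-induced gradient constraint mentioned in the summary can be illustrated with a small sketch. It is not taken from the paper: the toy loss, the learning rate of 0.1, and the names `loss`, `grad`, `sgd_update`, and `sign_update` are assumptions chosen for illustration. For a loss that is invariant to rescaling its weights (as happens for weights feeding into a normalization layer), the gradient is orthogonal to the weight vector. A plain gradient step keeps that orthogonality, whereas a sign-based step, loosely in the spirit of Lion, generally does not.

```python
import numpy as np

def loss(w, x, y):
    """Scale-invariant toy loss: depends on w only through w / ||w||."""
    u = w / np.linalg.norm(w)
    return 0.5 * (x @ u - y) ** 2

def grad(w, x, y):
    """Analytic gradient of `loss` with respect to w."""
    n = np.linalg.norm(w)
    u = w / n
    g_u = (x @ u - y) * x               # gradient with respect to u
    return (g_u - (g_u @ u) * u) / n    # chain rule through u = w / ||w||

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 1.0

g = grad(w, x, y)
sgd_update = -0.1 * g             # plain gradient step
sign_update = -0.1 * np.sign(g)   # sign-based step (illustrative assumption)

# The constraint <w, update> = 0 holds for the gradient step
# and is generally violated by the sign-based step.
sgd_violation = abs(w @ sgd_update)
sign_violation = abs(w @ sign_update)
print(f"gradient step:   |<w, dw>| = {sgd_violation:.2e}")
print(f"sign-based step: |<w, dw>| = {sign_violation:.2e}")
```

The quantity `|<w, dw>|` is only one possible way to measure how far an update departs from the constraint; the paper's own measure and regularizer may differ.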

Investigation into the Training Dynamics of Learned Optimizers

Learning to Optimize with Dynamic Mode Decomposition

A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases

Accelerated Optimization in Deep Learning with a Proportional-Integral-derivative Controller

Reverse engineering learned optimizers reveals known and novel mechanisms

A comparative study of recently deep learning optimizers

Narrowing the Focus: Learned Optimizers for Pretrained Models

Training Learned Optimizers with Randomly Initialized Learned Optimizers

Empirical Tests of Optimization Assumptions in Deep Learning

Optimization Insights into Deep Diagonal Linear Networks

Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy

Learning to optimize with convergence guarantees using nonlinear system theory

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

On Empirical Comparisons of Optimizers for Deep Learning

Practical tradeoffs between memory, compute, and performance in learned optimizers

Transformer-Based Learned Optimization

Old Optimizer, New Norm: An Anthology

Simmering: Sufficient is better than optimal for training neural networks

Learning by Turning: Neural Architecture Aware Optimisation

No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths

On Learnable Parameters of Optimal and Suboptimal Deep Learning Models