Abstract: Natural gradient descent has the remarkable property that, in the small learning rate limit, it is invariant with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group of its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property enjoy greater efficiency when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit as the learning rate vanishes is invariant under smooth reparameterizations, the respective flows of the parameters being determined by equivariant maps. By casting this property as a natural transformation, we allow for generalizations beyond equivariance with respect to group actions; this framework can account for non-invertible maps such as projections, creating a framework for the direct comparison of training behavior across non-isomorphic network architectures, and the formal examination of limiting behavior as network size increases by considering inverse limits of these projections, should they exist. We present a simple method of introducing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.
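
The invariance described in the abstract can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper: the two-parameter `model`, the reparameterization `phi`, the regression `TARGET`, and the use of the Gauss-Newton matrix J^T J as the metric (which coincides with the Fisher information for a unit-variance Gaussian likelihood) are all assumptions made for the example. It compares the flow velocity in the original parameters with the velocity computed in the reparameterized coordinates and pushed forward through the Jacobian of `phi`.

```python
import numpy as np

TARGET = np.array([1.0, 0.5, -0.2])  # hypothetical regression target


def model(theta):
    """Hypothetical toy 'network': maps 2 parameters to 3 outputs."""
    a, b = theta
    return np.array([a * b, a + b ** 2, np.sin(a)])


def phi(psi):
    """Hypothetical smooth, invertible reparameterization theta = phi(psi)."""
    u, v = psi
    return np.array([np.exp(u), u + 2.0 * v])


def jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian of fn at x (one column per input)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((fn(x + step) - fn(x - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)


def flow_velocities(outputs_of, params):
    """Return (gradient-flow velocity, natural-gradient-flow velocity)."""
    J = jacobian(outputs_of, params)
    # Gradient of the loss 0.5 * ||outputs - TARGET||^2 w.r.t. params.
    grad = J.T @ (outputs_of(params) - TARGET)
    # Assumed metric: Gauss-Newton pullback J^T J.
    metric = J.T @ J
    return -grad, -np.linalg.solve(metric, grad)


psi = np.array([0.3, -0.4])
theta = phi(psi)
J_phi = jacobian(phi, psi)

# Velocities computed directly in theta coordinates.
gd_theta, ng_theta = flow_velocities(model, theta)
# Velocities computed in psi coordinates for the same function.
gd_psi, ng_psi = flow_velocities(lambda p: model(phi(p)), psi)

# Push the psi-space velocities forward into theta coordinates.
gd_pushed = J_phi @ gd_psi
ng_pushed = J_phi @ ng_psi

print("natural gradient agrees:", np.allclose(ng_pushed, ng_theta, atol=1e-5))
print("plain gradient agrees:  ", np.allclose(gd_pushed, gd_theta, atol=1e-2))
```

Under these assumptions the natural-gradient velocities agree up to finite-difference error, while the plain gradient velocities do not, because the pushed-forward gradient picks up the factor J_phi J_phi^T. This illustrates only the vanishing-learning-rate statement; with a finite step size the agreement is approximate.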

Analysis of natural gradient descent for multilayer neural networks

Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

Bound Analysis of Natural Gradient Descent in Stochastic Optimization Setting

Learning Time-Scales in Two-Layers Neural Networks

A Comparative Analysis of Optimization and Generalization Properties of Two-Layer Neural Network and Random Feature Models under Gradient Descent Dynamics

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks

Thermodynamic Natural Gradient Descent

Is All Learning (Natural) Gradient Descent?

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Achieving High Accuracy with PINNs via Energy Natural Gradients

Unnatural Algorithms in Machine Learning

Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank

Natural-gradient learning for spiking neurons

A Geometric Approach of Gradient Descent Algorithms in Linear Neural Networks

The duality structure gradient descent algorithm: analysis and applications to neural networks

The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks

A Novel Structured Natural Gradient Descent for Deep Learning