Abstract: Second-order methods such as KFAC can be useful for neural net training. However, they are often memory-inefficient since their preconditioning Kronecker factors are dense, and numerically unstable in low precision as they require matrix inversion or decomposition. These limitations render such methods unpopular for modern mixed-precision training. We address them by (i) formulating an inverse-free KFAC update and (ii) imposing structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent (SINGD). On modern neural networks, we show that SINGD is memory-efficient and numerically robust, in contrast to KFAC, and often outperforms AdamW even in half precision. Our work closes a gap between first- and second-order methods in modern low-precision training.

What problem does this paper attempt to address?

The paper addresses two shortcomings of second-order optimization methods such as KFAC in modern neural network training:

1. **Memory efficiency**: KFAC's preconditioning Kronecker factors are dense matrices, so the optimizer consumes a large amount of memory. This makes such methods impractical for large-scale models.
2. **Numerical stability**: These methods require matrix inversion or decomposition, which is numerically unstable in low-precision floating-point formats (such as half precision, BFP-16). This instability limits their use in modern mixed-precision training.

To address these problems, the paper proposes Structured Inverse-Free Natural Gradient Descent (SINGD), which combines two main improvements:

1. **Inverse-free KFAC update**: An inverse-free KFAC (IKFAC) update replaces matrix inversion with matrix subtraction in a matrix logarithm space. This update is more numerically stable, especially in low-precision training.
2. **Structured Kronecker factors**: Structures (such as diagonal, low-rank, Toeplitz, and hierarchical) are imposed on the Kronecker factors, which significantly reduces memory consumption and computational cost. These structured factors make SINGD more memory- and compute-efficient than traditional second-order methods while maintaining good performance.

With these improvements, SINGD is memory-efficient and numerically robust, performs well on a variety of modern architectures (such as convolutional neural networks and Transformer models), and runs stably in half-precision training. This narrows the gap between first-order and second-order optimization methods in modern low-precision neural network training.
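The two ideas can be illustrated with a small NumPy sketch. It is not the paper's SINGD update rule; it only shows, under stated assumptions, why a Kronecker-factored preconditioner can be applied without inversion once factors `K` and `C` with `K @ K.T = inv(A)` and `C @ C.T = inv(B)` are available, and how many entries a dense factor stores compared with a diagonal one. The layer sizes, the damping value, and all variable names are hypothetical, and the Cholesky-based construction of `K` and `C` is used here only to verify the identity (an inverse-free method would maintain such factors directly).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 6, 4, 32  # hypothetical layer and batch sizes

# Hypothetical per-layer statistics: layer inputs and output gradients.
a = rng.standard_normal((batch, d_in))
g = rng.standard_normal((batch, d_out))
damping = 0.1  # assumed damping value

# Dense Kronecker factors, as in a KFAC-style method.
A = a.T @ a / batch + damping * np.eye(d_in)    # shape (d_in, d_in)
B = g.T @ g / batch + damping * np.eye(d_out)   # shape (d_out, d_out)
G = rng.standard_normal((d_out, d_in))          # stand-in weight gradient

# Inversion-based preconditioning: computes inv(B) @ G @ inv(A).
step_inverse = np.linalg.solve(B, G) @ np.linalg.inv(A)

# Inverse-free application: with K @ K.T == inv(A) and C @ C.T == inv(B),
# the same step needs only matrix multiplications.
# (K and C are built from Cholesky factors here purely for verification.)
K = np.linalg.inv(np.linalg.cholesky(A)).T
C = np.linalg.inv(np.linalg.cholesky(B)).T
step_matmul = C @ C.T @ G @ K @ K.T

same_step = np.allclose(step_inverse, step_matmul)

# Stored entries per layer: dense factors versus diagonal factors.
dense_entries = d_in**2 + d_out**2
diag_entries = d_in + d_out

print("steps agree:", same_step)
print("dense entries:", dense_entries, "diagonal entries:", diag_entries)
```

The dense factors grow quadratically with the layer width while diagonal factors grow linearly, which is the kind of saving structured factors aim for; other structures mentioned above (low-rank, Toeplitz, hierarchical) sit between these two extremes.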

Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC
