Abstract: Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].

What problem does this paper attempt to address?

This paper addresses the problem of efficiently approximating local curvature information (second-order derivatives of the loss function) in deep neural networks, and in particular of computing inverse-Hessian vector products (IHVPs). Existing methods have high computational or storage costs, which limits their practicality. The authors therefore propose two new algorithms, one for network compression and one for optimization, for the case where the Hessian is approximated as a sum of $m$ rank-one matrices. Both compute IHVPs in time linear in the model dimension $d$ without resorting to block-wise approximation, which reduces computational overhead and improves network pruning and optimization. The two main contributions are:

1. **Static algorithm**: This algorithm targets network compression. Given a fully trained model, it estimates IHVPs and the diagonal elements of the inverse Hessian, which are needed to determine the "optimal" pruning update. After $O(dm^2)$ precomputation, a recursive formulation computes an IHVP in $O(dm)$ time and returns any single element of the inverse Hessian in $O(m)$ time.
2. **Dynamic algorithm**: This algorithm targets preconditioned SGD, in which the stochastic gradient is multiplied by the estimated inverse Hessian. The estimate is maintained over a sliding window of optimization steps, so gradients can be added or removed without recomputing the second-order statistics from scratch. By recursively computing and dynamically maintaining intermediate quantities, the algorithm replaces any gradient in the sliding window in $O(dm + m^3)$ time and computes an IHVP in $O(dm + m^2)$ time.
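The static setting can be illustrated with a small pure-Python sketch. It is not taken from the M-FAC implementation: it assumes a dampened empirical Fisher $F = \lambda I + \frac{1}{m}\sum_i g_i g_i^\top$ (the dampening constant `damp` is an assumption, as the summary does not mention one) and applies the Sherman-Morrison formula once per gradient. The function names `precompute`, `ihvp`, and `inverse_entry` are hypothetical. For simplicity the sketch stores one $d$-dimensional vector per gradient, which may differ from the more compact bookkeeping used in the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def precompute(grads, damp):
    """O(d m^2): build v_i = F_{i-1}^{-1} g_i and q_i = m + g_i . v_i."""
    m = len(grads)
    vs, qs = [], []
    for g in grads:
        # Start from F_0^{-1} g = g / damp, then apply earlier rank-one corrections.
        v = [gi / damp for gi in g]
        for vj, qj in zip(vs, qs):
            c = dot(vj, g) / qj
            v = [a - c * b for a, b in zip(v, vj)]
        vs.append(v)
        qs.append(m + dot(g, v))
    return vs, qs

def ihvp(vs, qs, damp, x):
    """O(d m): F^{-1} x = x / damp - sum_i v_i (v_i . x) / q_i."""
    out = [xi / damp for xi in x]
    for v, q in zip(vs, qs):
        c = dot(v, x) / q
        out = [a - c * b for a, b in zip(out, v)]
    return out

def inverse_entry(vs, qs, damp, a, b):
    """O(m): a single entry (a, b) of F^{-1}."""
    base = 1.0 / damp if a == b else 0.0
    return base - sum(v[a] * v[b] / q for v, q in zip(vs, qs))

# Toy example with m = 2 gradients in d = 3 dimensions (made-up numbers).
grads = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
damp = 0.1
vs, qs = precompute(grads, damp)
x = [1.0, -2.0, 0.5]
y = ihvp(vs, qs, damp, x)

# Sanity check: multiplying y by F (without ever forming F) should recover x.
fy = [damp * yi + sum(g[i] * dot(g, y) for g in grads) / len(grads)
      for i, yi in enumerate(y)]
```

The $d \times d$ matrix is never formed: only the $m$ vectors `vs` and the $m$ scalars `qs` are kept, which is what makes the approach matrix-free in this sketch.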
Both algorithms incur lower computational overhead than existing second-order methods and compute IHVPs in time linear in $d$, which makes them more practical to apply. The experiments reported in the paper show state-of-the-art results for neural network pruning and optimization.
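The sliding-window setting can be sketched in a similar spirit. The example below deliberately swaps in a different technique from the paper's recursion: it uses the Woodbury identity with a cached $m \times m$ Gram matrix, $F^{-1}x = \frac{1}{\lambda}\left(x - G^\top(\lambda m I + GG^\top)^{-1} G x\right)$, where the rows of $G$ are the gradients in the window. This variant happens to have the same asymptotic costs as those quoted above, $O(dm + m^3)$ per replaced gradient and $O(dm + m^2)$ per IHVP, but it is only an illustration under assumptions. The class name `SlidingWindowIHVP`, its methods, and the dampening constant `damp` are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def cho_solve(l, b):
    """Solve (L L^T) z = b by forward then backward substitution, O(m^2)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(l[k][i] * z[k] for k in range(i + 1, n))) / l[i][i]
    return z

class SlidingWindowIHVP:
    def __init__(self, grads, damp):
        self.grads = [list(g) for g in grads]
        self.damp = damp
        # One-off O(d m^2) cost to build the Gram matrix G G^T.
        self.gram = [[dot(gi, gj) for gj in self.grads] for gi in self.grads]
        self._refactor()

    def _refactor(self):
        # O(m^3): factor damp * m * I + G G^T.
        m = len(self.grads)
        k = [row[:] for row in self.gram]
        for i in range(m):
            k[i][i] += self.damp * m
        self.chol = cholesky(k)

    def replace(self, slot, g):
        """Swap one gradient: O(d m) Gram update plus O(m^3) refactorization."""
        self.grads[slot] = list(g)
        for j, gj in enumerate(self.grads):
            val = dot(self.grads[slot], gj)
            self.gram[slot][j] = val
            self.gram[j][slot] = val
        self._refactor()

    def ihvp(self, x):
        """O(d m + m^2): apply the Woodbury form of F^{-1} to x."""
        gx = [dot(g, x) for g in self.grads]
        coef = cho_solve(self.chol, gx)
        out = list(x)
        for c, g in zip(coef, self.grads):
            out = [o - c * gi for o, gi in zip(out, g)]
        return [o / self.damp for o in out]

# Toy window of m = 2 gradients in d = 3 dimensions (made-up numbers).
win = SlidingWindowIHVP([[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]], damp=0.1)
win.replace(0, [2.0, 0.0, -1.0])   # the oldest gradient leaves, a new one enters
x = [1.0, -2.0, 0.5]
y = win.ihvp(x)

# Sanity check: F y should recover x for the current window.
fy = [win.damp * yi + sum(g[i] * dot(g, y) for g in win.grads) / len(win.grads)
      for i, yi in enumerate(y)]
```

In a preconditioned SGD loop, `replace` would be called once per step with the newest gradient and `ihvp` would produce the preconditioned update direction. A production version would use a tensor library rather than Python lists.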

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

Error Feedback Can Accurately Compress Preconditioners

A Trace-restricted Kronecker-Factored Approximation to Natural Gradient

SKFAC: Training Neural Networks with Faster Kronecker-Factored Approximate Curvature

An Efficient Fisher Matrix Approximation Method for Large-Scale Neural Network Optimization

Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning

Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks

Convolutional Neural Network Training with Distributed K-FAC

Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning

Optimization with Access to Auxiliary Information

Block Mean Approximation for Efficient Second Order Optimization

Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks

Eva: A General Vectorized Approximation Framework for Second-order Optimization

FAGH: Accelerating Federated Learning with Approximated Global Hessian

Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures

A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians

Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC