M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

Elias Frantar, Eldar Kurtic, Dan Alistarh
DOI: https://doi.org/10.48550/arXiv.2107.03356
2021-11-18
Abstract: Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].
Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of efficiently approximating local curvature information (i.e., second-derivative information of the loss function) in deep neural networks, and in particular of computing inverse-Hessian vector products (IHVPs). Existing methods have high computational or storage costs, which limits their practical use. The authors therefore propose two new algorithms, one for network compression and one for optimization, that compute IHVPs in time linear in the model dimension $d$ without resorting to block-wise approximation, reducing computational overhead and improving the effectiveness of network pruning and optimization. The two main contributions are:

1. **Static algorithm**: This algorithm targets network compression. Given a fully trained model, pruning requires IHVPs and diagonal elements of the inverse Hessian in order to determine the "optimal" pruning update. After an $O(dm^2)$ recursive precomputation over the $m$ rank-one terms, the algorithm computes IHVPs accurately in $O(dm)$ time and can query any single element of the inverse Hessian in $O(m)$ time.
2. **Dynamic algorithm**: This algorithm targets preconditioned SGD, in which the stochastic gradient is preconditioned by the estimated inverse Hessian. It allows gradients to be added to or removed from a sliding window without recomputing the second-order statistics from scratch. By recursively computing and dynamically maintaining intermediate information, it can replace any gradient in the sliding window in $O(dm + m^3)$ time and compute IHVPs in $O(dm + m^2)$ time.
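The static algorithm described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a dampened empirical Fisher of the form $F = \lambda I + \frac{1}{m}\sum_{i=1}^{m} g_i g_i^\top$ (the dampening $\lambda$ and the $1/m$ scaling are assumptions, as are the function names `precompute_static`, `static_ihvp`, and `inverse_entry`) and applies the Sherman-Morrison formula once per rank-one term. Storing the vectors $v_i = F_{i-1}^{-1} g_i$ and scalars $q_i = m + g_i^\top v_i$ gives costs of the same order as those quoted above: $O(dm^2)$ precomputation, $O(dm)$ per IHVP, and $O(m)$ per inverse entry.

```python
import numpy as np


def precompute_static(grads, damp):
    """O(d m^2) precomputation for F = damp * I + (1/m) * sum_i g_i g_i^T.

    Hypothetical helper: returns V (rows v_i = F_{i-1}^{-1} g_i) and
    q (entries q_i = m + g_i^T v_i), where F_0 = damp * I.
    """
    grads = np.asarray(grads, dtype=float)
    m, _ = grads.shape
    V = np.zeros_like(grads)
    q = np.zeros(m)
    for i in range(m):
        g = grads[i]
        # Apply F_{i-1}^{-1} to g_i using the terms stored so far.
        coeffs = (V[:i] @ g) / q[:i]
        V[i] = g / damp - coeffs @ V[:i]
        q[i] = m + g @ V[i]
    return V, q


def static_ihvp(V, q, x, damp):
    """O(d m) inverse-Hessian vector product F^{-1} x."""
    return x / damp - ((V @ x) / q) @ V


def inverse_entry(V, q, a, b, damp):
    """O(m) query of the single entry (F^{-1})[a, b]."""
    value = -np.sum(V[:, a] * V[:, b] / q)
    if a == b:
        value += 1.0 / damp
    return value
```

On a small problem the result can be checked against a dense inverse, for example by comparing `static_ihvp(V, q, x, damp)` with `np.linalg.inv(F) @ x`. The dense matrix is never needed by the sketch itself, which is the sense in which the approach is matrix-free.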
Both algorithms compute IHVPs in time linear in the model dimension $d$ and, according to the abstract, incur lower computational overhead than existing second-order methods. Experimental results show that they achieve state-of-the-art performance in neural network pruning and optimization.
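The sliding-window setting can be sketched in a similar spirit. The code below does not reproduce the paper's dynamic recursion. It swaps in the Woodbury identity, $F^{-1} x = \frac{1}{\lambda}\left(x - G^\top (\lambda m I + G G^\top)^{-1} G x\right)$, for the same assumed dampened Fisher $F = \lambda I + \frac{1}{m} G^\top G$, because that identity leads to costs of the same order: $O(dm + m^3)$ to replace a gradient and $O(dm + m^2)$ per IHVP. The class name `SlidingWindowIHVP`, the zero-initialized window, and the explicit $m \times m$ inverse are all illustrative assumptions.

```python
import numpy as np


class SlidingWindowIHVP:
    """Woodbury-based sketch for F = damp * I + (1/m) * G^T G over a window.

    Hypothetical class: G holds the m most recent gradients as rows and
    starts as all zeros, so initially F = damp * I.
    """

    def __init__(self, dim, window, damp):
        self.m = window
        self.damp = damp
        self.G = np.zeros((window, dim))
        self.S = np.zeros((window, window))  # Gram matrix G G^T
        self.K_inv = np.eye(window) / (damp * window)
        self.next_slot = 0

    def add(self, g):
        """Replace the oldest gradient: O(d m) Gram update + O(m^3) inverse."""
        i = self.next_slot
        self.G[i] = g
        row = self.G @ g  # inner products of g with every stored gradient
        self.S[i, :] = row
        self.S[:, i] = row
        K = self.damp * self.m * np.eye(self.m) + self.S
        self.K_inv = np.linalg.inv(K)
        self.next_slot = (i + 1) % self.m

    def ihvp(self, x):
        """F^{-1} x in O(d m + m^2), without forming the d x d matrix."""
        coeffs = self.K_inv @ (self.G @ x)
        return (x - coeffs @ self.G) / self.damp
```

In a preconditioned SGD loop, one would call `add(grad)` with each new stochastic gradient and then use `ihvp(grad)` as the update direction. Learning rate, momentum, and dampening would need to be chosen separately.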