Abstract: We consider minimizing finite-sum and expectation objective functions via Hessian-averaging-based subsampled Newton methods. These methods allow for gradient inexactness and have fixed per-iteration Hessian approximation costs. Recent work (Na et al. 2023) demonstrated that Hessian averaging can achieve a fast $\mathcal{O}\left(\sqrt{\tfrac{\log k}{k}}\right)$ local superlinear convergence rate for strongly convex functions with high probability, while maintaining fixed per-iteration Hessian costs. Those methods, however, require exact gradients and strong convexity, which pose challenges for practical implementation. To address this concern, we consider Hessian-averaged methods that allow gradient inexactness via norm-condition-based adaptive-sampling strategies. For the finite-sum problem we utilize deterministic sampling techniques, which lead to global linear and sublinear convergence rates for strongly convex and nonconvex functions, respectively. In this setting we derive an improved deterministic local superlinear convergence rate of $\mathcal{O}\left(\tfrac{1}{k}\right)$. For the expectation problem we utilize stochastic sampling techniques and derive global linear and sublinear rates for strongly convex and nonconvex functions, as well as an $\mathcal{O}\left(\tfrac{1}{\sqrt{k}}\right)$ local superlinear convergence rate, all in expectation. We present novel analysis techniques that differ from the previous probabilistic results. Additionally, we propose scalable and efficient variations of these methods via diagonal approximations and derive the novel diagonally-averaged Newton (Dan) method for large-scale problems. Our numerical results demonstrate that Hessian averaging not only helps with convergence, but can also lead to state-of-the-art performance on difficult problems such as CIFAR100 classification with ResNets.
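To make the idea of Hessian averaging concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a ridge-regularized least-squares finite-sum objective, a uniform running average of fixed-size subsampled Hessians, a unit step size, and, for simplicity, an exact full gradient (the methods in the abstract additionally allow inexact gradients controlled by a norm condition, which is omitted here). The function name `hessian_averaged_newton` and all of its parameters are hypothetical.

```python
import numpy as np


def hessian_averaged_newton(X, y, lam=1e-2, hess_batch=64, iters=50, seed=0):
    """Sketch of a Hessian-averaged subsampled Newton iteration.

    Objective (an assumption made for illustration):
        f(w) = 1/(2n) * ||X w - y||^2 + (lam/2) * ||w||^2
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    H_avg = np.zeros((d, d))  # running uniform average of subsampled Hessians

    for k in range(iters):
        # Exact gradient (simplification: no adaptive gradient sampling here).
        grad = X.T @ (X @ w - y) / n + lam * w

        # Fixed-cost Hessian estimate from a mini-batch of hess_batch rows.
        idx = rng.choice(n, size=hess_batch, replace=False)
        H_k = X[idx].T @ X[idx] / hess_batch + lam * np.eye(d)

        # Uniform averaging: H_avg becomes the mean of H_0, ..., H_k.
        H_avg = (k * H_avg + H_k) / (k + 1)

        # Newton-type step using the averaged Hessian and a unit step size.
        w = w - np.linalg.solve(H_avg, grad)

    return w
```

Each iteration samples only `hess_batch` rows for curvature, so the per-iteration Hessian cost stays fixed, while the averaged matrix draws on all past samples and therefore becomes a progressively better curvature estimate.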

Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods
