Abstract:

A Mathematical details
  A.1 Additional notes on setup, preliminaries
    A.1.1 Classical results on GD convergence, SGD convergence
    A.1.2 Notations for DeltaGrad with SGD
    A.1.3 Classical results for random variables
  A.2 Results for deterministic gradient descent
    A.2.1 Quasi-Newton
    A.2.2 Proof that Quasi-Hessians are well-conditioned
    A.2.3 Proof preliminaries
    A.2.4 Main recursions
    A.2.5 Proof of Theorem 2
    A.2.6 Proof of Theorem 3
    A.2.7 Proof of Theorem 4
    A.2.8 Proof of Theorem 5
  A.3 Results for stochastic gradient descent
    A.3.1 Quasi-Newton
    A.3.2 Proof preliminaries
    A.3.3 Main recursions
    A.3.4 Proof of Theorem 8
    A.3.5 Proof of Theorem 9
    A.3.6 Proof of Theorem 10
    A.3.7 Proof of Theorem 11

Appendix for DeltaGrad: Rapid retraining of machine learning models
