Abstract: Averaging the iterates of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, through methods such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA). In particular, LAWA, which averages a finite number of weights, attains faster convergence and better generalization. However, its theoretical explanation remains underexplored, since there are fundamental differences between the finite and infinite settings. In this work, we first generalize SGD and LAWA as Finite Weight Averaging (FWA) and explain its advantages over SGD from the perspectives of optimization and generalization. The first key challenge is that traditional methods, which work in the sense of expectation or optimal values for infinite-dimensional settings, are inapplicable to analyzing FWA's convergence. The second is that the cumulative gradients introduced by FWA complicate the generalization analysis, especially when they must be treated under different assumptions. Extending the final-iteration convergence analysis to FWA, this paper establishes, under a convexity assumption, a convergence bound of $\mathcal{O}(\log\left(\frac{T}{k}\right)/\sqrt{T})$, where $k\in[1, T/2]$ is a constant denoting the number of final iterations that are averaged. Compared with the $\mathcal{O}(\log(T)/\sqrt{T})$ bound for SGD, this proves theoretically that FWA has a faster convergence rate and explains the effect of the number of averaged points. In the generalization analysis, we find a recursive representation for bounding the cumulative gradient using mathematical induction. We provide bounds for constant and decaying learning rates, in both the convex and non-convex cases, to show the good generalization performance of FWA. Finally, experimental results on several benchmarks verify our theoretical results.
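As an illustration of the idea described in the abstract, the following minimal sketch averages the last $k$ SGD iterates and evaluates the factor $\log(T/k)/\sqrt{T}$ that appears in the stated bound. It is not the authors' implementation: the function names (`fwa_sgd`, `noisy_quadratic_grad`, `bound_factor`), the noisy quadratic objective, and the $1/\sqrt{t}$ step-size schedule are all assumptions chosen for the example, and `bound_factor` ignores the constants hidden by the $\mathcal{O}(\cdot)$ notation.

```python
import math
import random
from collections import deque


def fwa_sgd(grad, w0, lr, T, k, seed=0):
    """Run T SGD steps; return (last iterate, average of the last k iterates)."""
    rng = random.Random(seed)
    w = list(w0)
    window = deque(maxlen=k)  # keeps only the k most recent iterates
    for t in range(1, T + 1):
        g = grad(w, rng)
        step = lr / math.sqrt(t)  # decaying step size (an assumption of this sketch)
        w = [wi - step * gi for wi, gi in zip(w, g)]
        window.append(w)
    # Coordinate-wise mean over the stored iterates
    avg = [sum(col) / len(window) for col in zip(*window)]
    return w, avg


def noisy_quadratic_grad(w, rng, noise=1.0):
    """Stochastic gradient of the convex toy objective f(w) = 0.5 * ||w||^2."""
    return [wi + rng.gauss(0.0, noise) for wi in w]


def bound_factor(T, k):
    """log(T/k) / sqrt(T); k = 1 recovers the SGD factor log(T) / sqrt(T)."""
    return math.log(T / k) / math.sqrt(T)


if __name__ == "__main__":
    T, k = 1000, 100
    last, averaged = fwa_sgd(noisy_quadratic_grad, [5.0, -3.0], lr=0.5, T=T, k=k)
    print("last iterate:    ", last)
    print("FWA (last k avg):", averaged)
    print("SGD factor:", bound_factor(T, 1), " FWA factor:", bound_factor(T, k))
```

Setting `k=1` makes the average coincide with the last SGD iterate, which matches the abstract's view of SGD as a special case of FWA. For a fixed $T$, the factor $\log(T/k)/\sqrt{T}$ decreases as $k$ grows within $[1, T/2]$.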

Better Generalization in Fast Training: Flat Trainable Weight in Subspace

Understanding the Training Dynamics in Federated Deep Learning via Aggregation Weight Optimization

Trainable Weight Averaging for Fast Convergence and Better Generalization

Trainable Weight Averaging: A General Approach for Subspace Training

Averaging Weights Leads to Wider Optima and Better Generalization

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Train Deep Neural Networks in 40-D Subspaces

Hierarchical Weight Averaging for Deep Neural Networks

A Unified Analysis for Finite Weight Averaging

Exploring Flat Minima for Domain Generalization With Large Learning Rates

A Layer-Based Sparsification Method for Distributed DNN Training

Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes

Understanding Training and Generalization in Deep Learning by Fourier Analysis

EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones

Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

Performance of Training Sparse Deep Neural Networks on GPUs

Fast Sparse Deep Neural Networks: Theory and Performance Analysis

Efficient Neural Network Training Via Forward and Backward Propagation Sparsification

SKFAC: Training Neural Networks with Faster Kronecker-Factored Approximate Curvature

Make Continual Learning Stronger via C-Flat