Abstract: The training of over-parameterized neural networks has been studied extensively in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing noise injection by adding noise to the weight matrices before backpropagation yields limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance that the first-order Taylor expansion term adds to the Hessian estimate. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches on sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian is reduced by 15.8%, and the largest eigenvalue is reduced by 9.7% with our approach. We also find that the regularization of the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and chain-of-thought fine-tuning.
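The two-point idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the quadratic loss, the matrix `A`, the vector `b`, the helper `penalty_samples`, and the noise scale `sigma` are all hypothetical choices made for illustration. For a loss f with Hessian H, averaging f(w + u) and f(w - u) with u drawn from N(0, sigma^2 I) cancels the first-order term of the Taylor expansion, so the difference from f(w) has expectation (sigma^2 / 2) tr(H) when f is quadratic. The one-point version keeps the first-order term, which has zero mean but adds variance.

```python
import numpy as np

# Hypothetical toy loss (an assumption, not from the paper):
# f(w) = 0.5 * w^T A w + b^T w, whose Hessian is exactly A.
A = np.diag([1.0, 2.0, 3.0])
b = np.array([5.0, -4.0, 3.0])


def loss(w):
    return 0.5 * w @ A @ w + b @ w


def penalty_samples(loss_fn, w, sigma, num_samples, rng, two_point=True):
    """Return per-sample estimates of (sigma^2 / 2) * trace(Hessian) at w.

    two_point=True  averages the loss at w + u and w - u, which cancels
                    the first-order Taylor term for every sample.
    two_point=False evaluates the loss only at w + u, so the first-order
                    term remains and contributes extra variance.
    """
    base = loss_fn(w)
    out = np.empty(num_samples)
    for i in range(num_samples):
        # Isotropic Gaussian perturbation of the weights.
        u = sigma * rng.standard_normal(w.shape)
        if two_point:
            out[i] = 0.5 * (loss_fn(w + u) + loss_fn(w - u)) - base
        else:
            out[i] = loss_fn(w + u) - base
    return out


rng = np.random.default_rng(0)
w = np.array([0.3, -0.2, 0.1])
sigma = 0.1

two = penalty_samples(loss, w, sigma, 20000, rng, two_point=True)
one = penalty_samples(loss, w, sigma, 20000, rng, two_point=False)

# Exact value of the penalty for the quadratic toy loss.
target = 0.5 * sigma**2 * np.trace(A)

print("target penalty      :", target)
print("two-point mean, std :", two.mean(), two.std())
print("one-point mean, std :", one.mean(), one.std())
```

In this toy setting both estimators target the same quantity, but the per-sample spread of the one-point estimate is dominated by the gradient term, which is the effect the two-point construction is designed to remove. In actual training one would differentiate the averaged loss with respect to the weights rather than only evaluate it.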

Stability for the training of deep neural networks and other classifiers

Stability of accuracy for the training of DNNs via the uniform doubling condition

Do stable neural networks exist for classification problems? -- A new view on stability in AI

The Boundaries of Verifiable Accuracy, Robustness, and Generalisation in Deep Learning

Measuring and Mitigating Local Instability in Deep Neural Networks

Improving the Robustness of Deep Neural Networks via Stability Training

Exploring the Stability Gap in Continual Learning: The Role of the Classification Head

Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training

Stability Analysis and Generalization Bounds of Adversarial Training

Overcoming the Stability Gap in Continual Learning

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Dynamic Learning Rate for Neural Networks: A Fixed-Time Stability Approach

Stability of decision trees and logistic regression

Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

Noise Sensitivity and Stability of Deep Neural Networks for Binary Classification

Data-Dependent Stability Analysis of Adversarial Training

Stability and Generalization in Free Adversarial Training

Inducing Neural Collapse in Imbalanced Learning: Do We Really Need a Learnable Classifier at the End of Deep Neural Network?

On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages

On Loss Functions for Deep Neural Networks in Classification