Abstract: The training of over-parameterized neural networks has received much study in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection by adding noise to the weight matrices before backpropagation yields limited empirical improvements. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order term in the Taylor expansion underlying the Hessian estimate. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches to sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, with our approach the trace of the Hessian is reduced by 15.8% and the largest eigenvalue by 9.7%. We also find that the regularization of the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and chain-of-thought fine-tuning.
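
The cancellation of the first-order term can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a small quadratic loss whose Hessian is a known matrix `A`, and all names (`loss`, `one_point_gap`, `two_point_gap`, `estimate`) are hypothetical. It compares the one-sided perturbation gap f(w + u) - f(w) with the two-sided gap (f(w + u) + f(w - u)) / 2 - f(w), where u is isotropic Gaussian noise with standard deviation `sigma`. Both gaps have expectation (sigma^2 / 2) * tr(A) for this quadratic, but only the one-sided gap carries the gradient-times-noise term.

```python
import random


def loss(w, A, b):
    """Toy quadratic loss 0.5 * w^T A w + b^T w; its Hessian is A (assumed symmetric)."""
    n = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    return 0.5 * quad + sum(b[i] * w[i] for i in range(n))


def shifted(w, u, sign):
    """Return w + sign * u as a new list."""
    return [wi + sign * ui for wi, ui in zip(w, u)]


def one_point_gap(w, u, A, b):
    """One-sided gap f(w + u) - f(w); contains the first-order term grad^T u."""
    return loss(shifted(w, u, 1.0), A, b) - loss(w, A, b)


def two_point_gap(w, u, A, b):
    """Two-sided gap; the first-order terms at +u and -u cancel."""
    plus = loss(shifted(w, u, 1.0), A, b)
    minus = loss(shifted(w, u, -1.0), A, b)
    return 0.5 * (plus + minus) - loss(w, A, b)


def estimate(gap_fn, w, A, b, sigma, num_samples, seed):
    """Monte Carlo mean and variance of a gap under u ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        u = [rng.gauss(0.0, sigma) for _ in w]
        samples.append(gap_fn(w, u, A, b))
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var


# Hypothetical toy values, chosen only for illustration.
A = [[2.0, 0.5], [0.5, 1.0]]
b = [3.0, -1.0]
w = [1.0, 2.0]
sigma = 0.1

# Expected gap for a quadratic: (sigma^2 / 2) * trace(A).
target = 0.5 * sigma ** 2 * (A[0][0] + A[1][1])

one_mean, one_var = estimate(one_point_gap, w, A, b, sigma, 20000, seed=0)
two_mean, two_var = estimate(two_point_gap, w, A, b, sigma, 20000, seed=0)

print("target:", target)
print("one-point mean, variance:", one_mean, one_var)
print("two-point mean, variance:", two_mean, two_var)
```

On this quadratic the two-sided gap equals u^T A u / 2 exactly, so its variance comes only from the second-order term. For a real network the higher-order Taylor terms would not vanish, and the gap would be computed on mini-batch losses rather than a closed-form function.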

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_2$ Regularizer

Robust Implicit Regularization via Weight Normalization

Three Mechanisms of Weight Decay Regularization

Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks

The Efficacy of Regularization in Two Layer Neural Networks

Optimization and Generalization Guarantees for Weight Normalization

FixNorm: Dissecting Weight Decay for Training Deep Neural Networks

Adaptive Weight Decay for Deep Neural Networks

Scaling-Based Weight Normalization for Deep Neural Networks

Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree Spectral Bias of Neural Networks

Volumization as a Natural Generalization of Weight Decay

Weight Compander: A Simple Weight Reparameterization for Regularization

Improving Generalization of Deep Neural Networks by Optimum Shifting

Linearly Constrained Weights: Reducing Activation Shift for Faster Training of Neural Networks

Towards Better Robust Generalization with Shift Consistency Regularization

Weight Decay with Tailored Adam on Scale-Invariant Weights for Better Generalization