Abstract: The training of over-parameterized neural networks has been studied extensively in recent literature. An important consideration is how to regularize such networks, given their highly nonconvex and nonlinear loss geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss and lead to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing noise injection by adding noise to the weight matrices before backpropagation yields limited empirical improvements. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order Taylor expansion term in the estimate of the Hessian penalty. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches to sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, with our approach the trace of the Hessian is reduced by 15.8% and the largest eigenvalue by 9.7%. We also find that the regularization of the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
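
The difference between one-point and two-point noise injection can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a quadratic loss with a known Hessian `A`, and every name in it (`quadratic_loss`, `one_point_gap`, `two_point_gap`, `estimate_hessian_trace`, and the value of `sigma`) is hypothetical and chosen only for illustration. For noise U drawn from N(0, sigma^2 I), the averaged gap (f(W + U) + f(W - U)) / 2 - f(W) cancels the first-order term of the Taylor expansion, so rescaling it by 2 / sigma^2 estimates the trace of the Hessian with far less variance than the one-sided gap f(W + U) - f(W).

```python
import numpy as np


def quadratic_loss(w, A, b):
    """Toy loss f(w) = 0.5 * w^T A w + b^T w, whose Hessian is exactly A."""
    return 0.5 * w @ A @ w + b @ w


def one_point_gap(loss, w, u):
    """One-sided perturbation: keeps the first-order term grad^T u."""
    return loss(w + u) - loss(w)


def two_point_gap(loss, w, u):
    """Symmetric perturbation: the first-order terms at +u and -u cancel."""
    return 0.5 * (loss(w + u) + loss(w - u)) - loss(w)


def estimate_hessian_trace(loss, w, sigma, num_samples, rng, two_point=True):
    """Monte Carlo trace estimate from Gaussian noise u ~ N(0, sigma^2 I).

    Each sample is 2 * gap / sigma^2, whose expectation is tr(Hessian)
    up to higher-order terms (exactly tr(A) for a quadratic loss).
    Returns the sample mean and the per-sample standard deviation.
    """
    gap = two_point_gap if two_point else one_point_gap
    samples = np.array([
        2.0 * gap(loss, w, sigma * rng.standard_normal(w.shape)) / sigma**2
        for _ in range(num_samples)
    ])
    return samples.mean(), samples.std()


rng = np.random.default_rng(0)
dim = 20
M = rng.standard_normal((dim, dim))
A = M @ M.T / dim                 # symmetric positive semidefinite Hessian
b = rng.standard_normal(dim)
w = rng.standard_normal(dim)      # a point where the gradient is nonzero


def loss(v):
    return quadratic_loss(v, A, b)


true_trace = np.trace(A)
mean_2pt, std_2pt = estimate_hessian_trace(loss, w, 0.01, 2000, rng, True)
mean_1pt, std_1pt = estimate_hessian_trace(loss, w, 0.01, 2000, rng, False)

print("true trace:        ", true_trace)
print("two-point estimate:", mean_2pt, "per-sample std:", std_2pt)
print("one-point estimate:", mean_1pt, "per-sample std:", std_1pt)
```

In this toy setting, the one-point samples are dominated by the first-order term, whose magnitude scales like the gradient norm divided by `sigma`, while the two-point samples depend only on the curvature. In a training loop the same idea would be applied to gradients rather than loss values, which this sketch does not attempt.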
