Abstract: The training of over-parameterized neural networks has been studied extensively in recent literature. An important consideration is the regularization of over-parameterized networks, given their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss and thereby lead to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing noise injection by adding noise to the weight matrices before backpropagation yields limited empirical improvement. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both the positive and negative directions of the random noise. In particular, this two-point estimate cancels the first-order term of the Taylor expansion, and with it that term's variance, in the estimate of the Hessian penalty. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches to sharpness-reduced training, delivering up to a 2.4% increase in test accuracy when fine-tuning ResNets on six image classification datasets. Moreover, with our approach the trace of the Hessian is reduced by 15.8% and its largest eigenvalue by 9.7%. We also find that regularizing the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
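
The difference between one-sided noise injection and the two-point estimate can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a quadratic loss f(w) = 0.5 w^T A w + b^T w (so the Hessian is exactly A), and the function names, dimension, noise scale sigma, and sample count are all hypothetical choices made for illustration. For noise u drawn from N(0, sigma^2 I), both estimators have expectation 0.5 sigma^2 tr(A), but only the one-sided version carries the first-order term g^T u, where g is the gradient at w.

```python
import numpy as np


def quadratic_loss(w, A, b):
    """Toy loss f(w) = 0.5 * w^T A w + b^T w, whose Hessian is exactly A."""
    return 0.5 * w @ A @ w + b @ w


def one_point_gap(loss, w, u):
    """One-sided noise injection: f(w + u) - f(w).

    Its Taylor expansion contains the first-order term grad^T u.
    """
    return loss(w + u) - loss(w)


def two_point_gap(loss, w, u):
    """Two-point estimate: average of f(w + u) and f(w - u), minus f(w).

    The first-order terms at +u and -u cancel, leaving 0.5 * u^T H u
    (exactly, for a quadratic loss).
    """
    return 0.5 * (loss(w + u) + loss(w - u)) - loss(w)


def compare_estimators(dim=5, sigma=0.1, num_samples=20000, seed=0):
    """Compare the mean and variance of both estimators on a random quadratic."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((dim, dim))
    A = M @ M.T                      # symmetric positive semidefinite Hessian
    b = rng.standard_normal(dim)
    w = rng.standard_normal(dim)

    def loss(v):
        return quadratic_loss(v, A, b)

    # Isotropic Gaussian perturbations u ~ N(0, sigma^2 I), one per row.
    noise = sigma * rng.standard_normal((num_samples, dim))
    one = np.array([one_point_gap(loss, w, u) for u in noise])
    two = np.array([two_point_gap(loss, w, u) for u in noise])

    return {
        "target": 0.5 * sigma**2 * np.trace(A),  # E[0.5 * u^T A u]
        "one_point_mean": one.mean(),
        "two_point_mean": two.mean(),
        "one_point_var": one.var(),
        "two_point_var": two.var(),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.6f}")
```

In this toy setting, the two-point variance comes only from the second-order term, while the one-sided variance also includes sigma^2 times the squared gradient norm. For a real network the loss is not quadratic, so higher-order terms remain and the estimate of the Hessian trace is only approximately unbiased, as the abstract notes.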

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

Analysis of the expected $L_2$ error of an over-parametrized deep neural network estimate learned by gradient descent without regularization

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Harmless Overparametrization in Two-layer Neural Networks

Generalization Error Analysis of Neural networks with Gradient Based Regularization

Regularized Gauss-Newton for Optimizing Overparameterized Neural Networks

Consistency of Neural Networks with Regularization

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Benign Overfitting in Deep Neural Networks under Lazy Training

Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

On the Generalization Power of Overfitted Two-Layer Neural Tangent Kernel Models

A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks

Network as Regularization for Training Deep Neural Networks: Framework, Model and Performance

Per-Example Gradient Regularization Improves Learning Signals from Noisy Data

Over-parametrized neural networks as under-determined linear systems