Abstract: The fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods apply standardization or normalization that maps the weight $\boldsymbol{W}$ to $\boldsymbol{W}'$, making $\boldsymbol{W}'$ independent of the magnitude of $\boldsymbol{W}$. Surprisingly, $\boldsymbol{W}$ must still be decayed during gradient descent; otherwise we observe severe under-fitting, which is counter-intuitive because weight decay is widely known as a way to prevent deep networks from over-fitting. In this paper, we \emph{theoretically} prove that the weight decay term $\frac{1}{2}\lambda||{\boldsymbol{W}}||^2$ merely modulates the effective learning rate to improve optimization of the objective, and has no influence on generalization when it is combined with the weight normalization family. Furthermore, we expose several critical problems that arise when the weight decay term is introduced to the weight normalization family, including the absence of a global minimum and training instability. To address these problems, we propose an $\epsilon$-shifted $L_2$ regularizer, which shifts the $L_2$ objective by a positive constant $\epsilon$. This simple operation theoretically guarantees the existence of a global minimum, while preventing the network weights from becoming too small and thus avoiding floating-point overflow in the gradients. It significantly improves training stability and achieves slightly better performance in our experiments. The effectiveness of the $\epsilon$-shifted $L_2$ regularizer is comprehensively validated on the ImageNet, CIFAR-100, and COCO datasets. Our code and pretrained models will be released at <a class="link-external link-https" href="https://github.com/implus/PytorchInsight" rel="external noopener nofollow">this https URL</a>.
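
The scale-invariance argument in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single linear unit, the squared-error loss, and the helper names (`normalize`, `loss`, `grad`, `shifted_l2`) are hypothetical choices made for illustration. In particular, the penalty form $\frac{1}{2}\lambda(||\boldsymbol{W}|| - \epsilon)^2$ is only an assumed reading of "shifts the $L_2$ objective by a positive constant $\epsilon$"; the abstract does not state the exact formula.

```python
import numpy as np

def normalize(w):
    """Weight normalization: keep only the direction of w."""
    return w / np.linalg.norm(w)

def loss(w, x, y):
    """Squared error of a linear unit that uses the normalized weight."""
    return 0.5 * (normalize(w) @ x - y) ** 2

def grad(w, x, y):
    """Analytic gradient of `loss` with respect to the unnormalized w."""
    n = np.linalg.norm(w)
    u = w / n
    residual = u @ x - y
    # d(u)/d(w) = (I - u u^T) / n, so the gradient shrinks as 1 / ||w||.
    return residual * (x - (u @ x) * u) / n

def shifted_l2(w, lam, eps):
    """ASSUMED form of the eps-shifted penalty: 0.5 * lam * (||w|| - eps)^2."""
    return 0.5 * lam * (np.linalg.norm(w) - eps) ** 2

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 0.3

# Rescaling w leaves the loss unchanged but divides the gradient by the scale.
same_loss = np.isclose(loss(10.0 * w, x, y), loss(w, x, y))
scaled_grad = np.allclose(grad(10.0 * w, x, y), grad(w, x, y) / 10.0)
print(same_loss, scaled_grad)
```

Because the loss ignores $||\boldsymbol{W}||$ while the gradient scales as $1/||\boldsymbol{W}||$, shrinking the weights through decay acts like raising the step size along the direction of $\boldsymbol{W}$. Under the assumed form, the shifted penalty vanishes at $||\boldsymbol{W}|| = \epsilon$ instead of pulling the norm toward zero, where the normalization is undefined.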

L2 Regularization versus Batch and Weight Normalization

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization

The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

Three Mechanisms of Weight Decay Regularization

Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_2$ Regularizer

Towards Understanding Regularization in Batch Normalization

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

Mean-field Analysis of Batch Normalization

Adaptive Weight Decay for Deep Neural Networks

Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

FixNorm: Dissecting Weight Decay for Training Deep Neural Networks

Sparse Deep Neural Networks Using $L_{1,\infty}$-Weight Normalization

L1-Norm Batch Normalization for Efficient Training of Deep Neural Networks

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

An Empirical Analysis of the Shift and Scale Parameters in BatchNorm

New Interpretations of Normalization Methods in Deep Learning

Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay

Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements