Abstract: The fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods standardize or normalize the weight $\boldsymbol{W}$ into $\boldsymbol{W}'$, which makes $\boldsymbol{W}'$ independent of the magnitude of $\boldsymbol{W}$. Surprisingly, $\boldsymbol{W}$ must still be decayed during gradient descent; otherwise we observe severe under-fitting, which is counter-intuitive because weight decay is widely known as a way to prevent deep networks from over-fitting. In this paper, we \emph{theoretically} prove that the weight decay term $\frac{1}{2}\lambda\|\boldsymbol{W}\|^2$ merely modulates the effective learning rate, which improves optimization of the objective, and has no influence on generalization when combined with the weight normalization family. We also expose several critical problems that arise when a weight decay term is introduced into the weight normalization family, including the absence of a global minimum and training instability. To address these problems, we propose an $\epsilon$-shifted $L_2$ regularizer, which shifts the $L_2$ objective by a positive constant $\epsilon$. This simple operation theoretically guarantees the existence of a global minimum, while preventing the network weights from becoming too small and thus avoiding floating-point overflow in the gradients. It significantly improves training stability and achieves slightly better performance in our experiments. The effectiveness of the $\epsilon$-shifted $L_2$ regularizer is comprehensively validated on the ImageNet, CIFAR-100, and COCO datasets. Our code and pretrained models will be released at https://github.com/implus/PytorchInsight.
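The scale-invariance argument in the abstract can be illustrated with a small, dependency-free sketch. It assumes the simplest member of the weight normalization family, $\boldsymbol{W}' = \boldsymbol{W}/\|\boldsymbol{W}\|$ with no learned gain, applied to a single linear unit with a squared-error loss. The abstract does not state the exact form of the $\epsilon$-shifted regularizer, so the form $\frac{1}{2}\lambda(\|\boldsymbol{W}\| - \epsilon)^2$ used below is an assumption made for illustration, and all function names are hypothetical.

```python
import math


def norm(w):
    """Euclidean norm of a weight vector."""
    return math.sqrt(sum(x * x for x in w))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(w, x, y):
    """Squared error of a linear unit that uses W' = W / ||W||."""
    n = norm(w)
    pred = dot([wi / n for wi in w], x)
    return 0.5 * (pred - y) ** 2


def grad_loss(w, x, y):
    """Analytic gradient of `loss` with respect to the raw weight W."""
    n = norm(w)
    w_hat = [wi / n for wi in w]
    pred = dot(w_hat, x)
    # d(pred)/dW = (x - (w_hat . x) * w_hat) / ||W||
    return [(pred - y) * (xi - pred * wi) / n for xi, wi in zip(x, w_hat)]


def l2_penalty(w, lam):
    """Standard weight decay term: 0.5 * lam * ||W||^2."""
    return 0.5 * lam * norm(w) ** 2


def shifted_l2_penalty(w, lam, eps):
    """ASSUMED form of the shifted term: 0.5 * lam * (||W|| - eps)^2."""
    return 0.5 * lam * (norm(w) - eps) ** 2


w, x, y = [3.0, 4.0], [1.0, 2.0], 0.5
w_big = [10.0 * wi for wi in w]

# The loss ignores the magnitude of W ...
same_loss = math.isclose(loss(w, x, y), loss(w_big, x, y))
# ... but the gradient shrinks as 1 / ||W||, so ||W|| acts like an
# inverse learning-rate multiplier on the normalized direction.
grad_ratio = norm(grad_loss(w, x, y)) / norm(grad_loss(w_big, x, y))
# The gradient is orthogonal to W, so plain SGD never shrinks ||W||.
radial_component = dot(grad_loss(w, x, y), w)

print(same_loss, grad_ratio, radial_component)
```

Under these assumptions, the plain $L_2$ term is minimized only at $\|\boldsymbol{W}\| = 0$, where the normalization itself is undefined, whereas the shifted term is minimized on the sphere $\|\boldsymbol{W}\| = \epsilon$, which keeps the $1/\|\boldsymbol{W}\|$ factor in the gradient bounded.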

Do deep nets really need weight decay and dropout?

Why Do We Need Weight Decay in Modern Deep Learning?

Adaptive Weight Decay for Deep Neural Networks

Understanding Decoupled and Early Weight Decay

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

FixNorm: Dissecting Weight Decay for Training Deep Neural Networks

Late Breaking Results: Weight Decay is ALL You Need for Neural Network Sparsification.

How Does Data Diversity Shape the Weight Landscape of Neural Networks?

Weight Decay with Tailored Adam on Scale-Invariant Weights for Better Generalization.

Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence

Can we avoid Double Descent in Deep Neural Networks?

Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

AutoDropout: Learning Dropout Patterns to Regularize Deep Networks

Decoupled Weight Decay for Any $p$ Norm

On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Three Mechanisms of Weight Decay Regularization

Guided Dropout: Improving Deep Networks Without Increased Computation

An Analysis of Weight Decay As a Methodology of Reducing Three-Layer Feedforward Artificial Neural Networks for Classification Problems

Excitation Dropout: Encouraging Plasticity in Deep Neural Networks

Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_2$ Regularizer