Abstract: The fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods apply standardization or normalization that maps the weight $\boldsymbol{W}$ to $\boldsymbol{W}'$, which makes $\boldsymbol{W}'$ independent of the magnitude of $\boldsymbol{W}$. Surprisingly, $\boldsymbol{W}$ must still be decayed during gradient descent; otherwise we observe a severe under-fitting problem. This is counter-intuitive, since weight decay is widely known as a way to prevent deep networks from over-fitting. In this paper, we \emph{theoretically} prove that the weight decay term $\frac{1}{2}\lambda||{\boldsymbol{W}}||^2$ merely modulates the effective learning rate to improve optimization of the objective, and has no influence on generalization when it is combined with the weight normalization family. Furthermore, we expose several critical problems that arise when a weight decay term is added to the weight normalization family, including the absence of a global minimum and training instability. To address these problems, we propose an $\epsilon$-shifted $L_2$ regularizer, which shifts the $L_2$ objective by a positive constant $\epsilon$. This simple operation theoretically guarantees the existence of a global minimum, while preventing the network weights from becoming too small and thus avoiding floating-point overflow in the gradients. It significantly improves training stability and achieves slightly better performance in our experiments. The effectiveness of the $\epsilon$-shifted $L_2$ regularizer is comprehensively validated on the ImageNet, CIFAR-100, and COCO datasets. Our code and pretrained models will be released at https://github.com/implus/PytorchInsight.
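The scale-invariance property described in the abstract can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes a single linear unit whose weight is normalized as $\boldsymbol{W}' = \boldsymbol{W}/||\boldsymbol{W}||_2$ with no learnable gain, and the names `normalize`, `loss`, and `grad` are hypothetical. It shows that rescaling $\boldsymbol{W}$ leaves the loss unchanged, shrinks the gradient by the same factor, and that the gradient is orthogonal to $\boldsymbol{W}$.

```python
import math

def normalize(w):
    """Assumed normalization (no learnable gain): w' = w / ||w||_2."""
    norm = math.sqrt(sum(a * a for a in w))
    return [a / norm for a in w]

def loss(w, x, y):
    """Squared error of one linear unit that uses the normalized weight."""
    w_hat = normalize(w)
    pred = sum(a * b for a, b in zip(w_hat, x))
    return 0.5 * (pred - y) ** 2

def grad(w, x, y):
    """Analytic gradient of the loss w.r.t. the raw (unnormalized) weight."""
    norm = math.sqrt(sum(a * a for a in w))
    w_hat = [a / norm for a in w]
    pred = sum(a * b for a, b in zip(w_hat, x))
    err = pred - y
    # d(pred)/dw = (x - (w_hat . x) * w_hat) / ||w||
    return [err * (xi - pred * wi) / norm for xi, wi in zip(x, w_hat)]

# Arbitrary illustrative values (not taken from the paper).
w = [0.3, -1.2, 0.5]
x = [1.0, 2.0, -1.0]
y = 0.7
scale = 10.0
w_big = [scale * a for a in w]

g_small = grad(w, x, y)
g_big = grad(w_big, x, y)

# The loss does not depend on the magnitude of w.
same_loss = math.isclose(loss(w, x, y), loss(w_big, x, y))
# The gradient shrinks by the same factor that w grows.
grad_ratio = [a / b for a, b in zip(g_small, g_big)]
# The gradient is orthogonal to w, so a plain SGD step cannot shrink ||w||.
dot_wg = sum(a * b for a, b in zip(w, g_small))

print(same_loss, grad_ratio, dot_wg)
```

Because the gradient scales inversely with $||\boldsymbol{W}||$, the size of each update relative to the weight depends on the weight norm. This is consistent with the abstract's claim that decaying $\boldsymbol{W}$ acts on the effective learning rate rather than on model capacity.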

Weight-Sharing Regularization

PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantization

Robust Implicit Regularization via Weight Normalization

Explicit Regularization via Regularizer Mirror Descent

Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon$-shifted $L_2$ Regularizer

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

Implicit Regularization Paths of Weighted Neural Representations

Volumization as a Natural Generalization of Weight Decay

Weight Compander: A Simple Weight Reparameterization for Regularization

Learning Symmetries via Weight-Sharing with Doubly Stochastic Tensors

Deeper Insights into Weight Sharing in Neural Architecture Search

MMA Regularization: Decorrelating Weights of Neural Networks by Maximizing the Minimal Angles

Drastically Reducing the Number of Trainable Parameters in Deep CNNs by Inter-layer Kernel-sharing

Improved Generalization of Weight Space Networks via Augmentations

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks

Shake-Shake regularization

Decoupled Weight Decay for Any $p$ Norm

The Role of Regularization in Shaping Weight and Node Pruning Dependency and Dynamics

Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs