Abstract: Weight decay is a widely used technique for training deep neural networks (DNNs). It greatly affects generalization performance, but the underlying mechanisms are not fully understood. Recent works show that for layers followed by normalization, weight decay mainly affects the effective learning rate. However, although normalization has been extensively adopted in modern DNNs, some layers, such as the final fully-connected layer, do not satisfy this precondition. For these layers, the effects of weight decay are still unclear. In this paper, we comprehensively investigate the mechanisms of weight decay and find that, in addition to influencing the effective learning rate, weight decay has another distinct mechanism that is equally important: affecting generalization performance by controlling cross-boundary risk. Together, these two mechanisms give a more comprehensive explanation of the effects of weight decay. Based on this discovery, we propose a new training method called FixNorm, which discards weight decay and directly controls the two mechanisms. We also propose a simple yet effective method to tune the hyperparameters of FixNorm, which can find near-optimal solutions in a few trials. On the ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7%, which outperforms the original baseline by a clear margin. Surprisingly, when MobileNetV2 is scaled to the same FLOPS and trained with the same tricks as EfficientNet-B0, FixNorm achieves 77.4%, which is only 0.3% lower. A series of SOTA results shows the importance of well-tuned training procedures and further verifies the effectiveness of our approach. We set up more well-tuned baselines using FixNorm to facilitate fair comparisons in the community.
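The effective-learning-rate mechanism mentioned in the abstract can be illustrated with a small numerical sketch. This is not the FixNorm method. It assumes a hypothetical toy layer (a single weight vector followed by batch-style normalization without affine parameters) and a squared-error loss, both chosen only for illustration. Under these assumptions the loss is invariant to the scale of the weights, the gradient is orthogonal to the weights, and the gradient norm shrinks in proportion to 1/||w||. This is the usual argument for why shrinking ||w|| through weight decay acts like a larger learning rate on the weight direction.

```python
import numpy as np


def normalized_loss(w, X, y):
    """Toy loss: a linear layer followed by normalization (illustrative assumption)."""
    z = X @ w
    z_hat = (z - z.mean()) / z.std()  # normalization removes the scale of w
    return float(np.mean((z_hat - y) ** 2))


def numerical_grad(f, w, eps=1e-6):
    """Central-difference gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))  # synthetic inputs
y = rng.normal(size=64)       # synthetic targets
w = rng.normal(size=5)
scale = 3.0


def f(v):
    return normalized_loss(v, X, y)


grad_w = numerical_grad(f, w)
grad_scaled = numerical_grad(f, scale * w)

# The loss is unchanged when w is rescaled by a positive constant.
loss_gap = abs(f(w) - f(scale * w))
# The gradient norm shrinks by the same factor that w grows.
grad_ratio = np.linalg.norm(grad_w) / np.linalg.norm(grad_scaled)
# The gradient has no component along w.
radial_component = abs(w @ grad_w)

print(loss_gap, grad_ratio, radial_component)
```

Because the gradient at a rescaled weight vector is smaller by the rescaling factor, a plain SGD step with learning rate lr moves the direction w/||w|| roughly as a step of size lr/||w||^2 would on unit-norm weights. Layers that are not followed by normalization, such as the final fully-connected layer named in the abstract, do not have this invariance, so the argument does not carry over to them.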

Structure injected weight normalization for training deep networks

Efficient Structure Slimming for Spiking Neural Networks

Sparse Deep Neural Networks Using L1,∞-Weight Normalization

SUBP: Soft Uniform Block Pruning for 1×N Sparse CNNs Multithreading Acceleration

Projection Based Weight Normalization for Deep Neural Networks

Projection based weight normalization: Efficient method for optimization on oblique manifold in DNNs

Deep Spiking Neural Networks with Binary Weights for Object Recognition

New Interpretations of Normalization Methods in Deep Learning

Scaling-Based Weight Normalization for Deep Neural Networks

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks

FixNorm: Dissecting Weight Decay for Training Deep Neural Networks

StructADMM: A Systematic, High-Efficiency Framework of Structured Weight Pruning for DNNs

Weight Normalization based Quantization for Deep Neural Network Compression

The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

Explore the Knowledge contained in Network Weights to Obtain Sparse Neural Networks

Mean Spectral Normalization of Deep Neural Networks for Embedded Automation

Weight Conditioning for Smooth Optimization of Neural Networks

Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression

Understanding the Disharmony between Weight Normalization Family and Weight Decay: shifted Regularizer

A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM