How Does Learning Rate Decay Help Modern Neural Networks?

Kaichao You, Mingsheng Long, Jianmin Wang, Michael I. Jordan
DOI: https://doi.org/10.48550/arxiv.1908.01878
2019-01-01
Abstract: Learning rate decay (lrDecay) is a de facto technique for training modern neural networks. It starts with a large learning rate and then decays it multiple times. It is empirically observed to help both optimization and generalization. Common beliefs in how lrDecay works come from the optimization analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate accelerates training or helps the network escape spurious local minima; 2) decaying the learning rate helps the network converge to a local minimum and avoid oscillation. Despite the popularity of these common beliefs, experiments suggest that they are insufficient in explaining the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data while decaying the learning rate improves the learning of complex patterns. The proposed explanation is validated on a carefully-constructed dataset with tractable pattern complexity. And its implication, that additional patterns learned in later stages of lrDecay are more complex and thus less transferable, is justified in real-world datasets. We believe that this alternative explanation will shed light into the design of better training strategies for modern neural networks.
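To make "starts with a large learning rate and then decays it multiple times" concrete, here is a minimal sketch of a step-decay schedule in plain Python. The function name `step_decay_lr`, the base rate of 0.1, the milestones at epochs 30, 60, and 90, and the decay factor of 0.1 are illustrative assumptions, not values taken from the abstract.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Return the learning rate for `epoch` under step decay.

    All default values are hypothetical and chosen only for illustration.
    """
    # Count how many decay milestones have been reached so far.
    num_decays = sum(1 for m in milestones if epoch >= m)
    # Multiply the base rate by gamma once per milestone reached.
    return base_lr * gamma ** num_decays


# Learning rate for each epoch of a hypothetical 100-epoch run.
schedule = [step_decay_lr(epoch) for epoch in range(100)]
```

Under these assumed settings the rate stays at its initial large value until the first milestone, then drops by a factor of ten at each of the three milestones.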