Adaptive Gradient Methods Can Be Provably Faster Than SGD after Finite Epochs

Xunpeng Huang, Hao Zhou, Runxin Xu, Zhe Wang, Lei Li
2020-01-01
Abstract: Adaptive gradient methods have attracted much attention from machine learning communities due to their high efficiency. However, their acceleration effect in practice, especially in neural network training, is hard to analyze theoretically. The huge gap between theoretical convergence results and practical performance prevents further understanding of existing optimizers and the development of more advanced optimization methods. In this paper, we provide a novel analysis of adaptive gradient methods under an additional mild assumption, and revise AdaGrad to match a better provable convergence rate. To find an ϵ-approximate first-order stationary point in non-convex objectives, we prove that the revised method with random shuffling achieves an Õ(T^{-1/2}) convergence rate, which is significantly improved by factors of Õ(T^{-1/4}) and Õ(T^{-1/6}) compared with existing adaptive gradient methods and random shuffling SGD, respectively. To the best of our knowledge, this is the first demonstration that adaptive gradient methods can deterministically be faster than SGD after finite epochs. Furthermore, we conduct comprehensive experiments to validate the additional mild assumption and the acceleration effect gained from second moments and random shuffling.
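The abstract does not spell out the revised AdaGrad update, so the following is only a minimal sketch of the two ingredients it names: a second-moment accumulator (standard diagonal AdaGrad) and random shuffling (each epoch visits every sample once in a fresh random order). The function name `adagrad_random_shuffling`, its parameters, and the toy least-squares problem are all hypothetical and chosen for illustration. This is not the paper's algorithm.

```python
import numpy as np

def adagrad_random_shuffling(grad_i, x0, n, epochs, lr=0.5, eps=1e-8, seed=0):
    """Standard diagonal AdaGrad with random shuffling (illustrative only).

    grad_i(x, i) returns the gradient of the i-th component function at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)  # running sum of squared gradients (second moment)
    for _ in range(epochs):
        # Random shuffling: sample without replacement, one full pass per epoch.
        for i in rng.permutation(n):
            g = grad_i(x, i)
            v += g * g
            x -= lr * g / (np.sqrt(v) + eps)
    return x

# Toy problem (assumed for illustration): f_i(x) = 0.5 * (a_i . x - b_i)^2
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=(20, 3))
b = A @ np.array([1.0, -2.0, 0.5])

def grad_i(x, i):
    return (A[i] @ x - b[i]) * A[i]

def full_grad_norm(x):
    # Norm of the averaged gradient, the first-order stationarity measure.
    return np.linalg.norm(A.T @ (A @ x - b) / len(b))

x_hat = adagrad_random_shuffling(grad_i, np.zeros(3), n=len(b), epochs=200)
```

Comparing `full_grad_norm(x_hat)` with `full_grad_norm(np.zeros(3))` gives a rough check that the iterate has moved toward a first-order stationary point, which is the quantity the stated convergence rates bound.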