Stochastic normalized gradient descent with momentum for large-batch training
Shen-Yi Zhao, Chang-Wei Shi, Yin-Peng Xie, Wu-Jun Li
DOI: https://doi.org/10.1007/s11432-022-3892-8
2024-10-26
Science China Information Sciences
Abstract: Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning. Compared with small-batch SGD, large-batch SGD can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results show that large-batch training typically leads to a drop in generalization accuracy. Hence, guaranteeing generalization ability in large-batch training remains a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that, with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), one of the most widely used variants of SGD, to converge to an ε-stationary point. Empirical results on deep learning verify that, when adopting the same large batch size, SNGM achieves better test accuracy than MSGD and other state-of-the-art large-batch training methods.
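The abstract names the method but does not state its update rule. As a minimal sketch, assuming the update normalizes the mini-batch gradient by its Euclidean norm before accumulating it into the momentum buffer, one SNGM step can be contrasted with one MSGD step as follows. The function names `sngm_step` and `msgd_step`, the parameters `lr` and `beta`, and the small constant `eps` are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def sngm_step(w, u, grad, lr=0.1, beta=0.9, eps=1e-12):
    """One assumed SNGM step: normalize the gradient, then apply momentum.

    w, u, grad are NumPy arrays of the same shape. eps is a hypothetical
    safeguard against division by zero, not part of the abstract.
    """
    norm = np.linalg.norm(grad)
    u_new = beta * u + grad / (norm + eps)  # momentum over unit-length directions
    w_new = w - lr * u_new
    return w_new, u_new

def msgd_step(w, u, grad, lr=0.1, beta=0.9):
    """One standard momentum SGD step, shown for comparison."""
    u_new = beta * u + grad  # momentum over raw gradients
    w_new = w - lr * u_new
    return w_new, u_new

# Toy usage: the same direction at two very different gradient scales.
w0 = np.array([3.0, 4.0])
u0 = np.zeros(2)
g = np.array([3.0, 4.0])

w_small, _ = sngm_step(w0, u0, g)
w_large, _ = sngm_step(w0, u0, 1000.0 * g)
w_msgd, _ = msgd_step(w0, u0, 1000.0 * g)
```

Under this assumption the SNGM step length does not depend on the gradient's magnitude: both SNGM calls above move `w0` by essentially the same amount, while the MSGD step grows in proportion to the gradient. Because each normalized gradient has norm at most 1, the momentum buffer's norm is bounded by 1 / (1 - beta).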
computer science, information systems; engineering, electrical & electronic