Scalable Second Order Optimization for Deep Learning

Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, Yoram Singer
DOI: https://doi.org/10.48550/arXiv.2002.09018
2021-03-05
Abstract: Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper addresses how to implement an efficient second-order optimization method for large-scale deep learning. Specifically, the authors aim to bridge the gap between second-order optimization methods that are strong in theory (such as Newton's method) and their efficient implementation in practice. First-order gradient methods (such as stochastic gradient descent, SGD) dominate because of their lower computation, memory, and communication costs, whereas second-order methods, despite better convergence properties, have seen little use because of their high costs.

To solve this problem, the authors propose a scalable implementation of a second-order preconditioned method, specifically a variant of full-matrix Adagrad. Through a series of algorithmic and numerical improvements, the method delivers significant improvements in convergence speed and wall-clock time on modern deep learning models. It also exploits the prevalent heterogeneous hardware architecture (a multicore CPU coupled with multiple accelerator units) to improve training efficiency.

### The core contributions of the paper include:

1. **Design and implementation of a pipelined version of the optimization algorithm**: it makes full use of the heterogeneity and computing power of the coupled CPU-accelerator architecture.
2. **Extension of the Shampoo algorithm**: the extension makes it applicable to a wider range of deep architectures, especially those with very large layers (such as embedding layers).
3. **Replacement of expensive spectral decomposition**: an efficient and numerically stable iterative method computes the roots of positive definite matrices, which reduces the computational cost.
4. **Description of challenges and limitations in practical design**: these lessons are valuable for the design of next-generation accelerator hardware architectures.

### The experimental results show that:

- In the machine translation task, the number of training steps of the Transformer model is reduced by 50%, and the overall training time is reduced by 45%.
- In the language modeling task, the number of training steps of the BERT model is reduced by 16%, and a higher Masked LM accuracy is achieved.
- In the click-through rate prediction task, the number of training steps of the DLRM model is reduced by half, the overall training time is reduced by 37.5%, and the AUC is improved by 0.56%.
- In the image classification task, the number of training steps of the ResNet-50 model is reduced by 31.7%, and the overall training time is reduced by 13%.

In general, by proposing a scalable second-order optimization method, the paper achieves significant performance improvements on multiple large-scale deep learning tasks and provides valuable insights for future hardware and software design.
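
The third contribution above, computing roots of positive definite matrices iteratively instead of by spectral decomposition, can be illustrated with a minimal NumPy sketch. The summary does not name the iterative method, so the sketch assumes a coupled Newton iteration for the inverse p-th root and plugs it into a basic Shampoo-style update for a single matrix parameter. The function names `inverse_pth_root` and `shampoo_step` and the parameters `ridge`, `tol`, `max_iters`, and `lr` are hypothetical, chosen for illustration; this is not the authors' implementation.

```python
import numpy as np


def inverse_pth_root(A, p, ridge=1e-6, tol=1e-10, max_iters=100):
    """Approximate (A + ridge*I)^(-1/p) for a symmetric PSD matrix A.

    Assumed method: a coupled Newton iteration built only from matrix
    products, so no eigendecomposition or SVD is required.
    """
    n = A.shape[0]
    I = np.eye(n)
    A = A + ridge * I                          # keep A strictly positive definite
    z = (1 + p) / (2 * np.linalg.norm(A, 2))   # scale by the spectral norm
    M = z * A                                  # invariant: M == X^p @ A
    X = z ** (1.0 / p) * I
    for _ in range(max_iters):
        if np.max(np.abs(M - I)) < tol:        # M -> I means X -> A^(-1/p)
            break
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M
    return X


def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style update for a matrix parameter W with gradient G.

    L and R accumulate the left and right second-order statistics; the
    gradient is preconditioned on both sides by their inverse 4th roots.
    """
    L = L + G @ G.T                            # left statistics, shape (m, m)
    R = R + G.T @ G                            # right statistics, shape (n, n)
    precond_grad = inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    return W - lr * precond_grad, L, R
```

A caller would initialize `L` and `R` to zero matrices of shape `(m, m)` and `(n, n)` for an `m x n` parameter and thread them through successive calls. For simplicity the sketch recomputes both roots on every step; the pipelined CPU-accelerator design and the handling of very large layers from contributions 1 and 2 are not shown.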