Abstract: Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.

What problem does this paper attempt to address?

The paper addresses how to implement second-order optimization efficiently for large-scale deep learning. Specifically, the authors aim to bridge the gap between theoretically strong second-order methods (such as Newton's method) and efficient implementations that work in practice. First-order gradient methods (such as stochastic gradient descent, SGD) are widely used because of their lower computation, memory, and communication costs; second-order methods have better convergence properties but are rarely applied because of their high costs.

To address this, the authors present a scalable implementation of a second-order preconditioned method, concretely a variant of full-matrix Adagrad. With a series of critical algorithmic and numerical improvements, the method significantly improves both convergence speed and wall-clock training time on modern deep learning models. It also makes full use of the prevalent heterogeneous hardware architecture (a multicore CPU coupled with multiple accelerator units) to improve training efficiency.

### The core contributions of the paper include:

1. **Design and implementation of a pipelined version of the optimization algorithm**: making full use of the heterogeneity and computing power of the coupled CPU-accelerator architecture.
2. **Extension of the Shampoo algorithm**: making it applicable to a wider range of deep architectures, especially those with very large layers (such as embedding layers).
3. **Replacement of expensive spectral decomposition**: using an efficient and numerically stable iterative method to compute roots of positive definite matrices, thereby reducing the computational cost.
4. **Description of challenges and limitations in practical design**: these lessons are valuable for the design of next-generation accelerator hardware architectures.

### The experimental results show that:

- In the machine translation task, the number of training steps for the Transformer model is reduced by 50%, and the overall training time is reduced by 45%.
- In the language modeling task, the number of training steps for the BERT model is reduced by 16%, and a higher Masked LM accuracy is achieved.
- In the click-through rate prediction task, the number of training steps for the DLRM model is reduced by half, the overall training time is reduced by 37.5%, and the AUC is improved by 0.56%.
- In the image classification task, the number of training steps for the ResNet-50 model is reduced by 31.7%, and the overall training time is reduced by 13%.

Overall, by proposing a scalable second-order optimization method, the paper achieves significant performance improvements on multiple large-scale deep learning tasks and provides valuable insights for future hardware and software design.
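To make the second and third contributions more concrete, here is a minimal NumPy sketch of a Shampoo-style update for a single weight matrix, where the inverse fourth roots of the preconditioners come from a coupled Newton iteration instead of an eigendecomposition. This is an illustration under assumptions, not the paper's implementation: the summary above does not name the iterative root method, so the coupled Newton iteration is a choice made here, and the function names `inverse_pth_root` and `shampoo_step`, the `ridge` term, the tolerance, and the learning rate are all hypothetical.

```python
import numpy as np


def inverse_pth_root(A, p, ridge=1e-6, tol=1e-10, max_iters=100):
    """Approximate (A + ridge*I)^(-1/p) for a symmetric PSD matrix A.

    Uses a coupled Newton iteration (an assumed choice of method),
    which needs only matrix multiplications and no eigendecomposition.
    """
    n = A.shape[0]
    I = np.eye(n)
    A = A + ridge * I                      # keep the matrix positive definite
    z = (1 + p) / (2 * np.linalg.norm(A, 2))
    X = I * z ** (1.0 / p)                 # running estimate of A^(-1/p)
    M = z * A                              # invariant: M = X^p A, driven to I
    for _ in range(max_iters):
        if np.max(np.abs(M - I)) < tol:
            break
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M
    return X


def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style step for a matrix parameter W with gradient G.

    L and R accumulate left and right second-moment statistics;
    the gradient is preconditioned as L^(-1/4) G R^(-1/4).
    """
    L = L + G @ G.T
    R = R + G.T @ G
    precond_grad = inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    return W - lr * precond_grad, L, R


if __name__ == "__main__":
    # Toy quadratic objective 0.5 * ||W - target||^2 (hypothetical example).
    target = np.array([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 0.0],
                       [1.0, 0.0, 2.0],
                       [0.0, 1.0, 0.0]])
    W = np.zeros_like(target)
    L, R = np.zeros((4, 4)), np.zeros((3, 3))
    for step in range(5):
        G = W - target                     # gradient of the toy objective
        W, L, R = shampoo_step(W, G, L, R)
        print(step, 0.5 * np.sum((W - target) ** 2))
```

The two preconditioners are only as large as the row and column dimensions of the weight matrix, which is what keeps the memory cost far below that of full-matrix Adagrad. In a pipelined design like the one described in the first contribution, the root computation would be the natural piece to run less frequently and off the accelerator, though that scheduling is not shown in this sketch.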

Scalable Second Order Optimization for Deep Learning

Parallel Stochastic Optimization Framework for Large-Scale Non-Convex Stochastic Problems

A second-order-like optimizer with adaptive gradient scaling for deep learning

A survey of deep learning optimizers -- first and second order methods

Second-order Neural Network Training Using Complex-step Directional Derivative

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Applying Second Order Optimization to Deep Transformers with Parameter-Efficient Tuning

Old Optimizer, New Norm: An Anthology

Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation

Optimization Methods in Deep Learning: A Comprehensive Overview

A Computationally Efficient Sparsified Online Newton Method

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Scalable Nested Optimization for Deep Learning

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods

OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations

Scalable Optimization in the Modular Norm

Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

An Efficient Optimization Technique for Training Deep Neural Networks

Minibatching Offers Improved Generalization Performance for Second Order Optimizers