Gradient Descent: The Ultimate Optimizer

Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, Erik Meijer
DOI: https://doi.org/10.48550/arXiv.1909.13371
2022-10-15
Abstract: Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time. We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this algorithm (see http://people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to adjust an optimizer's hyperparameters (such as the step size and momentum coefficient) automatically during gradient-descent training, so that less manual tuning is needed. Specifically, it computes "hypergradients" through automatic differentiation, so the hyperparameters can be updated at every iteration without manually deriving expressions for them. The method applies not only to the step size but also to other hyperparameters, and it can be applied recursively to higher-order hyperparameters. The result is a stack of optimizers that becomes less sensitive to the initial choice of hyperparameters as it grows taller.

### Main contributions of the paper:

1. **Automated hyperparameter adjustment**: Hypergradients are computed by automatic differentiation, which removes the tedious and error-prone step of deriving their expressions by hand.
2. **High extensibility**: The method applies to the step size and to other hyperparameters (such as the momentum coefficient), and it can be applied recursively to higher-order hyperparameters.
3. **Improved robustness**: As the optimizer stack grows taller, the optimization process becomes less sensitive to the initial choice of hyperparameters, reducing the dependence on manual tuning.
4. **Verification in practical applications**: Experiments on several neural network architectures (MLPs, CNNs, and RNNs) validate the method.
### Specific implementation:

- **Standard gradient descent**:
  \[ w_{i+1} = w_i - \alpha \frac{\partial f(w_i)}{\partial w_i} \]
- **Hyperparameter update** (the step size \(\alpha\) is itself updated by gradient descent with its own step size \(\kappa\), and the new value is then used to update the weights):
  \[ \alpha_{i+1} = \alpha_i - \kappa \frac{\partial f(w_i)}{\partial \alpha_i} \]
  \[ w_{i+1} = w_i - \alpha_{i+1} \frac{\partial f(w_i)}{\partial w_i} \]

### Experimental results:

- **SGD on the MNIST dataset**: Hyper-optimized SGD outperforms ordinary SGD, and it still does so when a different optimizer (such as Adam) is used to adjust the step size of SGD.
- **ResNet on the CIFAR-10 dataset**: The hyperoptimizer matches or exceeds the performance of the ordinary optimizer across different initial hyperparameter settings, and it can even learn a decay schedule similar to hand-designed learning rate decay strategies.
- **RNN on the Tolstoy dataset**: The hyperoptimizer accelerates convergence across different initial learning rates, and it improves performance when the initial learning rate is too high or too low.

### Conclusion:

The proposed method adjusts hyperparameters automatically through automatic differentiation, which improves the efficiency and robustness of model training and reduces the dependence on manual tuning. The experiments show that it performs well on several neural network architectures.
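
To make the update rules listed under "Specific implementation" concrete, the following is a minimal sketch in plain Python. It is not the authors' PyTorch implementation. It assumes a toy scalar objective \(f(w) = \tfrac{1}{2}(w - 3)^2\), and the function names `hyper_sgd` and `plain_sgd` are hypothetical. Because \(w_i = w_{i-1} - \alpha_i \, \partial f(w_{i-1}) / \partial w_{i-1}\), the chain rule gives \(\partial f(w_i) / \partial \alpha_i = -\,\partial f(w_i)/\partial w_i \cdot \partial f(w_{i-1})/\partial w_{i-1}\). The sketch writes this hypergradient out by hand for the scalar case, which is exactly the manual step that the paper replaces with automatic differentiation.

```python
def f(w):
    """Toy objective (an assumption for illustration), minimized at w = 3."""
    return 0.5 * (w - 3.0) ** 2


def grad_f(w):
    """Derivative of the toy objective with respect to w."""
    return w - 3.0


def hyper_sgd(w0, alpha0, kappa, steps):
    """SGD whose step size alpha is itself updated by gradient descent.

    kappa is the step size of the hyperparameter update.
    """
    w, alpha = w0, alpha0
    prev_grad = 0.0  # no earlier step exists, so the first hypergradient is 0
    for _ in range(steps):
        g = grad_f(w)
        # Hand-derived hypergradient for this scalar case:
        # df(w_i)/dalpha_i = -grad_f(w_i) * grad_f(w_{i-1}).
        hypergrad = -g * prev_grad
        alpha = alpha - kappa * hypergrad  # alpha_{i+1}
        w = w - alpha * g                  # w_{i+1} uses the updated alpha
        prev_grad = g
    return w, alpha


def plain_sgd(w0, alpha, steps):
    """Ordinary gradient descent with a fixed step size, for comparison."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_f(w)
    return w


if __name__ == "__main__":
    # Deliberately small initial step size for both runs.
    w_hyper, alpha_final = hyper_sgd(w0=0.0, alpha0=0.01, kappa=0.01, steps=50)
    w_plain = plain_sgd(w0=0.0, alpha=0.01, steps=50)
    print("hyper-optimized SGD:", w_hyper, "final alpha:", alpha_final)
    print("plain SGD:          ", w_plain)
```

In this toy setting the step size grows away from its deliberately small initial value, so the hyper-optimized run ends closer to the minimum than the fixed-step run for the same number of iterations. Applying the same rule to \(\kappa\) would add one more level to the stack.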