Applying Cyclical Learning Rate to Neural Machine Translation

Choon Meng Lee, Jianfeng Liu, Wei Peng
DOI: https://doi.org/10.48550/arXiv.2004.02401
2020-04-06
Abstract: In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though it is crucial in ensuring a fast convergence to a good quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policy for computer vision related convolutional networks and datasets, we explore how cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizers and the associated cyclical learning rate policy can have a significant impact on the performance. In addition, we establish guidelines when applying cyclical learning rates to neural machine translation tasks. Thus with our work, we hope to raise awareness of the importance of selecting the right optimizers and the accompanying learning rate policy, at the same time, encourage further research into easy-to-use learning rate policies.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is how to select an appropriate optimizer and learning rate policy to improve model performance in neural machine translation (NMT). Specifically, the authors study the effect of applying a cyclical learning rate (CLR) policy to the training of Transformer-based neural networks.

### Problem Background

In training deep learning networks, the optimizer and its learning rate settings often receive little attention or only minimal tuning. Yet an appropriate learning rate is crucial for ensuring that training converges quickly to a good-quality minimum of the loss function that also generalizes well to the test dataset. Although CLR has been applied successfully in computer vision, its application to NMT had not previously been explored.

### Research Motivation

1. **Importance of Optimizers and Learning Rates**: A learning rate that is too small leads to overly long training, and one that is too large causes training to diverge. Finding an appropriate learning rate is therefore crucial for model performance.
2. **Limitations of Existing Research**: Most research on optimizers and learning rates focuses on computer vision (CV), and NMT lacks a similarly in-depth exploration. Because Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures differ significantly, results from CV do not necessarily carry over to NMT.
3. **Raising Community Awareness**: By experimentally evaluating CLR in NMT, the authors hope to draw NMT researchers' attention to the importance of choosing the optimizer and learning rate policy, and to encourage further research into easy-to-use learning rate policies.
### Main Contributions

- **Performance Improvement**: The experiments show that the choice of optimizer and the associated cyclical learning rate policy can have a significant impact on NMT model performance.
- **First Application of CLR**: This is the first study to apply CLR to NMT tasks.
- **Providing Guidance**: The paper establishes guidelines for applying CLR to NMT tasks.

Through these contributions, the authors aim to draw more attention to the selection of optimizers and learning rate policies in NMT and to support the development of more efficient and effective training methods.
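
### Illustrative Sketch of a Cyclical Learning Rate

The summary above does not say which cyclical policy or hyperparameters the authors used. For readers unfamiliar with the idea, the following is a minimal sketch of the triangular policy commonly associated with CLR in the computer vision literature, in which the learning rate rises linearly from a lower bound to an upper bound and back again over each cycle. The function name `triangular_clr`, its parameter names, and every numeric value are assumptions chosen for illustration, not settings taken from the paper.

```python
import math


def triangular_clr(step: int, base_lr: float, max_lr: float, step_size: int) -> float:
    """Learning rate at a training step under a triangular cyclical policy.

    One full cycle spans 2 * step_size steps: the rate climbs linearly from
    base_lr to max_lr over step_size steps, then descends back to base_lr.
    """
    # Index of the current cycle (1-based).
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # Hypothetical bounds and cycle length, for illustration only.
    for step in (0, 1000, 2000, 3000, 4000):
        lr = triangular_clr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000)
        print(f"step {step:5d}: lr = {lr:.6f}")
```

In practice the lower and upper bounds are usually chosen empirically for the model and optimizer at hand, which is the kind of choice the paper's guidelines are meant to inform.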