Abstract: In training deep learning networks, the optimizer and its learning rate are often chosen without much thought or with minimal tuning, even though they are crucial for ensuring fast convergence to a good-quality minimum of the loss function that also generalizes well to the test dataset. Drawing inspiration from the successful application of cyclical learning rate policies to convolutional networks and datasets in computer vision, we explore how a cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizer and the associated cyclical learning rate policy can have a significant impact on performance. In addition, we establish guidelines for applying cyclical learning rates to neural machine translation tasks. With this work, we hope to raise awareness of the importance of selecting the right optimizer and the accompanying learning rate policy and, at the same time, to encourage further research into easy-to-use learning rate policies.

What problem does this paper attempt to address?

The main problem this paper addresses is how to select appropriate optimizers and learning rate strategies to improve model performance in Neural Machine Translation (NMT). Specifically, the authors explore the effect of applying the Cyclical Learning Rate (CLR) strategy to the training of Transformer-based neural networks.

### Problem Background

In the training of deep learning networks, the optimizer and its learning rate settings often receive little attention or only minimal tuning. However, choosing an appropriate learning rate is crucial for ensuring that training converges quickly to a high-quality minimum of the loss function that also generalizes well to the test dataset. Although CLR has been applied successfully in computer vision, this paper is the first to explore it for NMT.

### Research Motivation

1. **Importance of Optimizers and Learning Rates**: An inappropriate learning rate leads either to overly long training (if the learning rate is too small) or to divergent training (if it is too large). Finding an appropriate learning rate is therefore crucial for improving model performance.
2. **Limitations of Existing Research**: Most research on optimizers and learning rates focuses on computer vision, and NMT lacks a similarly in-depth exploration. Because Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformer-based architectures differ significantly, results from computer vision do not necessarily carry over to NMT.
3. **Raising Community Awareness**: By experimentally verifying the effectiveness of CLR in NMT, the authors hope to draw the attention of NMT researchers to the importance of choosing optimizers and learning rate strategies, and to encourage further research on easy-to-use adaptive learning rate strategies.

### Main Contributions

- **Performance Improvement**: The experiments show that an appropriate choice of optimizer and learning rate strategy can significantly improve the performance of NMT models.
- **First Application of CLR**: This is the first study to apply CLR to NMT tasks.
- **Providing Guidance**: The paper provides specific guidelines on how to use CLR in NMT tasks.

Through these efforts, the authors aim to bring more attention to the selection of optimizers and learning rate strategies in NMT and to facilitate the development of more efficient and effective training methods.
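
To make the idea of a cyclical learning rate concrete, here is a minimal sketch of the triangular policy from the original CLR proposal for computer vision, in which the learning rate rises linearly from a lower bound to an upper bound and back again over each cycle. The function name `triangular_clr` and the values chosen for `base_lr`, `max_lr`, and `step_size` are illustrative assumptions, not settings taken from the paper, and the paper's experiments may use a different CLR variant.

```python
import math


def triangular_clr(iteration: int, base_lr: float, max_lr: float, step_size: int) -> float:
    """Learning rate at a given training iteration under a triangular cyclical policy.

    step_size is the number of iterations in half a cycle (hypothetical value below).
    """
    # Index of the current cycle, starting at 1; one full cycle spans 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Normalized distance from the peak of the current cycle: 1 at the edges, 0 at the peak.
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Linear interpolation between the lower and upper bounds.
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Illustrative bounds only; in practice they are usually chosen with a learning rate range test.
schedule = [triangular_clr(i, base_lr=1e-4, max_lr=1e-3, step_size=4) for i in range(17)]
```

With these assumed values, the schedule starts at `base_lr`, peaks at `max_lr` after 4 iterations, returns to `base_lr` after 8, and then repeats. A learning rate that is too small for the whole run (slow training) or too large for the whole run (divergence) is avoided by sweeping between the two bounds.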

Applying Cyclical Learning Rate to Neural Machine Translation

An optimization Strategy for Deep Neural Networks Training

Continual Learning for Neural Machine Translation

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Cyclical Log Annealing as a Learning Rate Scheduler

Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning

Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models

Deep Learning-Based English-Chinese Translation Research

Reciprocal Supervised Learning Improves Neural Machine Translation

Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation

Reinforcement Learning based Curriculum Optimization for Neural Machine Translation

Learning to Refine Source Representations for Neural Machine Translation

Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in Transformers

Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters?

General Cyclical Training of Neural Networks

Training Deeper Neural Machine Translation Models with Transparent Attention

Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation

Self-Guided Curriculum Learning for Neural Machine Translation

Meta-Curriculum Learning for Domain Adaptation in Neural Machine Translation

Curriculum Recommendations Using Transformer Base Model with InfoNCE Loss And Language Switching Method