Abstract: The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique that decreases the number of steps required to reach a given translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the enhancements from several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge updated and enhances the quality of the labels provided by the teacher. During inference, the student is used for translation and no additional computation is added. We verify the effectiveness of DiMS on various models, obtaining improvements of 7.8 and 12.9 BLEU points in single-step translation accuracy on the distilled and raw versions of WMT'14 De-En, respectively.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the loss of computational efficiency that Iterative Non-Autoregressive Transformers (iNATs) suffer in machine translation as the number of decoding steps grows. Although iNATs can improve translation quality through multi-step decoding, each additional decoding step adds computational cost and so erodes their speed advantage over autoregressive models.

To meet this challenge, the authors introduce a new distillation technique, **Distill Multiple Steps (DiMS)**. The goal of DiMS is to preserve the computational efficiency of iNATs by reducing the number of decoding steps required to reach a given translation quality, while retaining the quality gains of multi-step decoding. DiMS uses two models, a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher is updated as an exponential moving average (EMA) of the student, which keeps its knowledge up to date and improves the quality of the labels it provides.

### Main contributions

1. **Reducing decoding steps**: DiMS significantly reduces the number of decoding steps required to reach a given translation quality, thereby improving computational efficiency.
2. **Maintaining translation quality**: Despite the reduction in decoding steps, DiMS maintains or even improves translation quality.
3. **Wide applicability**: DiMS can be applied to a variety of iterative models, including alignment-based and non-alignment-based models.
4. **Experimental verification**: The authors conducted experiments on multiple public datasets to verify the effectiveness of DiMS, in particular its significant improvement in single-step translation.
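The student/teacher mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a single scalar gain, the squared-error loss stands in for whatever distillation loss the paper uses, and every name and hyperparameter (`refine_once`, `decode`, `ema_update`, `train_dims_toy`, `decay=0.9`, `teacher_steps=4`) is a hypothetical choice made for illustration. It only shows the two moving parts: the student matches the teacher's multi-step output in a single step, and the teacher tracks the student through an EMA.

```python
# Toy sketch of a DiMS-style training loop on a scalar "model".
# All names and hyperparameters are illustrative assumptions, not the paper's.

TARGET = 1.0  # stands in for the ideal translation of a sentence


def refine_once(gain, x):
    """One decoding step: move the current draft x part of the way to TARGET."""
    return x + gain * (TARGET - x)


def decode(gain, x, steps):
    """Iterative decoding: feed the model's output back in `steps` times."""
    for _ in range(steps):
        x = refine_once(gain, x)
    return x


def ema_update(teacher_gain, student_gain, decay):
    """Slow-moving average: the teacher drifts toward the student."""
    return decay * teacher_gain + (1.0 - decay) * student_gain


def train_dims_toy(init_gain=0.3, teacher_steps=4, decay=0.9, lr=0.1, iters=300):
    student = init_gain
    teacher = init_gain  # both start from the same pretrained model
    drafts = [0.0, 0.2, 0.5]  # toy initial drafts, cycled through as "data"
    for i in range(iters):
        x = drafts[i % len(drafts)]
        # Label: the teacher's output after several decoding steps.
        # It is treated as a constant (no gradient flows into the teacher).
        label = decode(teacher, x, teacher_steps)
        # Prediction: the student's output after a single step.
        pred = refine_once(student, x)
        # Gradient of (pred - label)^2 with respect to the student's gain.
        grad = 2.0 * (pred - label) * (TARGET - x)
        student -= lr * grad
        # The teacher follows the student via the moving average.
        teacher = ema_update(teacher, student, decay)
    return student, teacher


student, teacher = train_dims_toy()
print("student gain:", student, "teacher gain:", teacher)
print("single-step output from draft 0.0:", refine_once(student, 0.0))
```

In this toy, the student's single step ends up much closer to `TARGET` than the initial model's single step, and only the student is needed at inference time, mirroring the claim that no extra computation is added when translating.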
### Experimental results

- **Improvement in single-step translation performance**: On the WMT’14 and WMT’16 datasets, DiMS significantly increases the BLEU score of single-step translation, with gains of 7.8 BLEU points on the distilled version and 12.9 BLEU points on the raw version of WMT’14 De-En.
- **Comparison with existing methods**: DiMS outperforms many leading NAT models specifically designed for single-step translation.
- **Unsupervised distillation**: DiMS can also be applied in unsupervised settings, using the teacher model to generate synthetic target sentences for distillation, which further improves performance.

### Conclusion

DiMS is an effective distillation algorithm that improves the single-step translation quality of pre-trained iterative models while maintaining computational efficiency. By setting the teacher model to the moving average of the student model, a similar performance improvement can be obtained without significantly increasing training time. The experimental results show that DiMS is effective and versatile in both supervised and unsupervised settings.

### Future directions

1. **Automatic speech recognition**: The same family of iterative models has been applied to automatic speech recognition, so DiMS may also be useful in that field.
2. **Combining multiple techniques**: Multiple techniques introduced for iNATs can be combined to build a powerful iterative model, whose computational efficiency can then be improved by DiMS.
3. **Large-scale monolingual datasets**: U-DiMS distillation can be carried out on large-scale monolingual datasets to further improve model performance.

### Limitations

Although DiMS enables cross-entropy-based models to compete with alignment-based models in some cases, it still lags behind them in others. In addition, DiMS improves the performance of models trained on the original (raw) dataset, but the best performance still requires applying DiMS on the distilled dataset. Therefore, DiMS still depends on autoregressive models to obtain the best translation quality.

DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers for Machine Translation

Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation

Better Simultaneous Translation with Monotonic Knowledge Distillation.

Self-Improvement of Non-autoregressive Model Via Sequence-Level Distillation

Multi-Teacher Distillation With Single Model for Neural Machine Translation

Unraveling Key Factors of Knowledge Distillation

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

Mixed Distillation Helps Smaller Language Models Reason Better

DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation

Online Distilling from Checkpoints for Neural Machine Translation

SFDDM: Single-fold Distillation for Diffusion models

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Improved Distribution Matching Distillation for Fast Image Synthesis

Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching

One Reference Is Not Enough: Diverse Distillation with Reference Selection for Non-Autoregressive Translation

DiM: Distilling Dataset into Generative Model

Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation

Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes