Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in Neural Machine Translation

Tianyu He, Xu Tan, Tao Qin
DOI: https://doi.org/10.48550/arXiv.1908.06259
2019-08-17
Abstract: Neural machine translation (NMT) typically adopts the encoder-decoder framework. A good understanding of the characteristics and functionalities of the encoder and decoder can help to explain the pros and cons of the framework, and design better models for NMT. In this work, we conduct an empirical study on the encoder and the decoder in NMT, taking Transformer as an example. We find that 1) the decoder handles an easier task than the encoder in NMT, 2) the decoder is more sensitive to the input noise than the encoder, and 3) the preceding words/tokens in the decoder provide strong conditional information, which accounts for the two observations above. We hope those observations can shed light on the characteristics of the encoder and decoder and inspire future research on NMT.
Computation and Language
What problem does this paper attempt to address?
This paper aims to characterize the distinct roles of the encoder and the decoder in neural machine translation (NMT). Through an empirical study, the authors report three findings:

1. **The decoder handles an easier task than the encoder**:
   - Adding encoder layers improves performance more than adding decoder layers.
   - The decoder converges faster during training, which suggests that its task is comparatively easy.
2. **The decoder is more sensitive to input noise**:
   - When different levels of noise are added to the encoder input and to the decoder input separately, noise in the decoder input causes the larger performance degradation.
   - Further analysis shows that the decoder relies on the strong conditional information provided by the preceding words/tokens, which explains this sensitivity.
3. **The role of preceding words/tokens in the decoder**:
   - By masking the preceding words/tokens and by comparing autoregressive NMT with non-autoregressive NMT, the authors find that the preceding words/tokens provide strong conditional information. This is a key reason why the decoder's task is both easier and more sensitive to noise.

These findings clarify how the encoder and decoder behave within the NMT framework and may guide future research and model design.
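The noise experiment described above can be made concrete with a small sketch. The summary does not say which kinds of noise were used, so the two corruptions below (replacing tokens with a placeholder and swapping adjacent tokens) are illustrative assumptions, as are all function names, the `<unk>` placeholder, and the `side` parameter. The point is only to show how the same corruption can be applied to one side at a time, so that the resulting quality drop can be compared between encoder input and decoder input.

```python
import random


def drop_tokens(tokens, rate, rng, unk="<unk>"):
    """Replace each token with `unk` independently with probability `rate`.

    Hypothetical noise type; the placeholder string is an assumption.
    """
    return [unk if rng.random() < rate else tok for tok in tokens]


def swap_tokens(tokens, rate, rng):
    """Swap a token with its right neighbour with probability `rate`.

    Hypothetical noise type; a swapped pair is not swapped again.
    """
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if rng.random() < rate:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def noisy_pair(src, tgt_prefix, rate, side, rng, noise=drop_tokens):
    """Corrupt only one side of a (source, target-prefix) pair.

    side="encoder" corrupts the source sentence fed to the encoder;
    side="decoder" corrupts the preceding target tokens fed to the decoder.
    """
    if side == "encoder":
        return noise(src, rate, rng), list(tgt_prefix)
    if side == "decoder":
        return list(src), noise(tgt_prefix, rate, rng)
    raise ValueError("side must be 'encoder' or 'decoder'")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    src = "wir haben das Modell trainiert".split()
    tgt_prefix = "<s> we trained the".split()
    for side in ("encoder", "decoder"):
        print(side, noisy_pair(src, tgt_prefix, 0.3, side, rng))
```

In an actual evaluation, each corrupted pair would be passed to a trained model and the translation quality compared across `side` values at the same `rate`; that step depends on the model and toolkit and is left out here.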