Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in Neural Machine Translation

Tianyu He,Xu Tan,Tao Qin

DOI: https://doi.org/10.48550/arXiv.1908.06259

2019-08-17

Abstract: Neural machine translation (NMT) typically adopts the encoder-decoder framework. A good understanding of the characteristics and functionalities of the encoder and decoder can help to explain the pros and cons of the framework, and design better models for NMT. In this work, we conduct an empirical study on the encoder and the decoder in NMT, taking Transformer as an example. We find that 1) the decoder handles an easier task than the encoder in NMT, 2) the decoder is more sensitive to the input noise than the encoder, and 3) the preceding words/tokens in the decoder provide strong conditional information, which accounts for the two observations above. We hope those observations can shed light on the characteristics of the encoder and decoder and inspire future research on NMT.

Computation and Language

What problem does this paper attempt to address?

This paper aims to characterize how the encoder and the decoder differ in behavior and function in neural machine translation (NMT), using Transformer as the example model. Through empirical study, the authors report three findings:

1. **The decoder handles an easier task than the encoder**:
   - Adding encoder layers improves performance more than adding decoder layers.
   - The decoder converges faster during training, which suggests that its task is comparatively simple.
2. **The decoder is more sensitive to input noise**:
   - When different levels of noise are added to the encoder input and to the decoder input separately, noise in the decoder input causes the larger performance drop.
   - Further analysis shows that the decoder relies on the strong conditional information carried by the preceding words/tokens, which explains this sensitivity.
3. **The role of preceding words/tokens in the decoder**:
   - Masking the preceding words/tokens, and comparing autoregressive NMT with non-autoregressive NMT, shows that the preceding words/tokens provide strong conditional information. This is an important reason why the decoder's task is easier and why the decoder is more sensitive to noise.

These findings clarify the characteristics of the encoder and decoder in the NMT framework and can guide future research and model design.
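The noise and masking probes described above can be illustrated with a minimal sketch. This is not the authors' code: the function names `add_token_noise` and `mask_previous_tokens`, the random-replacement noise scheme, and the `<pad>` placeholder are all assumptions chosen for illustration, and the paper's exact noise types and masking protocol may differ.

```python
import random


def add_token_noise(tokens, vocab, noise_rate, rng=None):
    """Replace each token with a random vocabulary item with probability noise_rate.

    Hypothetical helper: applying it to the source sentence perturbs the
    encoder input; applying it to the shifted target perturbs the decoder input.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for tok in tokens:
        if rng.random() < noise_rate:
            noisy.append(rng.choice(vocab))  # corrupted position
        else:
            noisy.append(tok)  # position left intact
    return noisy


def mask_previous_tokens(prev_tokens, keep_last, pad="<pad>"):
    """Keep only the nearest keep_last preceding tokens; blank out the rest.

    Hypothetical helper: it mimics withholding part of the conditional
    information that an autoregressive decoder gets from earlier outputs.
    """
    cut = min(max(len(prev_tokens) - keep_last, 0), len(prev_tokens))
    return [pad] * cut + list(prev_tokens[cut:])


if __name__ == "__main__":
    rng = random.Random(0)
    source = "wir gehen nach hause".split()
    target_prefix = "<s> we are going".split()
    vocab = ["foo", "bar", "baz"]  # toy vocabulary, assumption
    print(add_token_noise(source, vocab, 0.3, rng))
    print(mask_previous_tokens(target_prefix, keep_last=1))
```

In an experiment of this kind, one would translate with the perturbed inputs and compare the quality drop for encoder-side versus decoder-side noise at the same `noise_rate`.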


What Works and Doesn’t Work, A Deep Decoder for Neural Machine Translation

Multi-channel Encoder for Neural Machine Translation

Exploiting Reverse Target-Side Contexts for Neural Machine Translation Via Asynchronous Bidirectional Decoding

Explicitly Modeling Word Translations in Neural Machine Translation

Dense Information Flow for Neural Machine Translation

Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation

Lattice-Based Transformer Encoder for Neural Machine Translation

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

Is Encoder-Decoder Redundant for Neural Machine Translation?

Bi-Decoder Augmented Network for Neural Machine Translation

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Layer-Wise Coordination Between Encoder and Decoder for Neural Machine Translation

Character-Aware Decoder for Translation into Morphologically Rich Languages

Improving Neural Machine Translation Model with Deep Encoding Information

Accelerating Transformer for Neural Machine Translation

An Efficient Character-Level Neural Machine Translation

Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input

A Character-Aware Encoder for Neural Machine Translation

Neural Machine Translation with Word Predictions

Parallelizing and Optimizing Neural Encoder–Decoder Models Without Padding on Multi-Core Architecture