On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation

Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, Baobao Chang
2023-06-02
Abstract: While multilingual neural machine translation has achieved great success, it suffers from the off-target issue, where the translation is in the wrong language. This problem is more pronounced on zero-shot translation tasks. In this work, we find that failing in encoding discriminative target language signal will lead to off-target and a closer lexical distance (i.e., KL-divergence) between two languages' vocabularies is related with a higher off-target rate. We also find that solely isolating the vocab of different languages in the decoder can alleviate the problem. Motivated by the findings, we propose Language Aware Vocabulary Sharing (LAVS), a simple and effective algorithm to construct the multilingual vocabulary, that greatly alleviates the off-target problem of the translation model by increasing the KL-divergence between languages. We conduct experiments on a multilingual machine translation benchmark in 11 languages. Experiments show that the off-target rate for 90 translation tasks is reduced from 29% to 8%, while the overall BLEU score is improved by an average of 1.9 points without extra training cost or sacrificing the supervised directions' performance. We release the code at https://github.com/PKUnlp-icler/Off-Target-MNMT for reproduction.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the off-target problem in multilingual neural machine translation (MNMT), especially in zero-shot translation. The off-target problem occurs when the model produces output in a language other than the specified target language, and it is most severe in zero-shot directions. For example, when no parallel training data exists for a given language pair, the model may translate the source sentence into some non-target language instead of the requested one.

### Main Findings and Solutions

1. **Failure to Encode the Target-Language Signal Leads to Off-Target**:
   - The paper finds that if the model fails to encode a discriminative target-language signal in its hidden representations, off-target translation follows. t-SNE visualizations of the encoder output show that, for high-resource language pairs, the representations are often mixed together, which makes it difficult for the decoder to pick out the target-language signal.
2. **Lexical Similarity Is Related to the Off-Target Rate**:
   - The analysis shows that the lexical distance between the target language and the source language (measured by KL divergence) is negatively correlated with the off-target rate. That is, the smaller the KL divergence (the higher the lexical similarity), the higher the off-target rate. High lexical similarity means more shared vocabulary, which increases the risk that the model confuses the target language.
3. **Shared Vocabulary in the Decoder May Lead to Bias in Zero-Shot Translation Directions**:
   - Shared vocabulary makes it difficult for the decoder to identify the target language directly when generating output, resulting in representation degradation and off-target translation.
4. **Separating Vocabularies of Different Languages Is Effective but Costly**:
   - Completely separating the vocabularies of different languages significantly reduces the off-target rate, but it greatly increases the number of model parameters, from 308M to 515M.

### Proposed Method: Language-Aware Vocabulary Sharing (LAVS)

To address the off-target problem without increasing the number of model parameters, the paper proposes a new algorithm, Language-Aware Vocabulary Sharing (LAVS). The method splits some of the shared vocabulary into language-specific vocabulary, which increases the KL divergence between the vocabulary distributions of different languages and thereby improves the model's ability to discriminate between them.

### Experimental Results

- **Significant Improvement in Zero-Shot Translation Performance**:
  - With LAVS, the average off-target rate over 90 zero-shot translation directions is reduced from 29% to 8%, and the BLEU score increases by an average of 1.9 points, without additional training cost or loss of performance in the supervised directions.
- **Further Performance Improvement by Combining Back-Translation**:
  - When combined with back-translation, LAVS further improves BLEU in the zero-shot directions, reaching an average of 16.8 points, and the off-target rate is reduced to 0%.

### Summary

Through an in-depth analysis of the off-target problem in multilingual neural machine translation, this paper proposes Language-Aware Vocabulary Sharing (LAVS). The method greatly alleviates the off-target problem in zero-shot translation and improves translation quality, while keeping the model size unchanged and preserving performance in the supervised directions.
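
The link between vocabulary splitting and KL divergence can be illustrated with a small Python sketch. This is not the authors' implementation: the toy corpora, the `@lang` suffix convention, the additive smoothing constant, and the hand-picked set of tokens to split are all assumptions made for illustration, and the summary above does not describe how LAVS actually selects tokens. The sketch only shows that turning shared tokens into language-specific ones increases the KL divergence between two languages' token distributions.

```python
import math
from collections import Counter


def token_distribution(corpus, vocab, eps=1e-6):
    """Smoothed unigram distribution of a tokenized corpus over a fixed vocab.

    `eps` is an assumed additive-smoothing constant that keeps every
    probability non-zero so the KL divergence stays finite.
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts[t] for t in vocab) + eps * len(vocab)
    return {t: (counts[t] + eps) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def pairwise_kl(corpora, lang_a, lang_b):
    """KL divergence between the token distributions of two languages."""
    vocab = sorted({tok for c in corpora.values() for sent in c for tok in sent})
    p = token_distribution(corpora[lang_a], vocab)
    q = token_distribution(corpora[lang_b], vocab)
    return kl_divergence(p, q)


def split_shared_tokens(corpora, tokens_to_split):
    """Rename the chosen tokens per language, e.g. 'la' -> 'la@fr'.

    Hypothetical stand-in for the splitting step: which tokens to split
    is decided by the caller, not by any criterion from the paper.
    """
    return {
        lang: [
            [f"{tok}@{lang}" if tok in tokens_to_split else tok for tok in sent]
            for sent in corpus
        ]
        for lang, corpus in corpora.items()
    }


# Toy, hand-written tokenized corpora (illustrative only).
corpora = {
    "es": [["la", "casa", "grande"], ["la", "mesa"]],
    "fr": [["la", "maison", "grande"], ["la", "table"]],
}

kl_before = pairwise_kl(corpora, "es", "fr")
split_corpora = split_shared_tokens(corpora, {"la", "grande"})
kl_after = pairwise_kl(split_corpora, "es", "fr")

print(f"KL(es || fr) with shared tokens: {kl_before:.2f}")
print(f"KL(es || fr) after splitting:    {kl_after:.2f}")
```

In this toy setting the two shared tokens contribute almost nothing to the divergence before the split, because both languages use them with the same frequency. After the split each copy occurs in only one language, so the divergence grows. Each split token adds one vocabulary entry per language, so splitting only a subset of the shared tokens keeps the vocabulary from growing as much as full separation would.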