Abstract: While multilingual neural machine translation has achieved great success, it suffers from the off-target issue, where the translation is in the wrong language. This problem is more pronounced on zero-shot translation tasks. In this work, we find that failing to encode a discriminative target language signal leads to off-target translation, and that a closer lexical distance (i.e., KL-divergence) between two languages' vocabularies is related to a higher off-target rate. We also find that solely isolating the vocabularies of different languages in the decoder can alleviate the problem. Motivated by these findings, we propose Language Aware Vocabulary Sharing (LAVS), a simple and effective algorithm for constructing the multilingual vocabulary that greatly alleviates the off-target problem of the translation model by increasing the KL-divergence between languages. We conduct experiments on a multilingual machine translation benchmark in 11 languages. Experiments show that the off-target rate for 90 translation tasks is reduced from 29% to 8%, while the overall BLEU score is improved by an average of 1.9 points without extra training cost or sacrificing the supervised directions' performance. We release the code at https://github.com/PKUnlp-icler/Off-Target-MNMT for reproduction.
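
The lexical distance mentioned in the abstract can be illustrated with a small sketch. It estimates a smoothed unigram distribution over a shared vocabulary for each language and computes the KL-divergence between two such distributions. The whitespace tokenization, the smoothing constant, the function names, and the toy corpora are all assumptions made for illustration; the abstract does not specify how the authors estimate these distributions.

```python
import math
from collections import Counter


def token_distribution(sentences, vocab, smoothing=1e-6):
    """Smoothed unigram distribution over a shared vocabulary.

    `smoothing` is an illustrative choice that keeps every probability
    non-zero so the KL-divergence stays finite.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts[t] for t in vocab) + smoothing * len(vocab)
    return {t: (counts[t] + smoothing) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


# Toy "languages": lang_b shares many tokens with lang_a, lang_c shares none.
lang_a = ["the cat sat on the mat"]
lang_b = ["the cat sat in the hat"]
lang_c = ["le chat dort sur le lit"]

vocab = sorted({tok for corpus in (lang_a, lang_b, lang_c)
                for s in corpus for tok in s.split()})

dist_a = token_distribution(lang_a, vocab)
dist_b = token_distribution(lang_b, vocab)
dist_c = token_distribution(lang_c, vocab)

kl_ab = kl_divergence(dist_a, dist_b)  # lexically close pair
kl_ac = kl_divergence(dist_a, dist_c)  # lexically distant pair
print(f"KL(a || b) = {kl_ab:.3f}")
print(f"KL(a || c) = {kl_ac:.3f}")
```

In this toy setting the pair that shares more tokens has the smaller divergence, which is the regime the abstract associates with a higher off-target rate.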

What problem does this paper attempt to address?

This paper addresses the off-target problem in multilingual neural machine translation (MNMT), especially in zero-shot translation. Off-target translation means the output is in a language other than the specified target language, and it is particularly severe in zero-shot directions. For example, when no parallel training data exists for a given language pair, the model may translate the source sentence into some other language instead of the requested one.

### Main Findings and Solutions

1. **Failure to Encode the Target-Language Signal Leads to Off-Target**:
   - If the model fails to encode a discriminative target-language signal in its hidden representations, off-target translation follows. A t-SNE visualization of the encoder output shows that, for high-resource language pairs, the representations are often chaotic and mixed, which makes it difficult for the decoder to pick out the target-language signal.
2. **Lexical Similarity Is Related to the Off-Target Rate**:
   - The KL divergence between the vocabulary distributions of the source and target languages is negatively correlated with the off-target rate. That is, the lower the KL divergence (the higher the lexical similarity), the higher the off-target rate. High lexical similarity means more shared vocabulary, which increases the risk of the model confusing the target language.
3. **Shared Vocabulary in the Decoder May Lead to Bias in Zero-Shot Translation Directions**:
   - Shared vocabulary makes it difficult for the decoder to identify the target language directly when generating output, resulting in representation degradation and off-target translation.
4. **Separating Vocabularies of Different Languages Is Effective but Costly**:
   - Completely separating the vocabularies of different languages significantly reduces the off-target rate, but it greatly increases the number of model parameters, from 308M to 515M.

### Proposed Method: Language-Aware Vocabulary Sharing (LAVS)

To reduce off-target translation without the parameter growth of fully separate vocabularies, the paper proposes Language-Aware Vocabulary Sharing (LAVS). The method splits some of the shared vocabulary into language-specific vocabulary, which increases the KL divergence between the vocabulary distributions of different languages and makes the target language easier for the model to discriminate.

### Experimental Results

- **Significant Improvement in Zero-Shot Translation Performance**:
  - With LAVS, the average off-target rate over 90 zero-shot translation directions drops from 29% to 8%, and the BLEU score increases by an average of 1.9 points, without additional training cost or loss of performance in the supervised directions.
- **Further Performance Improvement by Combining Back-Translation**:
  - Combined with back-translation, LAVS further improves the BLEU score in the zero-shot directions to an average of 16.8 points, and the off-target rate falls to 0%.

### Summary

Through an in-depth analysis of the off-target problem in multilingual neural machine translation, the paper proposes Language-Aware Vocabulary Sharing (LAVS), which substantially reduces off-target translation in zero-shot directions and improves translation quality while preserving model efficiency and the performance of the supervised directions.
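
The idea of splitting shared tokens into language-specific ones can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the ranking score (token mass outside the dominant language), the `token@lang` naming scheme, the function names, and the toy corpora are all hypothetical. The released repository is the place to look for the actual procedure.

```python
from collections import Counter


def count_tokens(corpora):
    """Per-language token counts from whitespace-tokenized sentences."""
    return {lang: Counter(tok for s in sents for tok in s.split())
            for lang, sents in corpora.items()}


def pick_tokens_to_split(counts, k):
    """Pick the k tokens most heavily shared across languages.

    Assumed score: total count minus the count in the dominant language,
    i.e. how much of the token's mass sits in other languages.
    """
    scores = {}
    for tok in set().union(*counts.values()):
        per_lang = [c[tok] for c in counts.values() if c[tok] > 0]
        if len(per_lang) >= 2:  # only tokens used by at least two languages
            scores[tok] = sum(per_lang) - max(per_lang)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def apply_split(corpora, split_tokens):
    """Rewrite selected tokens as language-specific variants (token@lang)."""
    return {
        lang: [" ".join(f"{tok}@{lang}" if tok in split_tokens else tok
                        for tok in s.split())
               for s in sents]
        for lang, sents in corpora.items()
    }


# Toy corpora, invented for illustration only.
corpora = {
    "en": ["the hotel is open", "a hotel in the city"],
    "de": ["das hotel ist offen", "ein hotel in der stadt"],
    "fr": ["le hotel est ouvert"],
}

counts = count_tokens(corpora)
to_split = pick_tokens_to_split(counts, k=1)
split_corpora = apply_split(corpora, to_split)
print(to_split)
print(split_corpora["de"])
```

After the split, each selected token appears in only one language's data, so the per-language vocabulary distributions overlap less. The budget `k` controls how many new embeddings are added, which is the trade-off against fully separate vocabularies described above.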

On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation
