Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning

Ashish Sunil Agrawal, Barah Fazili, Preethi Jyothi
2024-02-03
Abstract: Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language understanding consist of parallel versions of English evaluation sets in multiple target languages created with the help of professional translators. When creating such parallel data, it is critical to ensure high-quality translations for all target languages for an accurate characterization of cross-lingual transfer. In this work, we find that translation inconsistencies do exist and interestingly they disproportionally impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text across multiple target languages; relatively large gaps are indicative of translation errors. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances in these two languages and finding poor agreement with the original English labels these instances were supposed to inherit.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper examines translation errors in cross-lingual learning benchmarks, particularly for low-resource languages. Specifically, the authors found:

1. **Translation Inconsistency**: Multilingual benchmarks such as XNLI contain inconsistencies introduced when the English data is translated into the target languages, and these disproportionately affect low-resource languages (such as Hindi and Urdu).
2. **Performance Gap**: In zero-shot evaluation, the gap in performance between human-translated and machine-translated test text is significantly larger for low-resource languages than for high-resource languages.
3. **Label Inconsistency**: Re-annotating human-translated data for Hindi and Urdu revealed significant discrepancies between the new labels and the original English labels.

### Main Contributions

1. **Identifying Low-Quality Translations**: A practical method is proposed for identifying low-quality human translations by comparing model performance on human-translated and machine-translated text.
2. **Persistence Across Different Training/Testing Settings**: Translation errors were found to persist across various training/testing settings, including training on machine-translated data and on paraphrases generated by back-translation.
3. **Manual Annotation Verification**: A portion of the natural language inference (NLI) data for Hindi and Urdu was manually annotated, revealing significant differences between the newly annotated labels and the labels projected from the original English sentences.

### Experimental Setup

- **Tasks and Models**: The focus is primarily on the XNLI benchmark, a three-way classification task that asks whether a premise entails, contradicts, or is neutral with respect to a hypothesis.
- **Training Variants**:
  - **ORIG**: Original English training data.
  - **Backtranslated-train (B-TRAIN)**: English paraphrases generated through back-translation, using Spanish as the intermediary language.
- **Testing Variants**:
  - **Zero-shot (ZS)**: Human-translated development/test sets in the target language.
  - **Translate-test (TT)**: Target-language development/test sets machine-translated into English.
  - **Translate-from-English (TE)**: Machine translations from original English to the target language.
  - **Backtranslation-via-target (BT)**: Machine translations from original English to the target language and back to English.

### Results

- **Using Original English Training Set**: Table 1 shows the XNLI accuracy using original English training data across different testing variants. The performance gap for low-resource languages (such as Urdu and Swahili) is significantly larger than for high-resource languages (such as French and Spanish).
- **Using Translated Training Set**: Table 2 shows the test accuracy of the XLMR model trained with B-TRAIN across different testing variants. Although overall performance improved, the performance gap for low-resource languages remains substantial.

### Ethical Statement

The authors emphasize adherence to ethical practices throughout the research, ensuring fair compensation for human annotators and compliance with Google Translate's terms and conditions.

### Limitations

- For tasks whose output labels are tied directly to the input text (such as part-of-speech tagging or question answering), applying the proposed technique would be more complex, because translation may alter word order and thereby affect the output labels.
- The paper does not provide specific methods to resolve translation errors but calls for additional checks when collecting translations for low-resource languages.
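
### Illustrative Sketch

The two checks summarized above (comparing accuracy on human-translated versus machine-translated test text, and measuring how often re-annotated labels match the original English labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the sign convention (machine minus human), the `threshold` parameter, and all numbers are hypothetical placeholders rather than results from the paper.

```python
from typing import Dict, List, Sequence


def translation_gaps(
    human_acc: Dict[str, float],
    machine_acc: Dict[str, float],
) -> Dict[str, float]:
    """Per-language gap between accuracy on machine-translated and
    human-translated test text (assumed convention: machine minus human)."""
    return {
        lang: machine_acc[lang] - human_acc[lang]
        for lang in human_acc
        if lang in machine_acc
    }


def flag_languages(gaps: Dict[str, float], threshold: float) -> List[str]:
    """Languages whose gap exceeds a hypothetical threshold, largest first."""
    flagged = [lang for lang, gap in gaps.items() if gap > threshold]
    return sorted(flagged, key=lambda lang: gaps[lang], reverse=True)


def label_agreement(original: Sequence[str], reannotated: Sequence[str]) -> float:
    """Fraction of instances whose re-annotated label matches the
    label inherited from the original English instance."""
    if len(original) != len(reannotated) or not original:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(o == r for o, r in zip(original, reannotated))
    return matches / len(original)


# Placeholder accuracies in percent (NOT figures from the paper).
human = {"fr": 80.0, "ur": 65.0}
machine = {"fr": 81.0, "ur": 72.0}

gaps = translation_gaps(human, machine)
print(gaps)                        # {'fr': 1.0, 'ur': 7.0}
print(flag_languages(gaps, 3.0))   # ['ur']

# Placeholder labels for four instances.
english_labels = ["entailment", "neutral", "contradiction", "neutral"]
new_labels = ["entailment", "contradiction", "contradiction", "neutral"]
print(label_agreement(english_labels, new_labels))  # 0.75
```

A fixed threshold is only one way to surface outliers. Ranking languages by gap and inspecting the largest ones serves the same purpose without committing to a cutoff.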