Abstract: Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language understanding consist of parallel versions of English evaluation sets in multiple target languages created with the help of professional translators. When creating such parallel data, it is critical to ensure high-quality translations for all target languages for an accurate characterization of cross-lingual transfer. In this work, we find that translation inconsistencies do exist and interestingly they disproportionally impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text across multiple target languages; relatively large gaps are indicative of translation errors. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances in these two languages and finding poor agreement with the original English labels these instances were supposed to inherit.
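
The reannotation check mentioned at the end of the abstract compares freshly assigned labels against the labels inherited from English. The following is a minimal sketch of how such agreement could be quantified. The function names are hypothetical, the labels are made-up illustrative values, and the use of Cohen's kappa alongside the raw agreement rate is an assumption made here, not necessarily the statistic the authors report.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(original: Sequence[str], reannotated: Sequence[str]) -> float:
    """Fraction of instances whose new label matches the inherited English label."""
    if len(original) != len(reannotated) or not original:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(o == r for o, r in zip(original, reannotated))
    return matches / len(original)


def cohens_kappa(original: Sequence[str], reannotated: Sequence[str]) -> float:
    """Chance-corrected agreement between the two label sequences."""
    n = len(original)
    observed = agreement_rate(original, reannotated)
    counts_o, counts_r = Counter(original), Counter(reannotated)
    labels = set(counts_o) | set(counts_r)
    # Agreement expected if both label sets were assigned independently.
    expected = sum(counts_o[label] * counts_r[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only; these are NOT data from the paper.
original_labels = ["entailment", "neutral", "contradiction", "neutral"]
new_labels = ["entailment", "contradiction", "contradiction", "entailment"]

rate = agreement_rate(original_labels, new_labels)
kappa = cohens_kappa(original_labels, new_labels)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

A low agreement rate or kappa on a human-translated test set would suggest that the translated instances no longer support the labels they were meant to inherit.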

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper examines translation errors in cross-lingual learning benchmarks, particularly for low-resource languages. Specifically, the authors found:

1. **Translation Inconsistency**: In multilingual benchmarks (such as XNLI), translating data from English into the target languages introduces inconsistencies, and these disproportionately affect low-resource languages (such as Hindi and Urdu).
2. **Performance Gap**: Comparing zero-shot performance on human-translated and machine-translated text, the authors found that the gap for low-resource languages is substantially larger than for high-resource languages.
3. **Label Inconsistency**: Re-annotating human-translated data for Hindi and Urdu revealed substantial discrepancies between the new labels and the original English labels.

### Main Contributions

1. **Identifying Low-Quality Translations**: A practical method is proposed to identify low-quality human translations by comparing model performance on human translations with performance on machine translations.
2. **Persistence Across Different Training/Testing Settings**: The effect of translation errors persists across various training/testing settings, including training on machine-translated data and on paraphrases generated by back-translation.
3. **Manual Annotation Verification**: A portion of the natural language inference (NLI) data for Hindi and Urdu was manually annotated, revealing substantial differences between the newly annotated labels and the labels projected from the original English sentences.

### Experimental Setup

- **Tasks and Models**: The focus is primarily on the XNLI benchmark, a three-way classification task that asks whether a premise entails, contradicts, or is neutral with respect to a hypothesis.
- **Training Variants**:
  - **ORIG**: Original English training data.
  - **Backtranslated-train (B-TRAIN)**: English paraphrases generated through back-translation, using Spanish as the intermediary language.
- **Testing Variants**:
  - **Zero-shot (ZS)**: Human-translated development/test sets in the target language.
  - **Translate-test (TT)**: Target-language development/test sets machine-translated into English.
  - **Translate-from-English (TE)**: Machine translations from original English to the target language.
  - **Backtranslation-via-target (BT)**: Machine translations from original English to the target language and back to English.

### Results

- **Using Original English Training Set**: Table 1 shows XNLI accuracy with the original English training data across the different testing variants. The performance gap for low-resource languages (such as Urdu and Swahili) is substantially higher than for high-resource languages (such as French and Spanish).
- **Using Translated Training Set**: Table 2 shows the test accuracy of the XLMR model trained with B-TRAIN across the different testing variants. Although overall performance improved, the performance gap for low-resource languages remains substantial.

### Ethical Statement

The authors emphasize adherence to ethical practices throughout the research, ensuring fair compensation for human annotators and compliance with Google Translate's terms and conditions.

### Limitations

- For tasks where output labels are tied directly to positions in the input text (such as part-of-speech tagging or question answering), applying the proposed techniques would be more complex, because translation may alter word order and thereby affect the output labels.
- The paper does not provide specific methods to resolve translation errors but calls for additional checks when collecting translations for low-resource languages.
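
The gap-based check described in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sign convention (machine-translated accuracy minus human-translated accuracy), the threshold, and all accuracy values are hypothetical and chosen only to show the mechanics.

```python
from typing import Dict, List


def translation_gaps(
    human_acc: Dict[str, float], machine_acc: Dict[str, float]
) -> Dict[str, float]:
    """Per-language gap: accuracy on machine-translated text minus
    accuracy on human-translated text (sign convention assumed here)."""
    return {
        lang: machine_acc[lang] - human_acc[lang]
        for lang in human_acc
        if lang in machine_acc
    }


def flag_languages(gaps: Dict[str, float], threshold: float) -> List[str]:
    """Languages whose gap exceeds the threshold, largest gap first."""
    flagged = [lang for lang, gap in gaps.items() if gap > threshold]
    return sorted(flagged, key=lambda lang: gaps[lang], reverse=True)


# Made-up accuracies for illustration; NOT values from the paper's tables.
human_translated = {"fr": 0.80, "es": 0.81, "ur": 0.65, "sw": 0.66}
machine_translated = {"fr": 0.81, "es": 0.81, "ur": 0.72, "sw": 0.71}

gaps = translation_gaps(human_translated, machine_translated)
# The 0.03 threshold is an arbitrary, hypothetical cutoff.
suspect = flag_languages(gaps, threshold=0.03)
print(suspect)
```

In practice the threshold would be chosen relative to the gaps observed for high-resource languages, since the summary characterizes gaps as large only by comparison across languages.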

Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning
