Translating away Translationese without Parallel Data

Rricha Jalota, Koel Dutta Chowdhury, Cristina España-Bonet, Josef van Genabith
2023-10-29
Abstract: Translated texts exhibit systematic linguistic differences compared to original texts in the same language, and these differences are referred to as translationese. Translationese has effects on various cross-lingual natural language processing tasks, potentially leading to biased results. In this paper, we explore a novel approach to reduce translationese in translated texts: translation-based style transfer. As there are no parallel human-translated and original data in the same language, we use a self-supervised approach that can learn from comparable (rather than parallel) monolingual original and translated data. However, even this self-supervised approach requires some parallel data for validation. We show how we can eliminate the need for parallel validation data by combining the self-supervised loss with an unsupervised loss. This unsupervised loss leverages the original language model loss over the style-transferred output and a semantic similarity loss between the input and style-transferred output. We evaluate our approach in terms of original vs. translationese binary classification in addition to measuring content preservation and target-style fluency. The results show that our approach is able to reduce translationese classifier accuracy to a level of a random classifier after style transfer while adequately preserving the content and fluency in the target original style.
Computation and Language
What problem does this paper attempt to address?
This paper addresses "translationese" in translated texts: the systematic linguistic differences that translated texts exhibit compared to texts originally written in the same language. These differences can affect cross-lingual natural language processing tasks and lead to biased results. To tackle this problem, the researchers propose translation-based style transfer, a method that reduces translationese in translated texts without parallel data. Specifically, the method takes a self-supervised neural machine translation system and applies it to the style transfer task, learning from comparable (rather than parallel) monolingual original and translated data. Because no parallel human-translated and original data exist in the same language, the researchers further propose a joint self-supervised and unsupervised learning criterion. The unsupervised part combines an original-language language model loss over the style-transferred output with a semantic similarity loss between the input and the output, which eliminates the need for parallel data during training and validation.
The main contributions are:
1. Framing, for the first time, the reduction of translationese in human-translated texts as a monolingual translation-based style transfer task, which allows direct evaluation of the surface form of the generated output.
2. Introducing a joint self-supervised and unsupervised learning criterion that does not require parallel original-translation data for training or validation.
3. Showing experimentally that the method reduces the accuracy of a translationese classifier to the level of a random classifier, indicating that translationese signals are largely removed from the output.
4. Providing extensive quantitative and qualitative analyses of the method's ability to mitigate translationese while preserving content and fluency.
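The joint criterion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the additive combination, the weights `lm_weight` and `sim_weight`, the use of cosine distance as the semantic similarity loss, and all function names are hypothetical choices made for the example. In a real system the token log-probabilities would come from a language model trained on original text and the embeddings from a sentence encoder.

```python
import math


def lm_loss(token_log_probs):
    """Mean negative log-likelihood of the output tokens under an
    original-language language model (log-probabilities are given)."""
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    return -sum(token_log_probs) / len(token_log_probs)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_loss(input_emb, output_emb):
    """Assumed form of the similarity loss: cosine distance between the
    input sentence embedding and the style-transferred output embedding."""
    return 1.0 - cosine_similarity(input_emb, output_emb)


def joint_loss(self_supervised_loss, token_log_probs, input_emb, output_emb,
               lm_weight=1.0, sim_weight=1.0):
    """Self-supervised loss plus an unsupervised term (LM loss + similarity
    loss). The weighting scheme is an assumption for this sketch."""
    unsupervised = (lm_weight * lm_loss(token_log_probs)
                    + sim_weight * semantic_loss(input_emb, output_emb))
    return self_supervised_loss + unsupervised


# Toy usage: an output whose two tokens each have probability 0.5 under the
# LM, and whose embedding is identical to the input embedding.
example = joint_loss(0.5, [math.log(0.5), math.log(0.5)],
                     [1.0, 0.0], [1.0, 0.0])
```

In this toy call the similarity term vanishes because the two embeddings coincide, so the total is the self-supervised loss plus the language model term alone. An output that drifts in meaning from the input would raise the similarity term, while an output that reads less like original text would raise the language model term.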
In summary, this paper proposes a style-transfer approach that mitigates translationese in translated texts without parallel data, which matters for cross-lingual natural language processing tasks whose results translationese can bias.