Escaping the sentence-level paradigm in machine translation

Matt Post, Marcin Junczys-Dowmunt
2024-05-16
Abstract: It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in light of competitive pressure from large language models, which are natively document-based. Much work in document-context machine translation exists, but for various reasons has been unable to catch hold. This paper suggests a path out of this rut by addressing three impediments at once: what architectures should we use? where do we get document-level information for training them? and how do we know whether they are any good? In contrast to work on specialized architectures, we show that the standard Transformer architecture is sufficient, provided it has enough capacity. Next, we address the training data issue by taking document samples from back-translated data only, where the data is not only more readily available, but is also of higher quality compared to parallel document data, which may contain machine translation output. Finally, we propose generative variants of existing contrastive metrics that are better able to discriminate among document systems. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the success of these three pieces together in improving document-level performance.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a key problem in machine translation (MT): **moving from the sentence-level translation paradigm to document-level translation**. It focuses on the following points:

1. **The importance of document context**:
   - Document context is crucial for resolving many translation ambiguities.
   - The document, not the isolated sentence, is the natural setting for nearly all translation.
2. **Existing problems and challenges**:
   - Most MT research and production systems remain in the sentence-level paradigm and make little use of document-level information.
   - Large language models (LLMs) are natively document-based, which puts competitive pressure on traditional MT.
   - Three open questions have held document-level translation back: which architectures to use, where to get document-level training data, and how to evaluate the resulting systems.
3. **Solutions**:
   - **Architecture selection**: The paper shows that the standard Transformer architecture is sufficient for document-level translation, provided it has enough capacity, so specialized architectures are not required.
   - **Training data**: Document samples are taken from back-translated data only. This data is more readily available and of higher quality than parallel document data, which may contain machine translation output.
   - **Evaluation method**: The paper proposes generative variants of existing contrastive metrics, which are better able to discriminate among document systems.

### Main contributions of the paper

- **Show that the standard Transformer architecture is sufficient**: With enough model capacity, the standard Transformer handles document-level translation without architectural changes.
- **Propose a source of document-level training data**: Document samples are drawn from back-translated data only, avoiding parallel document data that may be contaminated with machine translation output.
- **Improve the evaluation method**: Generative variants of existing contrastive metrics discriminate more reliably among document-level translation systems.

### Summary

The paper aims to move machine translation from the sentence-level paradigm to the document-level paradigm by addressing architecture, training data, and evaluation together. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) show that the three pieces jointly improve document-level performance.
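To make the evaluation point concrete, here is a minimal sketch of how a contrastive metric and a generative variant could differ. It is not the authors' implementation: the abstract only says that generative variants of contrastive metrics are proposed. The example fields (`source`, `correct`, `distractors`, `expected_word`), the rule "the system output must contain the expected word", and the toy `toy_score` and `toy_translate` functions are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List


def tokens(text: str) -> List[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def contrastive_accuracy(examples: List[Dict],
                         score: Callable[[str, str], float]) -> float:
    """Share of examples where the model scores the correct target
    strictly higher than every distractor target."""
    hits = 0
    for ex in examples:
        good = score(ex["source"], ex["correct"])
        if all(good > score(ex["source"], bad) for bad in ex["distractors"]):
            hits += 1
    return hits / len(examples)


def generative_accuracy(examples: List[Dict],
                        translate: Callable[[str], str]) -> float:
    """Share of examples where the system's own translation contains
    the expected word (an assumed, simplified matching rule)."""
    hits = 0
    for ex in examples:
        if ex["expected_word"].lower() in tokens(translate(ex["source"])):
            hits += 1
    return hits / len(examples)


# Hypothetical test item: German "Er" refers to "der Löffel" and must
# be translated as "it", which requires the previous sentence as context.
examples = [{
    "source": "Der Löffel ist alt. Er ist aus Silber.",
    "correct": "The spoon is old. It is made of silver.",
    "distractors": ["The spoon is old. He is made of silver."],
    "expected_word": "it",
}]


def toy_score(source: str, target: str) -> float:
    """Toy scorer that prefers targets containing 'it'."""
    return 1.0 if "it" in tokens(target) else 0.0


def toy_translate(source: str) -> str:
    """Toy system that nevertheless generates the wrong pronoun."""
    return "The spoon is old. He is made of silver."


contrastive = contrastive_accuracy(examples, toy_score)
generative = generative_accuracy(examples, toy_translate)
print(contrastive, generative)
```

In this toy setup the system ranks the correct translation above the distractor yet produces the wrong pronoun when it actually translates, so the two numbers disagree. That gap is one plausible reason a generation-based check could separate document systems more sharply than ranking alone.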