Domain Terminology Integration into Machine Translation: Leveraging Large Language Models

Yasmin Moslem, Gianfranco Romani, Mahdi Molaei, Rejwanul Haque, John D. Kelleher, Andy Way
2023-10-23
Abstract: This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve how accurately machine translation (MT) systems translate domain-specific terms. Specifically, it seeks to integrate pre-approved technical terms during translation, thereby enhancing communication and understanding in specialised domains. To this end, the research team designed a multi-step process that uses large language models (LLMs) to generate bilingual synthetic data and to raise the usage rate of required terms in the translations through automatic post-editing with term constraints.

### Main Research Questions:

1. **Term Integration Challenges**: How can domain-specific terms be integrated effectively into the machine translation process so that translations become more accurate?
2. **Automatic Post-Editing with Term Constraints**: How can large language models insert missing terms into the translated text while maintaining the fluency and accuracy of the translation?

### Research Methods:

1. **Generate Bilingual Synthetic Data**: Use large language models (such as ChatGPT) to generate bilingual synthetic data based on the provided terms.
2. **Hybrid Fine-Tuning**: Fine-tune the generic encoder-decoder MT model on a mix of the generated term-based synthetic data and a random sample of the original generic training data.
3. **Generate Translations**: Use the fine-tuned MT model to generate translations for the development set, test set, and blind test set.
4. **Automatic Post-Editing with Term Constraints**: Use large language models (such as ChatGPT) to post-edit only those translations that do not contain the required terms, with the aim of incorporating the pre-approved terms.
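
The last two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the prompt wording, and the case-insensitive substring check are assumptions made for illustration, and `llm` stands in for any callable that sends a prompt to a language model and returns its text reply.

```python
from typing import Callable, Dict, List


def missing_terms(translation: str, required: List[str]) -> List[str]:
    """Return the required target terms absent from the translation.

    Assumption: a case-insensitive substring match is good enough here.
    """
    lowered = translation.lower()
    return [term for term in required if term.lower() not in lowered]


def build_post_edit_prompt(source: str, translation: str,
                           terms: Dict[str, str]) -> str:
    """Build a hypothetical post-editing prompt (wording is illustrative)."""
    term_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in terms.items())
    return (
        "Revise the translation so that it uses the required terms.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        f"Required terms:\n{term_lines}\n"
        "Revised translation:"
    )


def post_edit(source: str, translation: str, terms: Dict[str, str],
              llm: Callable[[str], str]) -> str:
    """Call the LLM only when at least one required term is missing."""
    missing = missing_terms(translation, list(terms.values()))
    if not missing:
        return translation  # already compliant, leave the MT output untouched
    needed = {src: tgt for src, tgt in terms.items() if tgt in missing}
    return llm(build_post_edit_prompt(source, translation, needed)).strip()
```

Filtering before calling the model mirrors the description above: translations that already contain the required terms are passed through unchanged, so the LLM is only asked to revise the sentences that need it.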
### Experimental Results:

- **Significant Increase in Term Usage Rate**: On the blind test set, the term usage rate increased from an average of 36.67% with the baseline model to an average of 72.88% after LLM post-editing.
- **Improvement in Translation Quality**: Automatic post-editing with term constraints not only increased the term usage rate but also improved the overall quality of translation for the three language pairs: German-English (DE-EN), English-Czech (EN-CS), and Chinese-English (ZH-EN).

### Conclusions:

By combining term-based synthetic data generation, hybrid fine-tuning, and automatic post-editing with term constraints, the research team improved the accuracy with which machine translation systems translate domain-specific terms. The method nearly doubled the term usage rate and also improved the overall quality of translation, demonstrating the potential of large language models for terminology-constrained translation tasks.
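
To make the reported term usage rate concrete, the sketch below shows one way such a percentage could be computed and how the two averages compare. The `term_usage_rate` function and its case-insensitive substring matching are assumptions for illustration; the shared task's official scoring may differ. Only the two averages, 36.67% and 72.88%, come from the paper.

```python
from typing import List


def term_usage_rate(translations: List[str],
                    required_terms: List[List[str]]) -> float:
    """Percentage of required target terms found in their translations.

    Assumption: a term counts as used if it appears as a
    case-insensitive substring of the corresponding translation.
    """
    total = sum(len(terms) for terms in required_terms)
    if total == 0:
        return 0.0
    found = sum(
        term.lower() in translation.lower()
        for translation, terms in zip(translations, required_terms)
        for term in terms
    )
    return 100.0 * found / total


# Averages reported for the blind dataset.
baseline, final = 36.67, 72.88
ratio = final / baseline   # relative improvement
gain = final - baseline    # absolute improvement in percentage points
print(f"{ratio:.2f}x, +{gain:.2f} percentage points")
# prints "1.99x, +36.21 percentage points"
```

A ratio of about 1.99 is what the abstract describes as term utilisation nearly doubling.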