Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita, Sadao Kurohashi
2023-11-07
Abstract: Lecture transcript translation helps learners understand online courses; however, publicly available parallel corpora for building a high-quality lecture machine translation system are lacking. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than using the BERTScore, LASER, or sentBERT methods. For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. Furthermore, this study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.
Computation and Language
What problem does this paper attempt to address?
The paper addresses subtitle translation for online courses, especially lectures, so that non-native English speakers can follow course content. It concentrates on building high-quality machine translation systems for lecture subtitles between English and Japanese and between English and Chinese. Such systems are hard to build because publicly available parallel corpora (datasets of text in two languages aligned with each other) are lacking for this domain.
To tackle this, the authors propose a framework for mining parallel corpora from public platforms such as Coursera. They develop a sentence alignment algorithm based on dynamic programming that uses the cosine similarity of machine-translated sentences to decide which sentences correspond. In their experiments the method reaches a sentence alignment F1 score of 96%, higher than BERTScore, LASER, or sentBERT. Using this algorithm, the authors extract approximately 50,000 lines of English-Japanese and English-Chinese parallel corpora from Coursera and create development and test sets through manual filtering to benchmark translation performance.
Machine translation experiments show that the mined corpora improve the quality of lecture subtitle translation when used together with out-of-domain parallel corpora via multistage fine-tuning. The paper also gives guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise from the mined data, and creating high-quality evaluation splits. For reproducibility, the authors have released the corpora and the code used to create them.
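To make the alignment step described above more concrete, here is a minimal sketch of a monotonic, one-to-one dynamic-programming aligner. It is not the authors' implementation: the bag-of-words representation, the `threshold` parameter, the function names `cosine` and `align`, and the toy sentences are all assumptions made for illustration. The only ideas taken from the paper are that one side is machine-translated first and that candidate pairs are scored by cosine similarity inside a dynamic program.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors (an assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def align(mt: list[str], ref: list[str], threshold: float = 0.3) -> list[tuple[int, int]]:
    """Return index pairs (i, j) aligning mt[i] with ref[j].

    The alignment is monotonic and one-to-one, may skip sentences on either
    side, and maximizes the summed similarity of the matched pairs.
    """
    n, m = len(mt), len(ref)
    sim = [[cosine(s, t) for t in ref] for s in mt]

    # best[i][j]: best total score using the first i mt and first j ref sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])  # skip one sentence
            s = sim[i - 1][j - 1]
            if s >= threshold:  # only sufficiently similar pairs may match
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + s)

    # Backtrack from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy data: `mt_en` stands in for machine-translated subtitles, `en` for the
# original English transcript, which contains one extra, unmatched line.
mt_en = [
    "welcome to this course",
    "today we will discuss neural networks",
    "the network has many layers",
]
en = [
    "welcome to the course",
    "today we discuss neural networks",
    "please subscribe",
    "a network has many layers",
]
print(align(mt_en, en))  # [(0, 0), (1, 1), (2, 3)]
```

In this toy run the unmatched line "please subscribe" is skipped because no machine-translated sentence reaches the similarity threshold against it. A real pipeline would likely need stronger sentence representations and some handling of one-to-many alignments, neither of which this sketch attempts.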
In summary, the paper's core contribution is an effective method for mining parallel corpora from educational lectures and using them, through multistage fine-tuning, to improve the translation of online course subtitles, which in turn helps learners worldwide access course content.