Abstract: Lecture transcript translation helps learners understand online courses. However, building a high-quality lecture machine translation system is hindered by the lack of publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than that obtained with the BERTScore, LASER, or sentBERT methods. For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.

What problem does this paper attempt to address?

The paper addresses the translation of lecture transcripts (subtitles) for online courses, with the aim of making course content more accessible to non-native English speakers. Specifically, it concentrates on building high-quality machine translation systems for lecture transcripts between English and Japanese and between English and Chinese. Building such systems is difficult because publicly available parallel corpora, that is, datasets of text in two languages aligned sentence by sentence, are lacking for this domain. To tackle this, the authors propose a framework for mining parallel corpora from publicly available lectures on Coursera. They first develop a sentence alignment algorithm based on dynamic programming, which scores candidate sentence pairs by the cosine similarity of machine-translated sentences. In their experiments this method reaches a sentence alignment F1 score of 96%, higher than BERTScore, LASER, or sentBERT. The authors then use the algorithm to extract approximately 50,000 lines of parallel text for each of the English-Japanese and English-Chinese pairs, and create development and test sets through manual filtering to benchmark translation performance. Machine translation experiments show that the mined corpora improve the quality of lecture transcript translation when combined with out-of-domain parallel corpora through multistage fine-tuning. The paper also provides guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise from the mined data, and creating high-quality evaluation splits. To support reproducibility, the authors have released the corpora and the code used to create them.
In summary, the core contribution of the paper is an effective method for mining and using parallel corpora in the educational domain, which can help improve and speed up the translation of online course transcripts and thereby support the wider dissemination of knowledge.
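
The alignment idea summarized above can be illustrated with a minimal sketch. This is not the authors' released implementation: the bag-of-words cosine similarity, the restriction to one-to-one matches and skips, the `min_sim` threshold, and the names `cosine` and `align` are all assumptions made for illustration. The sketch takes source sentences that have already been machine-translated into the target language and uses dynamic programming to find the monotonic alignment with the highest total similarity.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors (an assumed, simple choice)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def align(mt_source, target, min_sim=0.3):
    """Monotonic alignment by dynamic programming.

    mt_source: source sentences already machine-translated into the target language.
    target:    sentences in the target language.
    min_sim:   hypothetical threshold; a pair scores (similarity - min_sim),
               so pairs below the threshold are skipped rather than matched.
    Returns a list of (source_index, target_index) pairs.
    """
    n, m = len(mt_source), len(target)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # best total score so far
    back = [[None] * (m + 1) for _ in range(n + 1)]  # move that produced it

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                gain = cosine(mt_source[i - 1], target[j - 1]) - min_sim
                candidates.append((best[i - 1][j - 1] + gain, "match"))
            if i > 0:
                candidates.append((best[i - 1][j], "skip_source"))
            if j > 0:
                candidates.append((best[i][j - 1], "skip_target"))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the end to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_source":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy example: the second target line is noise with no counterpart.
mt_source = [
    "welcome to this course on machine learning",
    "today we discuss linear regression",
    "thanks for watching",
]
target = [
    "welcome to the machine learning course",
    "uh okay",
    "today we will discuss linear regression",
    "thanks for watching",
]
print(align(mt_source, target))  # [(0, 0), (1, 2), (2, 3)]
```

In this toy run the noise line is skipped because matching it would score below the threshold. A fuller system would presumably also handle many-to-one alignments and use a stronger similarity measure, neither of which is specified in the summary above.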

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts
