Abstract: Lecture transcript translation helps learners understand online courses; however, building a high-quality lecture machine translation system is hindered by the lack of publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm that leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than that obtained with the BERTScore, LASER, or sentBERT methods. For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.
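
The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of the general idea, not the released implementation. It assumes that the target-side sentences have already been machine-translated into the source language, that similarity is a simple bag-of-words cosine, and that alignments are monotonic and one-to-one with skips allowed. The function names `cosine` and `align` and the `threshold` parameter are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity of two whitespace-tokenized sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(src, tgt_mt, threshold=0.5):
    """Monotonic 1-1 alignment with skips that maximizes total similarity.

    src:       sentences in the source language.
    tgt_mt:    target sentences machine-translated into the source language.
    threshold: assumed minimum similarity for a pair to count as a match.
    Returns a list of (src_index, tgt_index) pairs in document order.
    """
    n, m = len(src), len(tgt_mt)
    sim = [[cosine(s, t) for t in tgt_mt] for s in src]

    # score[i][j] = best total similarity when aligning src[:i] with tgt_mt[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            match = score[i - 1][j - 1] + (s if s >= threshold else 0.0)
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])

    # Backtrace from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1  # skip a source sentence
        else:
            j -= 1  # skip a target sentence
    return pairs[::-1]


# Toy example: the second target line has no counterpart and is skipped.
example_src = [
    "welcome to the course",
    "today we study neural networks",
    "thank you for watching",
]
example_tgt_mt = [
    "welcome to this course",
    "please subscribe",
    "today we will study neural networks",
    "thanks for watching",
]
print(align(example_src, example_tgt_mt))  # → [(0, 0), (1, 2), (2, 3)]
```

A real pipeline would likely replace the whitespace tokenizer with a language-appropriate one and might allow one-to-many alignments; both are left out here to keep the sketch short.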

Mining Large-scale Parallel Corpora from Multilingual Patents: an English-Chinese Example and Its Application to SMT

Building a Large English-Chinese Parallel Corpus from Comparable Patents and Its Experimental Application to SMT

The Cultivation of a Chinese-English-Japanese Trilingual Parallel Corpus from Comparable Patents

The Construction of a Chinese-English Patent Parallel Corpus

Mining a Large Chinese-English Corpus from Web

Mining Parallel Knowledge from Comparable Patents

Mining Chinese-English Parallel Corpora from the Web

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Automatic English-Chinese parallel corpus acquisition and sentences extraction

Bilingual Multi-word Expressions, Multiple-correspondence, and Their Cultivation from Parallel Patents: the Chinese-English Case

Constructing of a large-scale Chinese-English parallel corpus

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

Mining Parallel Text from the Web Based on Sentence Alignment

Automatically Mining Parallel Corpora for Minority Languages from Web Pages

Chinese-English Parallel Corpus Construction And Its Application

Development of Translation Database based on Chinese-English parallel corpora

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Automatic Acquisition of Large-scale Bilingual Sentence Pair

Design and Implementation of Bilingual Parallel Web Page Mining System