Abstract: Parallel texts are a relatively rare language resource; however, they constitute very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised, which makes them suitable for large-scale research. The task is highly practical: non-parallel multilingual data occur much more frequently than parallel corpora and are easy to access, although parallel sentences are a considerably more useful resource. In this study, we propose a method of automatic web crawling to build topic-aligned comparable corpora, e.g. based on Wikipedia or Euronews.com. We also developed new methods of obtaining parallel sentences from comparable data and proposed corpus filtering methods capable of identifying inconsistent or only partially equivalent translations. Our methods are easily scalable to other languages. The quality of the created corpora was evaluated by analysing the impact of their use on statistical machine translation systems. Experiments are presented for the Polish-English language pair on texts from different domains, i.e. lectures, phrasebooks, film dialogues, European Parliament proceedings and texts contained in medicine leaflets. We also tested a second method of creating parallel corpora from comparable corpora, which automatically expands an existing corpus of sentences about a given domain on the basis of analogies found between them. It therefore does not require pre-existing parallel resources in order to train a classifier.

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

Tuned and GPU-accelerated parallel data mining from comparable corpora

Mining Parallel Text from the Web Based on Sentence Alignment

Mining Chinese-English Parallel Corpora from the Web

Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

Parallel sentences mining from the web

Web-based parallel corpora for statistical machine translation

A Feasible Process for Mining Corpus from Web

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

Automatic Acquisition of Large-scale Bilingual Sentence Pair

Unsupervised Parallel Corpus Mining on Web Data

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Building a Large English-Chinese Parallel Corpus from Comparable Patents and Its Experimental Application to SMT

The Web as a Parallel Corpus

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Automatic English-Chinese parallel corpus acquisition and sentences extraction

Automatically Mining Parallel Corpora for Minority Languages from Web Pages