Abstract: Parallel texts are a relatively rare language resource; however, they constitute very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised, which makes them well suited to large-scale research. The task is highly practical: non-parallel multilingual data occur much more frequently than parallel corpora and are easy to access, although parallel sentences are a considerably more useful resource. In this study, we propose a method of automatic web crawling to build topic-aligned comparable corpora, e.g. based on Wikipedia or Euronews.com. We also developed new methods of obtaining parallel sentences from comparable data and proposed methods of corpus filtration capable of identifying inconsistent or only partially equivalent translations. Our methods are easily scalable to other languages. The quality of the created corpora was evaluated by analysing the impact of their use on statistical machine translation systems. Experiments were conducted on the Polish-English language pair for texts from different domains, i.e. lectures, phrasebooks, film dialogues, European Parliament proceedings and texts contained in medicine leaflets. We also tested a second method of creating parallel corpora from comparable corpora, which automatically expands an existing corpus of sentences about a given domain on the basis of analogies found between them. It therefore does not require existing parallel resources in order to train a classifier.

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora