Abstract: We present an aligned parallel corpus of translations of the Bible in Spanish. Our collection encompasses a total of eleven Bibles, originating from diverse centuries (XVI, XIX, XX), various religious denominations (Protestant, Catholic), and geographical regions (Spain, Latin America). The verses have been carefully aligned across these translations so that the content is organized coherently. As a result, this corpus serves as a convenient resource for various linguistic analyses, including paraphrase detection, semantic clustering, and the exploration of biases present within the texts. To illustrate the utility of this resource, we provide several examples that demonstrate how it can be employed in these applications.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper aims to construct an aligned corpus containing Spanish Bible translations. Specifically, the goals of the paper include:

1. **Constructing an Aligned Corpus**: The authors collected 11 different versions of the Spanish Bible, which come from different centuries (16th century, 19th century, 20th century), religious denominations (Protestant, Catholic), and regions (Spain, Latin America). By aligning these versions, they ensure that the content remains consistent across all versions.
2. **Supporting Various Linguistic Analyses**: This aligned corpus can be used for various linguistic studies, including paraphrase detection, semantic clustering, and bias exploration in texts. Examples are provided to demonstrate the effectiveness and practicality of this resource in these applications.
3. **Addressing Challenges in the Alignment Process**: Due to differences in chapter and verse divisions among different versions of the Bible, the alignment process faces certain challenges. The authors use a combination of automatic alignment and manual proofreading to ensure the accuracy of the alignment.

### Keywords

- Aligned Corpus
- Paraphrase Detection
- Semantic Clustering
- Subjective Bias
- Bible Corpus
- Dialectal Differences

### Research Background

- The Bible is the most translated book in the world, and its extensive and diverse nature makes it an ideal resource for parallel corpora in computational linguistics.
- Different versions of Bible translations not only reflect linguistic changes but also include religious, political, and theological differences.
- By constructing an aligned corpus, it is possible to better study issues such as language change, bias, and ideology.

### Methods

- **Data Collection**: 11 different versions of the Spanish Bible were selected, ensuring diversity in terms of time, geography, and religion.
- **Alignment Process**: Python programs were used to automatically align the chapters and verses of each version, followed by manual proofreading to ensure content consistency.
- **Application Examples**: Specific examples are provided to show how this corpus can be used for paraphrase detection, semantic clustering, and bias analysis.

### Conclusion

- This aligned corpus is a valuable resource that can be used for various linguistic studies, particularly paraphrase detection, semantic clustering, and bias analysis.
- Future research can further expand the application scope of this corpus, exploring more issues in linguistics and computational linguistics.
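The alignment step described under Methods (automatic alignment by chapter and verse, followed by manual proofreading) could look roughly like the sketch below. This is not the authors' code: the data layout, the version labels `version_a` and `version_b`, the placeholder verse texts, and the helper names `align_verses` and `needs_review` are all assumptions made for illustration. The idea is to group verses by their (book, chapter, verse) reference and to flag references that are missing from some version, since differing verse divisions are where manual review would be needed.

```python
from collections import defaultdict

# Hypothetical toy data: version name -> {(book, chapter, verse): text}.
# Version labels and verse texts are placeholders, not taken from the corpus.
versions = {
    "version_a": {
        ("Gen", 1, 1): "texto a 1",
        ("Gen", 1, 2): "texto a 2",
    },
    "version_b": {
        ("Gen", 1, 1): "texto b 1",
        ("Gen", 1, 2): "texto b 2",
        ("Gen", 1, 3): "texto b 3",
    },
}


def align_verses(versions):
    """Group verse texts by (book, chapter, verse) reference across versions."""
    aligned = defaultdict(dict)
    for name, verses in versions.items():
        for ref, text in verses.items():
            aligned[ref][name] = text
    return dict(aligned)


def needs_review(aligned, version_names):
    """Return, in sorted order, the references missing from at least one version."""
    expected = set(version_names)
    return sorted(ref for ref, texts in aligned.items() if set(texts) != expected)


aligned = align_verses(versions)
flagged = needs_review(aligned, versions.keys())
print(flagged)
```

In this toy input, only one version contains a third verse, so that reference is the one handed to a human proofreader. A real pipeline would also need to handle verses that are merged or split differently across versions, which a plain key match cannot detect.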

An aligned corpus of Spanish bibles
