An aligned corpus of Spanish bibles

Gerardo Sierra, Gemma Bel-Enguix, Ameyali Díaz-Velasco, Natalia Guerrero-Cerón, Núria Bel
DOI: https://doi.org/10.1007/s10579-024-09726-y
2024-03-16
Language Resources and Evaluation
Abstract: We present a comprehensive and valuable resource in the form of an aligned parallel corpus comprising translations of the Bible in Spanish. Our collection encompasses a total of eleven Bibles, originating from diverse centuries (XVI, XIX, XX), various religious denominations (Protestant, Catholic), and geographical regions (Spain, Latin America). The process of aligning the verses across these translations has been meticulously carried out, ensuring that the content is organized in a coherent manner. As a result, this corpus serves as a useful, convenient resource for various linguistic analyses, including paraphrase detection, semantic clustering, and the exploration of biases present within the texts. To illustrate the utility of this resource, we provide several examples that demonstrate how it can be effectively employed in these applications.
computer science, interdisciplinary applications
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper aims to construct an aligned corpus containing Spanish Bible translations. Specifically, the goals of the paper include:

1. **Constructing an Aligned Corpus**: The authors collected 11 different versions of the Spanish Bible, which come from different centuries (16th century, 19th century, 20th century), religious denominations (Protestant, Catholic), and regions (Spain, Latin America). By aligning these versions, they ensure that the content remains consistent across all versions.
2. **Supporting Various Linguistic Analyses**: This aligned corpus can be used for various linguistic studies, including paraphrase detection, semantic clustering, and bias exploration in texts. Examples are provided to demonstrate the effectiveness and practicality of this resource in these applications.
3. **Addressing Challenges in the Alignment Process**: Due to differences in chapter and verse divisions among different versions of the Bible, the alignment process faces certain challenges. The authors use a combination of automatic alignment and manual proofreading to ensure the accuracy of the alignment.

### Keywords

- Aligned Corpus
- Paraphrase Detection
- Semantic Clustering
- Subjective Bias
- Bible Corpus
- Dialectal Differences

### Research Background

- The Bible is the most translated book in the world, and its extensive and diverse nature makes it an ideal resource for parallel corpora in computational linguistics.
- Different versions of Bible translations not only reflect linguistic changes but also include religious, political, and theological differences.
- By constructing an aligned corpus, it is possible to better study issues such as language change, bias, and ideology.

### Methods

- **Data Collection**: 11 different versions of the Spanish Bible were selected, ensuring diversity in terms of time, geography, and religion.
- **Alignment Process**: Python programs were used to automatically align the chapters and verses of each version, followed by manual proofreading to ensure content consistency.
- **Application Examples**: Specific examples are provided to show how this corpus can be used for paraphrase detection, semantic clustering, and bias analysis.

### Conclusion

- This aligned corpus is a valuable resource that can be used for various linguistic studies, particularly paraphrase detection, semantic clustering, and bias analysis.
- Future research can further expand the application scope of this corpus, exploring more issues in linguistics and computational linguistics.
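
### Illustrative Sketch of the Alignment Step

The summary states only that Python programs aligned chapters and verses automatically, with manual proofreading afterwards; it does not describe the authors' code. The following is a minimal sketch of one way such a step could look, not the paper's implementation. It assumes each version is already parsed into a dictionary keyed by a hypothetical `(book, chapter, verse)` tuple. The function name `align_versions`, the version labels, and the placeholder strings are all invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key format: (book, chapter, verse).
VerseKey = Tuple[str, int, int]
Version = Dict[VerseKey, str]
AlignedRow = Tuple[VerseKey, Dict[str, Optional[str]]]


def align_versions(
    versions: Dict[str, Version],
) -> Tuple[List[AlignedRow], List[Tuple[VerseKey, List[str]]]]:
    """Align verses by reference and list the gaps that need manual review."""
    # Collect every verse reference in first-seen order (dicts keep insertion order).
    keys: Dict[VerseKey, None] = {}
    for verses in versions.values():
        for key in verses:
            keys.setdefault(key, None)

    aligned: List[AlignedRow] = []
    needs_review: List[Tuple[VerseKey, List[str]]] = []
    for key in keys:
        # One row per reference; None marks a version that lacks this verse.
        row = {name: verses.get(key) for name, verses in versions.items()}
        missing = [name for name, text in row.items() if text is None]
        if missing:
            needs_review.append((key, missing))
        aligned.append((key, row))
    return aligned, needs_review


# Toy input with placeholder strings, not real corpus data.
toy_versions = {
    "version_a": {("Gen", 1, 1): "text A 1:1", ("Gen", 1, 2): "text A 1:2"},
    "version_b": {("Gen", 1, 1): "text B 1:1"},
}
aligned, needs_review = align_versions(toy_versions)
for key, missing in needs_review:
    print(f"{key} is missing in: {', '.join(missing)}")
```

In this sketch, the references flagged in `needs_review` would be handed to a human proofreader. That loosely mirrors the combination of automatic alignment and manual checking the summary describes for versions whose chapter and verse divisions differ.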