A primer on getting neologisms from foreign languages to under-resourced languages

Luis Camacho
2023-03-07
Abstract: Mainly due to a lack of support, most under-resourced languages have a reduced lexicon in realms and domains of increasing importance, so their speakers need to augment it significantly. Although neologisms should arise from the languages themselves, external sources are widely accepted. However, we dispute the "common sense" of using the imposed official languages, which are very likely a legacy of colonialism, as the only source, and we propose introducing neologisms from any language as long as these neologisms "sound like" native words of the target languages.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the insufficient vocabulary of under-resourced languages (such as Quechua). Due to a lack of support, most under-resourced languages have a very limited vocabulary in many important fields and domains, so their speakers need to expand the lexicon significantly. While new words (neologisms) should ideally originate from these languages themselves, external sources are also widely accepted. However, the authors oppose the practice of introducing new words solely from one foreign language (usually the officially mandated language), arguing that this may be a remnant of colonial influence. The paper therefore proposes a method for introducing new words from any language, provided they conform to the phonology of the target language.

### Main Contributions

1. **Automated Search Algorithm**: Developed code that searches any language with a phonetic database for words that conform to Quechua phonology.
2. **Database Construction**: Proposed a database containing 41,722 suggested new words.

### Methodology

1. **Data Sources**: Utilized two main data sources, Open Dictionary and Wikipron, both of which provide International Phonetic Alphabet (IPA) representations for multiple languages.
2. **Algorithm Implementation**: Wrote a script, `GettingNeologisms.ipynb`, that collects all the spelling and pronunciation rules of Quechua. Executing the script generated two intermediate files, one per data source, containing the words that met the criteria.
3. **Word Translation**: Used the `SincronizationWordNet.ipynb` script to match the found words with their corresponding languages and obtain English translations through WordNet.
4. **Phonetic Conversion**: Converted the IPA representations of the selected new words into Quechua orthography.
5. **Final Translation**: Used the `GOOGLETRANSLATE` function in Google Sheets to translate the new words from the original language into Spanish.

### Results

- **Open Dictionary Data Source**: Processed a total of 3,838,348 words and found 14,970 new words that met the criteria, of which 1,768 were translated into English.
- **Wikipron Data Source**: Processed a total of 3,506,258 words and found 26,752 new words that met the criteria, of which 6,118 were translated into English.

### Discussion

- **Relevance**: Introducing new words is particularly important for under-resourced languages, but for social and political reasons these new words may not be accepted. The authors believe that the issue lies not in the foreign origin of the new words, but in whether they are adopted by consensus among authoritative or representative groups.
- **Future Work**: Expand the data sources, improve the recognition rate of new words, and adjust the selection criteria so that more words meet the requirements. Additionally, standardize the code so it can be applied to other languages.

### Conclusion

- **Importance of Loanwords**: Loanwords are an important source of vocabulary enrichment. In the internet age, loanwords have become a source of new words that global audiences encounter daily.
- **New Method**: This paper proposes a new method for introducing new words from any foreign language, provided they conform to Quechua phonology, enhancing Quechua's expressive capacity and the identity of its users.
- **Social Acceptance**: Consensus and social acceptance are key to whether new words can survive long-term and enrich the language's vocabulary.
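The phonological filter and the IPA-to-orthography conversion described under Methodology can be illustrated with a minimal sketch. It is not the code in `GettingNeologisms.ipynb`: the phoneme table below is a simplified, assumed Quechua inventory (three vowels, plain stops, no aspirates or ejectives), the function names are hypothetical, and the sample entries are made up for illustration. A word is kept only if its IPA transcription can be split entirely into phonemes from the table, and the kept words are then respelled.

```python
# Assumed, simplified phoneme-to-grapheme table; NOT the paper's actual rule set.
IPA_TO_ORTHOGRAPHY = {
    "a": "a", "i": "i", "u": "u",
    "p": "p", "t": "t", "k": "k", "q": "q",
    "tʃ": "ch", "s": "s", "h": "h",
    "m": "m", "n": "n", "ɲ": "ñ",
    "l": "l", "ʎ": "ll", "ɾ": "r",
    "w": "w", "j": "y",
}

# Characters dropped before matching (spaces, brackets, stress and syllable marks).
# Ignoring stress is an assumption made to keep the sketch short.
_STRIP = str.maketrans("", "", " /[]ˈˌ.")

# Longest symbols first, so that "tʃ" is tried before "t".
_SYMBOLS = sorted(IPA_TO_ORTHOGRAPHY, key=len, reverse=True)


def segment(ipa):
    """Split an IPA string into inventory phonemes, or return None if it cannot be."""
    ipa = ipa.translate(_STRIP)
    segments = []
    pos = 0
    while pos < len(ipa):
        for symbol in _SYMBOLS:
            if ipa.startswith(symbol, pos):
                segments.append(symbol)
                pos += len(symbol)
                break
        else:
            # No inventory phoneme matches here: the word does not "sound native".
            return None
    return segments


def to_orthography(ipa):
    """Respell an IPA string with the assumed table, or return None if it is rejected."""
    segments = segment(ipa)
    if segments is None:
        return None
    return "".join(IPA_TO_ORTHOGRAPHY[s] for s in segments)


def filter_candidates(entries):
    """Keep (word, ipa, respelling) for entries whose IPA fits the inventory."""
    kept = []
    for word, ipa in entries:
        respelled = to_orthography(ipa)
        if respelled is not None:
            kept.append((word, ipa, respelled))
    return kept


if __name__ == "__main__":
    # Hypothetical entries in (spelling, IPA) form, for illustration only.
    sample = [("example1", "tʃ a k i"), ("example2", "b ɔ l")]
    for row in filter_candidates(sample):
        print(row)
```

A check on the phoneme inventory alone is weaker than a full phonotactic test; syllable-structure constraints, for instance, would need additional rules on top of `segment`.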
### Data Availability Statement

The data supporting the findings of this study can be accessed in the following GitHub repositories:

- [GlosaCSV.xlsx](https://github.com/luiscamachocaballero/QuechuaNeologisms)
- [GlosaTSV.xlsx](https://github.com/luiscamachocaballero/QuechuaNeologisms)

These data are sourced from Open Dictionary and Wikipron, both of which are public resources.