The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA)

Simon Gonzalez
2024-07-22
Abstract: Pronunciation dictionaries are an important component of forced alignment of speech. Their accuracy has a strong effect on the aligned speech data, since they support the mapping between orthographic transcriptions and acoustic signals. In this paper, I present the creation of a comprehensive pronunciation dictionary for Spanish (ESPADA) that can be used with data from most dialectal variants of Spanish. Current dictionaries focus on specific regional variants, whereas the flexible design of this tool allows it to be readily applied to capture the most common phonetic differences across major dialectal variants. I propose improvements to current pronunciation dictionaries, as well as the mapping of other relevant annotations such as morphological and lexical information. In terms of size, it is currently the most complete dictionary, with more than 628,000 entries representing words from 16 countries. All entries come with their corresponding pronunciations, morphological and lexical tagging, and other information relevant to phonetic analysis: stress patterns, phonotactics, IPA transcriptions, and more. The aim is to equip socio-phonetic researchers with a complete open-source tool that enhances dialectal research on Spanish within socio-phonetic frameworks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses key issues in forced alignment for Spanish sociophonetic research. It describes the creation of a comprehensive and flexible pronunciation dictionary (ESPADA) intended to improve the accuracy of phonetic and lexical annotation across dialectal variants. The core issues the paper attempts to solve are:

1. **Standardization and Flexibility of the Pronunciation Dictionary**: Existing pronunciation dictionaries often focus on specific regional variants, whereas ESPADA aims to cover most Spanish dialects. Its flexible design lets it capture the most common phonetic differences between major dialectal variants.
2. **Improving the Accuracy of Forced Alignment**: The paper points out that the accuracy of forced alignment depends strongly on the quality of the pronunciation dictionary. ESPADA aims to improve that accuracy by providing detailed phonetic representations, stress patterns, and phonotactic structures.
3. **Integration of Comprehensive Linguistic Annotations**: In addition to phonetic information, ESPADA includes morphological and lexical annotations, which are uncommon in existing dictionaries. This comprehensive annotation is intended to give sociophonetic researchers a complete open-source tool for dialect studies.
4. **Handling Sociophonetic Variation**: The paper pays special attention to representing sociophonetic variation accurately in the dictionary, such as vowel stress and consonant weakening in different dialects, and proposes measures for improvement.
5. **Adaptability and Extensibility**: The design of ESPADA anticipates future expansion and user customization. Users can select different dialectal features according to their research needs, which adapts the dictionary to the study of various Spanish variants.
6. **Scale and Coverage**: ESPADA is one of the most comprehensive Spanish pronunciation dictionaries to date, containing over 628,000 entries and covering vocabulary from 16 countries, which makes it a powerful resource for sociophonetic research.

By addressing these issues, ESPADA aims to provide a more accurate, comprehensive, and user-friendly tool for sociophonetic research and to support further studies of Spanish dialects.
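The adaptability point above, where users select dialectal features to suit their data, can be illustrated with a minimal Python sketch. This is not ESPADA's actual data format or interface: the `Entry` class, the `DIALECT_RULES` table, and the `pronounce` function are hypothetical names introduced only for illustration, and seseo (/θ/ realized as /s/) and yeísmo (/ʎ/ realized as /ʝ/) are two well-known Spanish dialect features chosen as examples, not features the summary itself names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary entry: a word and its baseline phones."""
    word: str
    phones: tuple  # one IPA symbol per phone, before dialect features apply


# Hypothetical feature table: each selectable dialect feature is a simple
# phone-for-phone substitution. Real dialectal rules may be context-dependent.
DIALECT_RULES = {
    "seseo": {"θ": "s"},   # /θ/ merges into /s/
    "yeismo": {"ʎ": "ʝ"},  # /ʎ/ merges into /ʝ/
}


def pronounce(entry, features=()):
    """Return a space-separated pronunciation with the chosen features applied."""
    phones = list(entry.phones)
    for feature in features:
        mapping = DIALECT_RULES[feature]
        phones = [mapping.get(p, p) for p in phones]
    return " ".join(phones)


zapato = Entry("zapato", ("θ", "a", "p", "a", "t", "o"))
calle = Entry("calle", ("k", "a", "ʎ", "e"))

print(pronounce(zapato))                      # θ a p a t o
print(pronounce(zapato, ["seseo"]))           # s a p a t o
print(pronounce(calle, ["seseo", "yeismo"]))  # k a ʝ e
```

Keeping the baseline phones separate from the feature table means a single entry can yield several dialect-specific pronunciations without duplicating the word list, which is one plausible way to get the kind of flexibility the summary describes.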