Phonetic Segmentation of the UCLA Phonetics Lab Archive

Eleanor Chodroff, Blaž Pažon, Annie Baker, Steven Moran
2024-03-28
Abstract: Research in speech technologies and comparative linguistics depends on access to diverse and accessible speech data. The UCLA Phonetics Lab Archive is one of the earliest multilingual speech corpora, with long-form audio recordings and phonetic transcriptions for 314 languages (Ladefoged et al., 2009). Recently, 95 of these languages were time-aligned with word-level phonetic transcriptions (Li et al., 2021). Here we present VoxAngeles, a corpus of audited phonetic transcriptions and phone-level alignments of the UCLA Phonetics Lab Archive, which uses the 95-language CMU re-release as our starting point. VoxAngeles also includes word- and phone-level segmentations from the original UCLA corpus, as well as phonetic measurements of word and phone durations, vowel formants, and vowel f0. This corpus enhances the usability of the original data, particularly for quantitative phonetic typology, as demonstrated through a case study of vowel intrinsic f0. We also discuss the utility of the VoxAngeles corpus for general research and pedagogy in crosslinguistic phonetics, as well as for low-resource and multilingual speech technologies. VoxAngeles is free to download and use under a CC-BY-NC 4.0 license.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limited availability of diverse, readily usable cross-linguistic speech data, a constraint felt particularly in low-resource and multilingual speech technologies. Specifically, it introduces the VoxAngeles corpus, a manually corrected and reviewed, phone-level aligned version of the UCLA Phonetics Lab Archive. The authors aim to enhance the usability of the original data, especially for quantitative phonetic typology, and demonstrate the utility of the corpus through a case study on vowel intrinsic f0.

### Main Objectives:

1. **Improve Data Availability**: Provide time-aligned transcriptions and phone-level segmentation so that the original data is more accessible for research and applications.
2. **Enhance Data Quality**: Ensure the accuracy and consistency of the data through manual correction and review.
3. **Support Cross-Linguistic Research**: Provide a high-quality, multi-language speech dataset to support cross-linguistic speech technology and phonetic research.
4. **Promote Education and Teaching**: Offer resources for research and teaching in cross-linguistic phonetics.

### Specific Issues:

- **Inconsistent Data Format**: The original data lacks time alignment and phone-level segmentation, which limits its direct use in many studies.
- **Variable Data Quality**: Data quality varies significantly across languages, requiring manual correction and review.
- **Inconsistent Symbols**: The phonetic symbols used in the original transcriptions are inconsistent and need standardization.
- **Audio Quality Issues**: Some recordings are noisy or incomplete, requiring filtering and processing.

By addressing these issues, the VoxAngeles corpus aims to give researchers and educators a more reliable and user-friendly multilingual speech dataset.
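To make the kind of analysis that phone-level alignment enables more concrete, the following is a minimal sketch in Python. It assumes a hypothetical in-memory record format of `(phone, start_s, end_s, mean_f0_hz)`; this is not the actual VoxAngeles file layout, and the numbers are invented for illustration rather than taken from the corpus. It derives phone durations from interval boundaries and computes a simple intrinsic f0 contrast, the mean f0 of high vowels minus the mean f0 of low vowels. The function names and the vowel sets are likewise assumptions, and this is not presented as the authors' method.

```python
from statistics import mean

# Hypothetical phone-level records: (phone, start_s, end_s, mean_f0_hz).
# Illustrative values only; not taken from VoxAngeles.
PHONES = [
    ("i", 0.10, 0.22, 132.0),
    ("a", 0.30, 0.45, 121.0),
    ("u", 0.50, 0.60, 130.0),
    ("a", 0.70, 0.82, 119.0),
    ("t", 0.82, 0.90, None),  # voiceless consonant: no f0 measurement
]

# Assumed vowel-height classes for this toy example.
HIGH_VOWELS = {"i", "u"}
LOW_VOWELS = {"a"}


def phone_durations(phones):
    """Return (phone, duration in seconds) for each aligned interval."""
    return [(phone, round(end - start, 3)) for phone, start, end, _ in phones]


def intrinsic_f0_difference(phones):
    """Return mean f0 of high vowels minus mean f0 of low vowels, in Hz."""
    high = [f0 for p, _, _, f0 in phones if p in HIGH_VOWELS and f0 is not None]
    low = [f0 for p, _, _, f0 in phones if p in LOW_VOWELS and f0 is not None]
    if not high or not low:
        raise ValueError("need at least one high and one low vowel with f0")
    return mean(high) - mean(low)


if __name__ == "__main__":
    for phone, duration in phone_durations(PHONES):
        print(f"{phone}\t{duration:.3f} s")
    print(f"high - low f0: {intrinsic_f0_difference(PHONES):.1f} Hz")
```

A positive difference would be consistent with the general intrinsic f0 pattern, in which high vowels tend to have higher f0 than low vowels. A real analysis would also need to control for speaker, tone, and prosodic context, which this sketch ignores.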