Abstract: Research in speech technologies and comparative linguistics depends on access to diverse and accessible speech data. The UCLA Phonetics Lab Archive is one of the earliest multilingual speech corpora, with long-form audio recordings and phonetic transcriptions for 314 languages (Ladefoged et al., 2009). Recently, 95 of these languages were time-aligned with word-level phonetic transcriptions (Li et al., 2021). Here we present VoxAngeles, a corpus of audited phonetic transcriptions and phone-level alignments of the UCLA Phonetics Lab Archive, which uses the 95-language CMU re-release as our starting point. VoxAngeles also includes word- and phone-level segmentations from the original UCLA corpus, as well as phonetic measurements of word and phone durations, vowel formants, and vowel f0. This corpus enhances the usability of the original data, particularly for quantitative phonetic typology, as demonstrated through a case study of vowel intrinsic f0. We also discuss the utility of the VoxAngeles corpus for general research and pedagogy in crosslinguistic phonetics, as well as for low-resource and multilingual speech technologies. VoxAngeles is free to download and use under a CC-BY-NC 4.0 license.
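To make the measurements named in the abstract concrete, here is a minimal Python sketch of how phone durations and a vowel intrinsic f0 comparison could be computed from phone-level records. Intrinsic f0 refers to the tendency for high vowels to have higher f0 than low vowels. The record layout, the language labels, the vowel height classes, and every number below are assumptions invented for illustration; they are not the VoxAngeles file format or its measurements, and this is not the authors' analysis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical phone-level records: (language, phone, start_s, end_s, f0_hz).
# All values are invented placeholders, NOT measurements from VoxAngeles.
PHONES = [
    ("lang_a", "i", 0.10, 0.20, 210.0),
    ("lang_a", "a", 0.30, 0.45, 190.0),
    ("lang_a", "u", 0.50, 0.58, 205.0),
    ("lang_b", "i", 0.05, 0.15, 130.0),
    ("lang_b", "a", 0.20, 0.38, 120.0),
]

# Assumed vowel height classes, used only for this sketch.
HIGH_VOWELS = {"i", "u"}
LOW_VOWELS = {"a"}


def phone_durations(records):
    """Return each phone's duration in seconds (end minus start), rounded to ms."""
    return [round(end - start, 3) for _lang, _phone, start, end, _f0 in records]


def intrinsic_f0_by_language(records):
    """Return {language: mean f0 of high vowels minus mean f0 of low vowels}."""
    grouped = defaultdict(lambda: {"high": [], "low": []})
    for language, phone, _start, _end, f0 in records:
        if phone in HIGH_VOWELS:
            grouped[language]["high"].append(f0)
        elif phone in LOW_VOWELS:
            grouped[language]["low"].append(f0)
    # Keep only languages that have at least one vowel in each height class.
    return {
        language: mean(values["high"]) - mean(values["low"])
        for language, values in grouped.items()
        if values["high"] and values["low"]
    }


if __name__ == "__main__":
    print(phone_durations(PHONES))
    print(intrinsic_f0_by_language(PHONES))
```

A positive difference for a language would be in line with the intrinsic f0 pattern. A real analysis would also need to account for speaker, tone, and segmental context, which this sketch ignores.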

What problem does this paper attempt to address?

The main problem this paper addresses is the limited availability and diversity of cross-linguistic speech data, particularly for low-resource and multilingual speech technologies. Specifically, the paper introduces the VoxAngeles corpus, a manually corrected and reviewed, phone-level aligned version of the UCLA Phonetics Lab Archive. With this work, the authors aim to enhance the usability of the original data, especially for quantitative phonetic typology, and they demonstrate the utility of the corpus through a case study on vowel intrinsic f0.

### Main Objectives:

1. **Improve Data Availability**: Provide time-aligned transcriptions and phone-level segmentation so that the original data is easier to use across research and applications.
2. **Enhance Data Quality**: Ensure the accuracy and consistency of the data through manual correction and review.
3. **Support Cross-Linguistic Research**: Provide a high-quality, multi-language speech dataset to support cross-linguistic speech technology and phonetic research.
4. **Promote Education and Teaching**: Offer resources for research and teaching in cross-linguistic phonetics.

### Specific Issues:

- **Inconsistent Data Format**: The original data lacks time alignment and phone-level segmentation, limiting its direct use in many studies.
- **Variable Data Quality**: The quality of data varies significantly across languages, requiring manual correction and review.
- **Inconsistent Symbols**: The phonetic symbols used in the original transcriptions are inconsistent and need standardization.
- **Audio Quality Issues**: Some recordings have noise interference or are incomplete, requiring filtering and processing.

By addressing these issues, the VoxAngeles corpus aims to provide researchers and educators with a more reliable and user-friendly multilingual speech dataset.

Phonetic Segmentation of the UCLA Phonetics Lab Archive
