Autosegmental Neural Nets 2.0: An Extensive Study of Training Synchronous and Asynchronous Phones and Tones for Under-Resourced Tonal Languages

Jialu Li, Mark Hasegawa-Johnson
DOI: https://doi.org/10.1109/taslp.2022.3178238
2022-01-01
Abstract: Phones, the segmental units in the International Phonetic Alphabet (IPA), include isolated consonants and vowels; tones, the suprasegmental units, represent pitch and voice quality movements that may span many phones. The timings of tones and phones are loosely connected: depending on the language, a tone may be synchronized with its associated vowel, with a syllable final, or with a sequence of two or three syllables. Many past studies have investigated cross-lingual adaptation in an automatic speech recognition (ASR) tone-marked phone model, yet very few have studied the interaction between cross-lingual adaptation and tone-phone synchronization. In this study, we train multilingually on four tonal languages and test cross-lingually on the fifth, in a five-fold cross-validation framework, using four CTC-based systems that impose different degrees of synchronization between tones and phones. We find that multilingual and cross-lingual training benefit from different training architectures. In multilingual training, when a large corpus of test-language training data is part of the training corpus, a system that requires synchronization of tones with phones produces significantly lower tone error rates than any of the systems that score tones and phones asynchronously. In cross-lingual training, however, when only limited adaptation data are available in the test language, jointly training synchronous tone-marked phones together with asynchronous phones and tones, as three separate system outputs jointly optimized in a multi-task learning framework, consistently and significantly outperforms the system that requires tone-phone synchrony.
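To make the three-output setup concrete, the following minimal sketch shows one way the targets could be derived and the losses combined. It is an illustration under stated assumptions, not the authors' implementation: the label format (a phone symbol followed by optional tone digits, e.g. "a55"), the helper names split_targets and multitask_loss, and the equal default weights are all hypothetical. The three loss values stand in for per-output CTC losses computed elsewhere.

```python
import re

# Assumed (hypothetical) label format: a phone symbol optionally followed by
# tone digits, e.g. "a55" is the phone "a" carrying tone "55"; "m" has no tone.
_LABEL = re.compile(r"^(?P<phone>[^\d]+)(?P<tone>\d*)$")


def split_targets(tone_marked):
    """Derive asynchronous phone and tone sequences from tone-marked phones.

    The tone-marked sequence is the synchronous target; the two returned
    sequences are the asynchronous targets, each scored independently.
    """
    phones, tones = [], []
    for label in tone_marked:
        match = _LABEL.match(label)
        if match is None:
            raise ValueError(f"unrecognized label: {label!r}")
        phones.append(match.group("phone"))
        if match.group("tone"):  # toneless phones add nothing to the tone tier
            tones.append(match.group("tone"))
    return phones, tones


def multitask_loss(loss_sync, loss_phone, loss_tone, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-output losses (weights are an assumption)."""
    w_sync, w_phone, w_tone = weights
    return w_sync * loss_sync + w_phone * loss_phone + w_tone * loss_tone


if __name__ == "__main__":
    sync_target = ["m", "a55", "n", "i21"]
    phone_target, tone_target = split_targets(sync_target)
    print(phone_target, tone_target)
    print(multitask_loss(1.0, 2.0, 3.0))
```

In this sketch the tone tier is shorter than the phone tier, which is what lets an alignment-free criterion such as CTC place tones without tying them to specific phones.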
engineering, electrical & electronic,acoustics