Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings

Manuel Sam Ribeiro, Giulia Comini, Jaime Lorenzo-Trueba
2023-07-31
Abstract: The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings. Our approach bootstraps a G2P with a small set of annotated examples. The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings with a phonetic representation. Given hypothesized phoneme labels, we learn pronunciation dictionaries for out-of-vocabulary words, and we use those to re-train the G2P system. Results indicate that our approach consistently improves the phone error rate of G2P systems across languages and amount of available data.
Audio and Speech Processing, Computation and Language
What problem does this paper attempt to address?
This paper addresses grapheme-to-phoneme (G2P) conversion, particularly in low-resource scenarios. G2P systems traditionally rely on manually annotated pronunciation dictionaries, which are time-consuming and costly to build. The paper proposes improving G2P models by learning pronunciation examples from speech recordings. The method first trains a base G2P model on a small amount of annotated data, then uses this model to predict pronunciations for a speech corpus in the target language. These predicted pronunciations serve as phone labels for training a multilingual phone recognition system on multilingual speech data, and that system then decodes the target-language audio into phone sequences. From the hypothesized phoneme labels, a pronunciation dictionary is learned for out-of-vocabulary words and used to retrain the G2P system. Experimental results show that the method reduces the phone error rate (PER) across languages and amounts of available data, with particularly large improvements in low-resource scenarios.
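To make the dictionary-learning step and the PER metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: PER is computed as the usual Levenshtein distance over phone sequences divided by the number of reference phones, and `learn_lexicon` uses a hypothetical selection rule (keep the most frequently decoded pronunciation for each out-of-vocabulary word) that the summary above does not specify. All function names are invented for this example.

```python
from collections import Counter, defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a hypothesis phone
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference phones."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


def learn_lexicon(decoded, known_words):
    """Build a pronunciation dictionary for out-of-vocabulary words.

    `decoded` is an iterable of (word, phone_list) pairs taken from the phone
    recognizer's output. Assumption: the most frequent decoded pronunciation
    of each word is kept; the paper may use a different selection criterion.
    """
    counts = defaultdict(Counter)
    for word, phones in decoded:
        if word not in known_words:
            counts[word][tuple(phones)] += 1
    return {w: list(c.most_common(1)[0][0]) for w, c in counts.items()}
```

In such a setup, the entries returned by `learn_lexicon` would be added to the small annotated dictionary before retraining the G2P model, and `phone_error_rate` would compare the retrained model's predictions against held-out reference pronunciations.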