Multilingual context-based pronunciation learning for Text-to-Speech

Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba
2023-07-31
Abstract: Phonetic information and linguistic knowledge are essential components of a Text-to-speech (TTS) front-end. Given a language, a lexicon can be collected offline and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation for out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task, typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses pronunciation-related tasks in multilingual Text-to-Speech (TTS) systems. The authors propose a unified multilingual front-end that handles tasks traditionally assigned to separate TTS front-end modules: Grapheme-to-Phoneme (G2P) conversion, homograph disambiguation, polyphone disambiguation, post-lexical pronunciation rules, and implicit diacritization in Arabic. The model is based on the Transformer architecture and can be trained on multiple languages and dialects, with performance comparable to, and in some cases better than, monolingual solutions across tasks. The paper also notes a trade-off between multilingual and monolingual models for certain tasks and languages, which depends on factors such as task characteristics, language differences, and data quality.
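For context, the conventional modular front-end that the abstract contrasts with (lexicon lookup, a G2P fallback for OOV words, then post-lexical rules) might be sketched as follows. This is a toy illustration and not the paper's model: the lexicon entries, the letter-to-phoneme table standing in for a trained G2P model, the ARPAbet-style symbols, and the single "the before a vowel" rule are all assumptions chosen for the example.

```python
from typing import Dict, List

# Hypothetical toy lexicon (collected "offline" in a real system).
LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "apple": ["AE", "P", "AH", "L"],
}

# Hypothetical letter-to-phoneme table standing in for a trained G2P model.
LETTER_TO_PHONE: Dict[str, str] = {
    "a": "AE", "b": "B", "e": "EH", "l": "L", "p": "P", "t": "T",
}

VOWELS = {"AE", "AH", "EH", "IY"}


def toy_g2p(word: str) -> List[str]:
    """Fallback for OOV words: map each known letter to one phoneme."""
    return [LETTER_TO_PHONE[c] for c in word if c in LETTER_TO_PHONE]


def pronounce(words: List[str]) -> List[List[str]]:
    """Lexicon lookup, G2P fallback, then one post-lexical rule."""
    # Copy lexicon entries so the rule below never mutates the lexicon.
    prons = [list(LEXICON.get(w) or toy_g2p(w)) for w in words]
    # Post-lexical rule across a word boundary:
    # "the" is realised as DH IY when the next word starts with a vowel.
    for i in range(len(words) - 1):
        nxt = prons[i + 1]
        if words[i] == "the" and nxt and nxt[0] in VOWELS:
            prons[i] = ["DH", "IY"]
    return prons


print(pronounce(["the", "apple"]))
print(pronounce(["the", "bat"]))  # "bat" is OOV and goes through toy_g2p
```

Each stage here is a separate component with its own data and logic. The unified system described above would instead map a word in its sentence context directly to a pronunciation with a single model.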