A unified front-end framework for English text-to-speech synthesis
Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, Yuanyuan Huo, Yuxuan Wang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447144
2024-03-25
Abstract: The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, current research on the English TTS front-end focuses solely on individual modules, neglecting the interdependence between them and resulting in sub-optimal performance for each module. Therefore, this paper proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments have demonstrated that the proposed method achieves state-of-the-art (SOTA) performance in all modules.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the front-end of English text-to-speech (TTS) systems, which turns raw text into the linguistic features a synthesis model needs. The front-end consists of three modules: Text Normalization (TN), Prosody Word Prosody Phrase (PWPP) prediction, and Grapheme-to-Phoneme (G2P) conversion. Existing research usually treats one of these modules in isolation and neglects their interdependencies, which leaves each module performing below its potential.
To address this, the authors propose a unified front-end framework that models the dependencies among the TN, PWPP, and G2P modules and optimizes all three jointly through a shared multi-task model. According to the paper, this design improves the performance of every individual module.
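The idea of a shared multi-task model can be illustrated with a toy sketch. This is not the authors' architecture: `shared_encoder`, the three `*_head` functions, the feature names, and the label sets are all hypothetical stand-ins chosen for illustration. The point it shows is structural: the input is encoded once, and every task head reads the same representation.

```python
from typing import Dict, List

Features = List[Dict[str, object]]


def shared_encoder(tokens: List[str]) -> Features:
    """Hypothetical stand-in for a learned encoder: one feature record per token."""
    return [
        {
            "is_digit": tok.isdecimal(),
            "ends_clause": tok.endswith((",", ".", "?", "!")),
            "lower": tok.strip(",.?!").lower(),
        }
        for tok in tokens
    ]


def tn_head(feats: Features) -> List[str]:
    # Toy TN head: flag non-standard words (here, only digit strings).
    return ["NSW" if f["is_digit"] else "O" for f in feats]


def pwpp_head(feats: Features) -> List[str]:
    # Toy PWPP head: a phrase boundary after punctuation, a word boundary otherwise.
    return ["PPH" if f["ends_clause"] else "PW" for f in feats]


def g2p_head(feats: Features) -> List[str]:
    # Toy G2P head: flag tokens that need homograph disambiguation.
    homographs = {"read", "lead", "live"}  # illustrative list only
    return ["POLY" if f["lower"] in homographs else "O" for f in feats]


def run_front_end(text: str) -> Dict[str, List[str]]:
    feats = shared_encoder(text.split())  # computed once, shared by all heads
    return {"tn": tn_head(feats), "pwpp": pwpp_head(feats), "g2p": g2p_head(feats)}


print(run_front_end("I read 3 books, today."))
```

In a trained system the encoder and heads would typically be neural networks optimized together with a combined loss; that training loop is omitted here, and the summary does not specify how the paper weights the individual task losses.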
The main contributions of the paper are as follows:
1. **Proposed a systematic English TTS front-end framework**: This framework adopts a shared multi-task model, marking the first time that English TTS front-end tasks have been unified within a single framework.
2. **Enhanced the flexibility of the TN module**: The TN module employs a combined rule-based and model-based approach, which effectively handles non-standard vocabulary and flexibly transcribes hot words.
3. **Utilized the relationships among labels in the PWPP module**: The PWPP module independently predicts different levels of prosodic information through a hierarchical sequence labeling structure, better leveraging the hierarchical relationships among labels.
4. **Introduced a polyphone task to improve homograph pronunciation in the G2P module**: By identifying homographs and distinguishing their different pronunciations, the polyphone task further improves the accuracy of the G2P module.
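The combined rule-based and model-based TN approach in contribution 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `classify` is a hand-written stand-in for a learned classifier, and `HOT_WORDS`, the category names, and the context cues are hypothetical. The sketch assumes a common division of labor in which a model decides the category of a non-standard word and deterministic rules produce the spoken form.

```python
from typing import Dict, Optional

# Hypothetical hot-word table: user-supplied transcriptions that override everything else.
HOT_WORDS: Dict[str, str] = {"TTS": "text to speech"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Rule: verbalize 0..99 as a cardinal number."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def digits(s: str) -> str:
    """Rule: read a digit string one digit at a time."""
    return " ".join(ONES[int(c)] for c in s)


def classify(token: str, prev: Optional[str]) -> str:
    """Stand-in for a learned classifier that picks a category from context."""
    if not token.isdecimal():
        return "PLAIN"
    if prev is not None and prev.lower() in {"room", "flight", "code"}:
        return "DIGITS"
    return "CARDINAL" if int(token) < 100 else "DIGITS"


def normalize(text: str, hot_words: Dict[str, str] = HOT_WORDS) -> str:
    out = []
    prev: Optional[str] = None
    for tok in text.split():
        if tok in hot_words:              # hot words bypass the classifier
            out.append(hot_words[tok])
        else:
            category = classify(tok, prev)
            if category == "CARDINAL":
                out.append(cardinal(int(tok)))
            elif category == "DIGITS":
                out.append(digits(tok))
            else:
                out.append(tok)
        prev = tok
    return " ".join(out)


print(normalize("Room 42 has 42 TTS demos"))
# Room four two has forty-two text to speech demos
```

Keeping the verbalization in rules means the spoken form of a number is always well-formed, while the classifier handles the context-dependent choice between readings. The hot-word table can be edited without retraining anything.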
Experimental results show that the unified framework achieves state-of-the-art performance in the TN, PWPP, and G2P modules. In the G2P module in particular, the polyphone task enables the system to pronounce homographs more accurately.
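To make the homograph problem concrete, the sketch below shows why a plain lexicon lookup is not enough and where a polyphone classifier fits in. It is an illustration only: `pick_sense` is a hand-written stand-in for the learned polyphone task, the cue words are hypothetical, and the tiny lexicon uses ARPAbet transcriptions in the style of the CMU Pronouncing Dictionary.

```python
from typing import Dict, List

# Words with a single pronunciation: a direct lookup suffices.
LEXICON: Dict[str, str] = {
    "i": "AY1",
    "will": "W IH1 L",
    "the": "DH AH0",
    "book": "B UH1 K",
    "yesterday": "Y EH1 S T ER0 D EY2",
}

# Homographs: the spelling alone does not determine the pronunciation.
POLYPHONES: Dict[str, Dict[str, str]] = {
    "read": {"present": "R IY1 D", "past": "R EH1 D"},
}

PAST_CUES = {"yesterday", "had", "already"}  # hypothetical context cues


def pick_sense(word: str, context: List[str]) -> str:
    """Stand-in for a learned polyphone classifier: choose a sense from context."""
    return "past" if PAST_CUES & set(context) else "present"


def g2p(sentence: str) -> List[str]:
    words = [w.lower() for w in sentence.split()]
    phonemes = []
    for word in words:
        if word in POLYPHONES:
            sense = pick_sense(word, words)
            phonemes.append(POLYPHONES[word][sense])
        else:
            # A real system would fall back to a G2P model for unknown words.
            phonemes.append(LEXICON.get(word, "<unk>"))
    return phonemes


print(g2p("I read the book yesterday"))
print(g2p("I will read the book"))
```

The two calls produce different phonemes for "read" even though the spelling is identical, which is the distinction a dedicated polyphone task is meant to learn from data rather than from hand-written cues.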