Abstract: The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, current research on the English TTS front-end focuses solely on individual modules, neglecting the interdependence between them and resulting in sub-optimal performance for each module. Therefore, this paper proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments have demonstrated that the proposed method achieves state-of-the-art (SOTA) performance in all modules.

What problem does this paper attempt to address?

The paper aims to address key issues in the front-end processing of English Text-to-Speech (TTS) systems. Specifically, it focuses on the three front-end modules: Text Normalization (TN), Prosody Word Prosody Phrase (PWPP) prediction, and Grapheme-to-Phoneme (G2P) conversion. Current research often treats one of these modules in isolation, neglecting their interdependencies, which results in suboptimal performance of each module.

To address this, the authors propose a unified front-end framework that accounts for the dependencies among the TN, PWPP, and G2P modules and optimizes them simultaneously through a shared multi-task model. This design improves the performance of each module and of the front-end system as a whole.

The main contributions of the paper are as follows:

1. **Proposed a systematic English TTS front-end framework**: The framework adopts a shared multi-task model, marking the first time that English TTS front-end tasks have been unified within a single framework.
2. **Enhanced the flexibility of the TN module**: The TN module combines a rule-based and a model-based approach, which handles non-standard vocabulary effectively and allows hot words to be transcribed flexibly.
3. **Utilized the relationships among labels in the PWPP module**: The PWPP module predicts each level of prosodic information independently through a hierarchical sequence labeling structure, making better use of the hierarchical relationships among labels.
4. **Introduced a polyphone task to improve the accuracy of homographs in the G2P module**: By identifying and distinguishing the different pronunciations of homographs, the accuracy of the G2P module is further improved.

Experimental results show that the unified framework achieves state-of-the-art performance in the TN, PWPP, and G2P modules. In the G2P module in particular, the polyphone task lets the system handle homographs more accurately.
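To make the combined rule-based and model-based TN idea from contribution 2 more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `HOT_WORDS` table, the `rule_normalize` and `normalize` functions, the two toy rules, and the precedence order (hot words first, then rules, then a model fallback) are all assumptions made for illustration. The model is represented only by a placeholder callable.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical hot-word table: entries here override rules and the model.
HOT_WORDS: Dict[str, str] = {"GPT-4": "G P T four"}

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]}-{_ONES[ones]}"


def rule_normalize(token: str) -> Optional[str]:
    """Toy rules (assumed): bare 1-2 digit numbers and percentages."""
    if re.fullmatch(r"\d{1,2}", token):
        return spell_number(int(token))
    match = re.fullmatch(r"(\d{1,2})%", token)
    if match:
        return spell_number(int(match.group(1))) + " percent"
    return None  # no rule applies; defer to the model


def normalize(text: str,
              model_fallback: Callable[[str], str] = lambda t: t) -> str:
    """Normalize whitespace-separated tokens.

    Assumed precedence: hot-word table, then rules, then the model.
    `model_fallback` stands in for a learned TN model; by default it
    returns the token unchanged.
    """
    out = []
    for token in text.split():
        if token in HOT_WORDS:
            out.append(HOT_WORDS[token])
            continue
        ruled = rule_normalize(token)
        out.append(ruled if ruled is not None else model_fallback(token))
    return " ".join(out)


print(normalize("GPT-4 scored 95% on 7 tasks"))
```

Keeping the hot-word table as plain data is one way to get the flexibility the summary describes: a new transcription can be added without retraining the model. Whether the paper resolves conflicts between rules and model in this order is not stated in the summary.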

A unified front-end framework for English text-to-speech synthesis

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

Unified Mandarin TTS Front-end Based on Distilled BERT Model

End-to-end Code-switched TTS with Mix of Monolingual Recordings

A Unified Framework for Multilingual Text-to-speech Synthesis with SSML Specification As Interface

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

A Chinese Text-to-Speech System

Scalable Multilingual Frontend for TTS

Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

Into-TTS : Intonation Template Based Prosody Control System

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

High quality Chinese text-to-speech system - BEYOND

Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis

Multilingual context-based pronunciation learning for Text-to-Speech

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding