Abstract: In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the generated speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is then iteratively generated from the GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is still the same as the traditional framework, the capability of flexibly modeling acoustic features is retained. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method, (2) state-based model selection yields quality improvements at the same level as frame-based model selection, (3) the use of initial parameters generated from the over-fitted speech probability distributions is very effective in further improving speech quality, and (4) the proposed methods for the spectral and $F_{0}$ components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.
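The iterative generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it is a simplified illustration under stated assumptions: a one-dimensional feature with static and delta components, diagonal covariances, frame-based selection of a single Gaussian component, and a per-frame candidate set of GMM components. All function names, array shapes, and the delta window are hypothetical choices made for this example.

```python
import numpy as np

def build_window_matrix(T):
    """W (2T x T) maps a static sequence c to interleaved [static, delta] rows.
    Assumed delta window: 0.5 * (c[t+1] - c[t-1]), with edge frames replicated."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    return W

def mlpg(means, variances, W):
    """Maximum likelihood solution of (W' P W) c = W' P mu, diagonal precision P.
    means, variances: (T, 2) arrays of [static, delta] statistics per frame."""
    prec = 1.0 / variances.reshape(-1)
    mu = means.reshape(-1)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

def select_components(c, W, gmm_means, gmm_vars, gmm_weights):
    """Per frame, pick the component with the highest weighted log-likelihood
    of the current [static, delta] observation (single-Gaussian approximation).
    gmm_means, gmm_vars: (T, M, 2); gmm_weights: (T, M)."""
    o = (W @ c).reshape(-1, 2)
    diff = o[:, None, :] - gmm_means
    loglik = -0.5 * np.sum(diff ** 2 / gmm_vars + np.log(2 * np.pi * gmm_vars), axis=2)
    loglik = loglik + np.log(gmm_weights)
    return np.argmax(loglik, axis=1)

def generate(c_init, gmm_means, gmm_vars, gmm_weights, n_iter=10):
    """Alternate component selection and ML parameter generation,
    starting from c_init, until the selected components stop changing."""
    c = np.asarray(c_init, dtype=float)
    T = len(c)
    W = build_window_matrix(T)
    frames = np.arange(T)
    m = prev = None
    for _ in range(n_iter):
        m = select_components(c, W, gmm_means, gmm_vars, gmm_weights)
        if prev is not None and np.array_equal(m, prev):
            break
        c = mlpg(gmm_means[frames, m], gmm_vars[frames, m], W)
        prev = m
    return c, m
```

In this sketch the initial sequence `c_init` plays the role of the parameters generated from the over-fitted distributions: because each iteration selects the component closest to the current trajectory, the starting point determines which components the procedure settles on.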

Improved Prosody Generation by Maximizing Joint Probability of State and Longer Units

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Learning Prosodic Patterns for Mandarin Speech Synthesis

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

HMM Based TTS for Mixed Language Text

Prosody Model for Mandarin Text-to-Speech System

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

Prosodic Correlation Model in Text-to-Speech Synthesis

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis