Abstract: In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the generated speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is then iteratively generated from the GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is still the same as the traditional framework, the capability of flexibly modeling acoustic features remains. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method, (2) state-based model selection yields quality improvements at the same level as frame-based model selection, (3) the use of initial parameters generated from the over-fitted speech probability distributions is very effective in further improving speech quality, and (4) the proposed methods for spectral and $F_{0}$ components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.
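The iterative generation step summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation; it assumes a one-dimensional feature with static and delta components, diagonal covariances, per-frame GMMs given as plain arrays, and hard selection of one Gaussian per frame (the single Gaussian component sequence approximation). The function names, array layouts, and the delta window are all hypothetical choices made for illustration.

```python
import numpy as np

def delta_window_matrix(T):
    """Build W (2T x T) mapping static features c to [static, delta] per frame.
    Assumed delta window: central difference inside, one-sided at the edges."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row
        if t == 0:
            W[1, 0], W[1, 1] = -1.0, 1.0       # forward difference
        elif t == T - 1:
            W[2 * t + 1, t - 1], W[2 * t + 1, t] = -1.0, 1.0  # backward difference
        else:
            W[2 * t + 1, t - 1], W[2 * t + 1, t + 1] = -0.5, 0.5
    return W

def ml_generate(means, variances, W):
    """Maximum likelihood static sequence for one Gaussian per frame.
    Solves (W^T P W) c = W^T P mu, with P the diagonal precision matrix."""
    mu = means.reshape(-1)                     # [static_0, delta_0, static_1, ...]
    prec = 1.0 / variances.reshape(-1)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

def select_components(c, gmm_means, gmm_vars, gmm_weights, W):
    """Pick, for each frame, the component that best explains the current parameters."""
    T = len(c)
    o = (W @ c).reshape(T, 2)                  # current [static, delta] per frame
    diff = o[:, None, :] - gmm_means           # shape (T, M, 2)
    loglik = -0.5 * np.sum(diff ** 2 / gmm_vars + np.log(2 * np.pi * gmm_vars), axis=2)
    return np.argmax(loglik + np.log(gmm_weights), axis=1)

def iterative_generation(c_init, gmm_means, gmm_vars, gmm_weights, n_iter=10):
    """Alternate component selection and ML generation until the selection is stable.
    c_init stands in for the initial parameters from the over-fitted distributions."""
    T = len(c_init)
    W = delta_window_matrix(T)
    c, prev_idx = np.asarray(c_init, dtype=float), None
    rows = np.arange(T)
    for _ in range(n_iter):
        idx = select_components(c, gmm_means, gmm_vars, gmm_weights, W)
        if prev_idx is not None and np.array_equal(idx, prev_idx):
            break
        c = ml_generate(gmm_means[rows, idx], gmm_vars[rows, idx], W)
        prev_idx = idx
    return c, idx

# Toy data: 5 frames, 2 components per frame with static means 0 and 1.
T, M = 5, 2
toy_means = np.zeros((T, M, 2))
toy_means[:, 1, 0] = 1.0
toy_vars = np.ones((T, M, 2))
toy_weights = np.full((T, M), 0.5)
c_out, idx_out = iterative_generation(np.full(T, 0.9), toy_means, toy_vars, toy_weights)
```

In this toy setup the initial sequence lies closer to the second component in every frame, so that component is selected throughout and the generated sequence settles on its static mean. A soft, EM-style variant would weight all components by their posterior probabilities instead of taking the argmax.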

Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis
