Trainable Unit Selection Speech Synthesis under Statistical Framework

RenHua Wang, LiRong Dai, ZhenHua Ling, Yu Hu
DOI: https://doi.org/10.1007/s11434-009-0267-3
2009-01-01
Chinese Science Bulletin (Chinese Version)
Abstract: This paper proposes a trainable unit selection speech synthesis method based on a statistical modeling framework. At the training stage, acoustic features are extracted from the training database and a statistical model is estimated for each feature. During synthesis, the optimal candidate unit sequence is searched for in the database under a maximum likelihood criterion derived from the trained models. Finally, the waveforms of the optimal candidate units are concatenated to produce synthetic speech. Experimental results show that, compared with the conventional unit selection synthesis method, this method makes system construction more automatic and improves the naturalness of synthetic speech. Furthermore, this paper presents a minimum unit selection error criterion for model training, designed around the characteristics of unit selection speech synthesis, and adopts discriminative training to estimate the model parameters. This criterion ultimately enables fully automatic system construction and further improves the naturalness of synthetic speech.
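The maximum likelihood search described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the features, the model family, or the search algorithm, so everything here is an assumption made for illustration: each candidate unit is reduced to one scalar feature, each target position and the join between adjacent units are scored with a one-dimensional Gaussian, and the best sequence is found with a Viterbi-style dynamic program. The names `gauss_loglik`, `select_units`, `target_models`, and `join_model` are hypothetical and do not come from the paper.

```python
import math


def gauss_loglik(x, mean, var):
    """Log-density of a 1-D Gaussian (assumed model family, not from the paper)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def select_units(candidates, target_models, join_model):
    """Pick one candidate per position so the total log-likelihood is maximal.

    candidates[t]    : scalar features of the candidate units at position t
    target_models[t] : (mean, var) of the assumed target model at position t
    join_model       : (mean, var) of the assumed model for the feature
                       difference between adjacent selected units
    Returns (list of chosen candidate indices, total log-likelihood).
    """
    # score[j]: best log-likelihood of any path ending at candidate j
    score = [gauss_loglik(x, *target_models[0]) for x in candidates[0]]
    backpointers = []
    for t in range(1, len(candidates)):
        new_score, pointers = [], []
        for x in candidates[t]:
            # Best predecessor under the join (concatenation) model
            joined = [
                score[i] + gauss_loglik(x - prev, *join_model)
                for i, prev in enumerate(candidates[t - 1])
            ]
            best_i = max(range(len(joined)), key=joined.__getitem__)
            pointers.append(best_i)
            new_score.append(joined[best_i] + gauss_loglik(x, *target_models[t]))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the last position
    last = max(range(len(score)), key=score.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy data: three positions, two candidate units each (values are made up)
candidates = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.0]]
target_models = [(1.0, 1.0)] * 3   # assumed (mean, var) per position
join_model = (0.0, 1.0)            # assumed (mean, var) of adjacent differences
path, total = select_units(candidates, target_models, join_model)
print(path)  # [0, 0, 0]
```

In a real system the selected indices would point at waveform segments in the database, which are then concatenated. The sketch covers only the maximum likelihood search, not the minimum unit selection error training mentioned at the end of the abstract.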