Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data.

Fanbo Meng, Zhiyong Wu, Helen Meng, Jia, Lianhong Cai
DOI: https://doi.org/10.21437/interspeech.2012-159
2012-01-01
Abstract: Emphasis is an important form of expressiveness in speech. Hidden Markov model (HMM) based speech synthesis has shown great flexibility in generating expressive speech. This paper proposes a hierarchical model based on HMMs that aims to synthesize emphatic speech with both high emphasis quality and high naturalness from a limited amount of data. Decision trees (DTs) are constructed with non-emphasis-related questions using both neutral and emphasis corpora. The data in each leaf node of the DTs are classified into 6 emphasis categories according to the emphasis-related questions. The data in the same emphasis category are grouped into one sub-node and are used to train one HMM. As some leaf nodes of the DTs might contain no data for certain emphasis categories, a method based on cost calculation is proposed to select a suitable HMM in the same leaf node for predicting parameters. Further, a compensation model is proposed to adjust the predicted parameters. Experiments show that the proposed hierarchical model can synthesize emphatic speech with high quality in both naturalness and emphasis, using a limited amount of training data.
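The fallback step described in the abstract (choosing a substitute HMM within the same leaf node when the requested emphasis category has no data) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not define the cost calculation or name the 6 categories, so `LeafNode`, `category_cost`, `select_model`, the integer category labels, and the index-distance cost are hypothetical placeholders, not the authors' method.

```python
from dataclasses import dataclass, field

# Hypothetical integer labels for the 6 emphasis categories;
# the abstract does not name them.
CATEGORIES = tuple(range(6))


@dataclass
class LeafNode:
    # Maps emphasis category -> trained model. Strings stand in for HMMs.
    sub_nodes: dict = field(default_factory=dict)


def category_cost(wanted: int, available: int) -> int:
    # Placeholder cost (assumption): distance between category indices.
    # The paper's actual cost calculation is not given in the abstract.
    return abs(wanted - available)


def select_model(leaf: LeafNode, wanted: int):
    """Return (category, model) for the requested emphasis category,
    falling back to the lowest-cost category present in this leaf."""
    if not leaf.sub_nodes:
        raise ValueError("leaf node has no trained models")
    if wanted in leaf.sub_nodes:
        return wanted, leaf.sub_nodes[wanted]
    # Ties are broken by the smaller category index.
    best = min(leaf.sub_nodes, key=lambda c: (category_cost(wanted, c), c))
    return best, leaf.sub_nodes[best]


# A leaf that only has training data for categories 0 and 3.
leaf = LeafNode(sub_nodes={0: "hmm_cat0", 3: "hmm_cat3"})
```

With this leaf, a request for category 3 returns its own model, while a request for category 4 falls back to the model of category 3. In the paper, the parameters predicted by the selected HMM are then adjusted by a compensation model, which this sketch does not cover.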