Word-Level Emphasis Modelling in HMM-Based Speech Synthesis

K. Yu, F. Mairesse, S. Young
DOI: https://doi.org/10.1109/icassp.2010.5495690
2010-01-01
Abstract: Expressive speech synthesis has recently attracted great interest. Word-level emphasis is an important form of expressiveness to distinguish between what is the focus of the utterance and what the computer system expects to be known by the user. Previous work on emphasis synthesis requires emphatic data collected specifically for that task. In this paper, a statistical approach that models and extracts word-level emphasis patterns from natural speech is investigated within the HMM-based speech synthesis framework. Compared to emphatic speech collected specifically for this task, the cues of emphasis in natural speech are weaker and heavily affected by various suprasegmental features. Two new decision tree clustering approaches, two-pass and factorized decision tree, are proposed to effectively address this problem. Experiments show that both approaches can convey emphasis significantly better than traditional decision tree clustering and HMM adaptation. While the two-pass decision tree approach outperformed the factorized decision tree approach in an emphasis synthesis test, the latter led to significantly better naturalness and hence achieved a better overall balance.