Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Zhiyong Wu, Lianhong Cai, Helen M. Meng
DOI: https://doi.org/10.1109/iscslp.2010.5684494
2010-01-01
Abstract: This paper proposes an approach for modeling the prosody patterns of acoustic features for Chinese expressive text-to-speech (TTS) synthesis. Based on the observation that a speaker usually tends to put more emphasis on one particular syllable within a multi-syllabic prosodic word, we identify this syllable as the core syllable, which can be derived from the semantic stress and tone information of the text prompt. We then classify the syllables in speech into four classes, based on their relation to the core syllable in a prosodic word. We analyze contrastive (neutral versus expressive) speech recordings for each of the four classes, and develop a perturbation model that takes the prosody pattern into account to transform neutral speech into expressive speech. Perceptual experiments on both neutral speech recordings and neutral TTS outputs, involving 13 subjects, indicate that the proposed approach can significantly enhance the expressivity of the synthesized speech.
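The pipeline described in the abstract (locate the core syllable, classify the remaining syllables relative to it, then perturb neutral acoustic features per class) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' model: the abstract does not define the four classes or the perturbation parameters, so the class names, the `Syllable` fields, and the scale factors in `SCALE` are hypothetical placeholders, and the core syllable index is taken as given rather than derived from semantic stress and tone.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical class labels. The abstract states only that there are four
# classes defined by a syllable's relation to the core syllable.
CORE = "core"
BEFORE_CORE = "before_core"
AFTER_CORE = "after_core"
OTHER = "other"  # assumed here: the syllable of a monosyllabic word


@dataclass
class Syllable:
    f0_hz: float        # mean fundamental frequency of the syllable
    duration_ms: float  # syllable duration


# Placeholder (F0 scale, duration scale) per class. These numbers are
# invented for illustration and do not come from the paper.
SCALE: Dict[str, Tuple[float, float]] = {
    CORE: (1.20, 1.15),
    BEFORE_CORE: (1.05, 1.00),
    AFTER_CORE: (0.95, 1.00),
    OTHER: (1.00, 1.00),
}


def classify(word_len: int, index: int, core_index: int) -> str:
    """Assign a syllable to one of the four assumed classes."""
    if word_len == 1:
        return OTHER
    if index == core_index:
        return CORE
    return BEFORE_CORE if index < core_index else AFTER_CORE


def to_expressive(word: List[Syllable], core_index: int) -> List[Syllable]:
    """Scale neutral features class by class (a simple multiplicative perturbation)."""
    result = []
    for i, syl in enumerate(word):
        f0_scale, dur_scale = SCALE[classify(len(word), i, core_index)]
        result.append(Syllable(syl.f0_hz * f0_scale, syl.duration_ms * dur_scale))
    return result


# Usage: a three-syllable prosodic word whose second syllable is the core.
neutral = [Syllable(200.0, 180.0), Syllable(210.0, 200.0), Syllable(190.0, 170.0)]
expressive = to_expressive(neutral, core_index=1)
```

In a real system the per-class parameters would be estimated from the contrastive neutral and expressive recordings rather than fixed by hand.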