Toward Synthesizing Expressive Mandarin Speech

Hongwu Yang, Shuang Li, Lianhong Cai
2005-01-01
Abstract: Research in text-to-speech (TTS) synthesis has emphasized the naturalness of synthesized speech to support a range of applications in Human-Computer Interaction (HCI). Ideal synthetic speech for HCI should not only be pronounced correctly but also convey semantics appropriate to the context of use. "Context" refers to the textual context of the document, the identity of the interlocutors in an interactive conversation, the application scenario, and so on. For example, synthetic speech for news reports may adopt a lucid and smooth character, while sports commentary may call for a more animated one. This paper focuses on expressive text-to-speech synthesis. Expression in speech encompasses many elements; our work focuses on emotional and stylized synthetic speech. Emotion originates from the speaker's psychological and physical state and is realized through spectral and prosodic parameters. Style depends on the semantics of the spoken message and on the conversation scenario, so it can be realized with global prosodic features. Emotion and style are also interdependent. In general, emotion has relatively local effects and its acoustic parameters are more dynamic, while style has relatively global effects and its acoustic parameters are more stable in the speech signal. Emotion and style thus jointly modify the acoustic features of the speech signal for more affective and effective conveyance of the underlying message. A TTS system that can simulate different emotions and styles will therefore make HCI more natural and desirable.