Modeling the Acoustic Correlates of Expre Expressive Text-to-Spe

Hongwu Yang, Helen M. Me, Lianhong Cai
2006-01-01
Abstract: This paper proposes a novel approach for describing the expressive elements in text genres and modeling their acoustic correlates for expressive text-to-speech synthesis (TTS). We apply the three-dimensional PAD (pleasure-displeasure, arousal-nonarousal, and dominance-submissiveness) model to describe expressivity. In particular, we define a set of principles for annotating the P and A values of prosodic words found in texts from the tourist information domain. These text passages fall into three genres: descriptive (e.g., describing a beautiful scenic spot), informative (e.g., presenting the opening hours of a museum), and procedural (e.g., offering bus routes to a landmark). We choose the prosodic word as the basic unit of analysis because it bridges textual input with (synthetic) speech output. Analysis of contrastive (neutral versus expressive) recordings uncovers the acoustic correlates of the annotated P and A values. This enables us to develop a non-linear model that transforms neutral speech to resemble expressive speech according to the P and A values of the input text. Perceptual evaluation of the speech outputs shows that over 70% of the prosodic words carry appropriate expressivity.