Modeling the Acoustic Correlates of Dialog Act for Expressive Chinese TTS Synthesis

Hongwu Yang, Helen M. Meng, Lianhong Cai
DOI: https://doi.org/10.1049/cp:20080758
2008-01-01
Abstract: This paper proposes a novel approach for describing the expressivity of dialog text and modelling its acoustic correlates for expressive text-to-speech (TTS) synthesis. We use Dialog Acts (DAs) to describe expressivity. In particular, we set up a Wizard-of-Oz (WoZ) data collection framework to collect a tourism-domain corpus and annotated it with DAs. A Pitch Target model, which is optimized to describe Mandarin F0 contours, was introduced to model the pitch contour of Mandarin syllables. A Generalized Regression Neural Network (GRNN) based model was then developed that transforms the acoustic features of neutral speech (parameters of the pitch target model, duration, energy and pauses) to resemble expressive speech, according to the DA of the input text. Perceptual evaluation of the modified speech outputs shows that over 63% of the utterances carry appropriate expressivity. An expressive Mean Opinion Score evaluation also demonstrated that the modified speech improved on the expressivity of the neutral speech.
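The GRNN-based transformation mentioned in the abstract can be illustrated with a minimal sketch. A GRNN, in its standard form, predicts an output as a Gaussian-kernel weighted average of the training targets. The code below is not the authors' implementation: the function name `grnn_predict`, the smoothing parameter `sigma`, the two-dimensional feature layout, and all numeric values are assumptions made up for illustration, and the abstract does not state how the DA is fed to the model.

```python
import math


def grnn_predict(x, train_inputs, train_targets, sigma=0.5):
    """Standard GRNN prediction: a Gaussian-kernel weighted average of the
    training targets, weighted by distance from x to each training input."""
    weights = []
    for xi in train_inputs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed to zero: fall back to the nearest sample.
        nearest = min(
            range(len(train_inputs)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_inputs[i])),
        )
        return list(train_targets[nearest])
    dims = len(train_targets[0])
    return [
        sum(w * t[d] for w, t in zip(weights, train_targets)) / total
        for d in range(dims)
    ]


# Hypothetical toy data (not from the paper): each input is a normalized
# neutral-speech feature pair, each target the matching expressive pair.
neutral = [[0.0, 0.0], [1.0, 1.0]]
expressive = [[1.0, 2.0], [3.0, 4.0]]

# A query halfway between the two training inputs gets equal kernel weights,
# so the prediction is the average of the two targets.
midpoint = grnn_predict([0.5, 0.5], neutral, expressive)
```

One plausible design, again an assumption rather than something the abstract confirms, is to keep a separate set of training pairs per DA, so that the DA of the input text selects which mapping is applied to the neutral features.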