Synthesizing Expressive Speech to Convey Focus using a Perturbation Model for Computer-Aided Pronunciation Training

Fanbo Meng, H. Meng, Zhiyong Wu, Lianhong Cai
2010-01-01
Abstract: We present a perturbation model that modifies the acoustic features of neutral speech in order to synthesize focus on selected words. In doing so, we can generate expressive speech output that highlights important speech segments to attract the listener’s attention. The ultimate objective is to synthesize corrective feedback in a computer-aided pronunciation training (CAPT) system. This work involves the design and collection of a speech corpus whose text prompts contain focus words. Each prompt is recorded twice: a neutral production followed by an expressive one in which specific words are highlighted with focus. The phones in these recordings are grouped into six classes, based on their relations to the stressed syllables in focus words. Phone boundaries are obtained automatically by forced alignment with an automatic speech recognizer. Acoustic features of the phones, relating to f0, energy and duration, are extracted. Features that have the highest correlation with the phone classes, as well as low variance, are incorporated into the perturbation model. The model is applied to neutral recordings of 20 test sentences. Results from a listening test show that the 13 subjects can identify the focus words with an accuracy of over 98%. The perceived degree of focus in the identified words achieves a mean score of 4.5 on a five-point Likert scale.
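The core idea of the abstract, perturbing the f0, energy and duration of each phone according to its class, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Phone` record, the numbering of the six classes from 0 to 5, the choice of multiplicative scaling, and the factor values in `PERTURBATION`. The paper's actual feature set and perturbation form may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Phone:
    """One force-aligned phone with summary acoustic features (hypothetical layout)."""
    label: str
    phone_class: int  # 0..5: assumed numbering of the six phone classes
    f0_mean: float    # mean fundamental frequency, Hz
    energy: float     # mean energy, arbitrary linear units
    duration: float   # seconds


# Hypothetical (f0, energy, duration) scale factors per phone class.
# These are illustrative placeholders, NOT values reported in the paper.
PERTURBATION = {
    0: (1.00, 1.00, 1.00),  # e.g. phones far from any focus word: unchanged
    1: (1.50, 2.00, 1.25),  # e.g. phones in the stressed syllable of a focus word
    2: (1.20, 1.50, 1.10),
    3: (1.10, 1.20, 1.05),
    4: (0.95, 0.90, 1.00),
    5: (0.90, 0.80, 1.00),
}


def perturb(phones):
    """Return new phones whose features are scaled by their class factors."""
    result = []
    for phone in phones:
        f0_scale, energy_scale, duration_scale = PERTURBATION[phone.phone_class]
        result.append(
            replace(
                phone,
                f0_mean=phone.f0_mean * f0_scale,
                energy=phone.energy * energy_scale,
                duration=phone.duration * duration_scale,
            )
        )
    return result


# Usage: a neutral two-phone fragment where the second phone is in focus.
neutral = [
    Phone("dh", 0, 120.0, 1.0, 0.05),
    Phone("ae", 1, 200.0, 1.0, 0.08),
]
expressive = perturb(neutral)
```

In a real system the perturbed features would then drive a signal-modification step that resynthesizes the waveform; that stage is outside the scope of this sketch.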