Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis

Mu Wang, Zhiyong Wu, Xixin Wu, Helen M. Meng, Shiyin Kang, Jia Jia, Lianhong Cai
DOI: https://doi.org/10.1109/ACIIAsia.2018.8470334
2018-01-01
Abstract: End-to-end text-to-speech (E2E TTS) synthesis has achieved great success. This work investigates emphatic speech synthesis and control mechanisms in the E2E framework and proposes an E2E-based method for transferring emphasis characteristics between speakers. The characteristic differences between emphatic and neutral speech are learned from a small-scale corpus of parallel neutral and emphatic utterances recorded by one speaker, and are then transferred to another speaker so that emphatic speech can be generated in the latter speaker's voice. An emphasis embedding is injected into the encoder of the extended E2E TTS model to capture these differences, while the decoder and attention module decode them into synthetic neutral or emphatic speech. Speaker codes linked to the decoder and attention module give the E2E model the ability to transfer characteristics between speakers. To control the emphatic strength, an encoder memory manipulation mechanism is proposed. Experimental results indicate the effectiveness of the proposed model.
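The abstract does not say how the encoder memory is manipulated. As a rough illustration only, one plausible reading is that the encoder output for the emphatic condition is blended with the encoder output for the neutral condition, with a scalar setting the strength. The sketch below assumes exactly that; the function name `scale_emphasis`, the `strength` parameter, and the linear blending rule are hypothetical and are not taken from the paper.

```python
def scale_emphasis(neutral_memory, emphatic_memory, strength):
    """Blend two encoder memories frame by frame.

    ASSUMPTION: linear interpolation is an illustrative guess, not the
    paper's documented mechanism. Each memory is a list of frames, and
    each frame is a list of floats of the same dimension.

    strength = 0.0 returns the neutral memory, 1.0 returns the emphatic
    memory, and values above 1.0 extrapolate past the emphatic memory.
    """
    if len(neutral_memory) != len(emphatic_memory):
        raise ValueError("memories must have the same number of frames")

    blended = []
    for n_frame, e_frame in zip(neutral_memory, emphatic_memory):
        if len(n_frame) != len(e_frame):
            raise ValueError("frames must have the same dimension")
        # Move each dimension from neutral toward emphatic by `strength`.
        blended.append([n + strength * (e - n) for n, e in zip(n_frame, e_frame)])
    return blended


# Toy encoder memories: 2 frames, 2 dimensions each (made-up values).
neutral = [[0.0, 1.0], [2.0, 2.0]]
emphatic = [[1.0, 3.0], [2.0, 4.0]]
half_strength = scale_emphasis(neutral, emphatic, 0.5)
```

Under this assumption, the blended memory would be passed to the attention module and decoder in place of the original encoder output, so a single scalar would adjust the emphatic strength at synthesis time without retraining.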