Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis.

Xixin Wu, Yuewen Cao, Mu Wang, Songxiang Liu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
DOI: https://doi.org/10.21437/interspeech.2018-1991
2018-01-01
Abstract: Synthesizing expressive speech with appropriate prosodic variations, e.g., various styles, still has much room for improvement. Previous methods have used manual annotations as conditioning attributes to provide variation information. However, the related training data are expensive to obtain, and the annotated style codes can be ambiguous and unreliable. In this paper, we explore utilizing the residual error as a conditioning attribute. The residual error is the difference between the prediction of a trained average model and the ground truth. We encode the residual error into a style embedding via a neural network-based error encoder. The style embedding is then fed to the target synthesis model to provide information for modeling various style distributions more accurately. The average model and the error encoder are jointly optimized with the target synthesis model. Our proposed method has two advantages: 1) the embedding is learned automatically, without the need for manual style annotations, which helps overcome data sparsity and ambiguity limitations; 2) for any unseen audio utterance, the style embedding can be generated efficiently, which enables rapid adaptation to the desired style with only a single adaptation utterance. Experimental results show that our proposed method outperforms the baseline model in both speech quality and style similarity.
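The data flow described in the abstract (average-model prediction, residual error, style embedding, conditioned synthesis input) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the toy feature values, the sign convention (ground truth minus prediction), and above all the encoder, which is replaced by mean pooling plus a fixed linear map rather than the neural network the paper describes. The joint optimization of the average model, error encoder, and synthesis model is not shown.

```python
from typing import List

Frame = List[float]  # one acoustic feature vector per time step


def residual_error(truth: List[Frame], predicted: List[Frame]) -> List[Frame]:
    """Frame-wise difference between the ground truth and the average model's prediction.

    Assumption: the sign convention (truth minus prediction) is chosen here
    for illustration; the abstract only says "difference".
    """
    if len(truth) != len(predicted):
        raise ValueError("sequences must have the same number of frames")
    return [[t - p for t, p in zip(tf, pf)] for tf, pf in zip(truth, predicted)]


def encode_error(residual: List[Frame], weights: List[List[float]]) -> List[float]:
    """Toy stand-in for the error encoder.

    Assumption: mean-pool the residual over time, then apply a fixed linear
    map. The paper uses a trained neural network; this only shows that a
    variable-length residual becomes one fixed-size vector per utterance.
    """
    n_frames = len(residual)
    dim = len(residual[0])
    pooled = [sum(frame[d] for frame in residual) / n_frames for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]


def condition_inputs(inputs: List[Frame], style: List[float]) -> List[Frame]:
    """Append the same utterance-level style embedding to every input frame."""
    return [frame + style for frame in inputs]


# Hypothetical toy data: 3 frames of 2-dimensional acoustic features.
ground_truth = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
average_prediction = [[0.5, 1.0], [1.5, 3.0], [2.5, 5.0]]

# Hypothetical fixed weights mapping 2 pooled dimensions to a 3-dim embedding.
encoder_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

residual = residual_error(ground_truth, average_prediction)
style_embedding = encode_error(residual, encoder_weights)

# Hypothetical 1-dimensional input features for the target synthesis model.
conditioned = condition_inputs([[0.0], [1.0], [0.0]], style_embedding)
```

Because the embedding is computed from a single utterance and its average-model prediction, a sketch like this needs no style labels, which mirrors the single-utterance adaptation setting the abstract describes.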