Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis

Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai
DOI: https://doi.org/10.21437/interspeech.2018-1198
2018-01-01
Abstract: This paper presents a method of learning and modeling unit embeddings using deep neural networks (DNNs) to improve the performance of HMM-based unit selection speech synthesis. First, a DNN with an embedding layer is built to learn, from scratch, a fixed-length embedding vector for each phone-sized candidate unit in the corpus. Then, two further DNNs are constructed to map linguistic features to the extracted unit vector of each phone. One of them also takes the unit vectors of preceding phones as model input. At synthesis time, the L2 distances between the unit vectors predicted by these two DNNs and the ones derived from candidate units are integrated into the target cost and the concatenation cost of HMM-based unit selection speech synthesis, respectively. Experimental results demonstrate that the unit vectors estimated using only acoustic features display phone-dependent clustering properties. Furthermore, integrating unit vector distances into the cost functions, especially the concatenation cost, improves the naturalness of HMM-based unit selection speech synthesis in our experiments.
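As a rough illustration of the synthesis-time step described in the abstract, the sketch below adds a weighted L2 distance between a predicted unit vector and a candidate unit's vector to an existing cost. The function names, the weight parameter, and the simple additive combination are assumptions made for illustration; the abstract does not state how the distances are weighted or combined with the baseline HMM-based costs.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def augmented_cost(
    base_cost: float,
    predicted_vec: Sequence[float],
    candidate_vec: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Hypothetical combination: baseline cost plus a weighted unit-vector distance.

    The same helper could be applied to a target cost (using the vector
    predicted from linguistic features alone) or to a concatenation cost
    (using the vector predicted with preceding unit vectors as extra input).
    """
    return base_cost + weight * l2_distance(predicted_vec, candidate_vec)


# Example: a candidate whose embedding is 5.0 away from the prediction.
cost = augmented_cost(2.0, [3.0, 4.0], [0.0, 0.0], weight=0.5)
```

Under this assumed scheme, a candidate unit whose embedding lies farther from the predicted vector is penalized more heavily, and the weight controls how strongly the embedding distance influences unit selection relative to the baseline cost.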