Extracting Unit Embeddings Using Sequence-To-Sequence Acoustic Models for Unit Selection Speech Synthesis

Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai
DOI: https://doi.org/10.1109/icassp40776.2020.9053812
2020-01-01
Abstract: This paper presents a method that uses the intermediate representations between linguistic and acoustic features in a Tacotron model to derive the cost functions for unit selection speech synthesis. By extracting the outputs of the Tacotron encoder, each phone-sized candidate unit in the corpus is represented by a fixed-length unit vector. Similarly, each target unit to be synthesized is converted into a unit vector of the same dimension by encoding the input phone sequence. The normalized Euclidean distance between these two vectors is used to perform unit pre-selection and to calculate the target cost for unit selection. Then, another DNN, which predicts the unit vector of each phone from its preceding ones, is constructed to derive the concatenation cost function. Experimental results demonstrate that the unit vectors extracted from Tacotron contain both duration and acoustic information of phone units. Compared with our previous work, which learned unit vectors using a DNN and only acoustic features, the method proposed in this paper further improves the naturalness of unit selection speech synthesis in our experiments.
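The pre-selection and target-cost step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the Euclidean distance is normalized, so the sketch assumes each dimension is scaled by a standard deviation supplied by the caller. The function names `normalized_euclidean` and `preselect`, the `std` argument, and the top-k cutoff `k` are hypothetical and are not taken from the paper.

```python
import numpy as np


def normalized_euclidean(target, candidates, std):
    """Distance from one target unit vector to each candidate unit vector.

    Assumption (not from the paper): normalization means dividing each
    dimension by a per-dimension standard deviation `std`.
    """
    diff = (candidates - target) / std
    return np.sqrt(np.sum(diff ** 2, axis=1))


def preselect(target, candidates, std, k):
    """Keep the k candidates closest to the target unit vector.

    Returns the indices of the kept candidates and their distances,
    which here double as a simple target cost.
    """
    dist = normalized_euclidean(target, candidates, std)
    order = np.argsort(dist)[:k]
    return order, dist[order]


# Toy usage with 2-dimensional unit vectors (real unit vectors would be
# derived from Tacotron encoder outputs, as the abstract describes).
candidates = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
target = np.array([0.0, 0.0])
std = np.ones(2)
kept, costs = preselect(target, candidates, std, k=2)
```

In this sketch the same distance serves both purposes named in the abstract: ranking candidates for pre-selection and scoring the survivors as a target cost. How the paper weights or combines this cost with the concatenation cost is not stated in the abstract.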