Perceptual Clustering Based Unit Selection Optimization for Concatenative Text-to-speech Synthesis

Tao Jiang, Zhiyong Wu, Jia, Lianhong Cai
DOI: https://doi.org/10.1109/iscslp.2012.6423489
2012-01-01
Abstract: In concatenative speech synthesis, the purpose of unit selection is to select appropriate speech units from a speech corpus by measuring how well the selected units match the given features. Perceptual tests indicate that some features are consistently preferred for making perceptual distinctions between units. Such features should be evaluated before others in unit selection. In this work, we attempt to identify the priorities of different features and to optimize unit selection with perceptual clustering. Our approach first clusters the speech units by hierarchical clustering based on a perceptual distance measure between speech units. A method for identifying the questions (concerning the features) is then proposed to build a decision tree from the clustering result. The features used in the decision tree are the preferred ones, and the remaining features are used in the target cost function. Linear discriminant analysis (LDA) is then adopted to train the weights of the target cost function from the clustering result, making the weights more reasonable and perceptually relevant. Experimental results indicate that the optimized unit selection generates synthetic speech with higher naturalness than the previous approach.
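The first step of the approach, hierarchical clustering over a pairwise perceptual distance, can be illustrated with a minimal sketch. The abstract does not state the distance measure or the linkage criterion, so the average-linkage rule, the function name `average_linkage`, and the toy distance matrix below are all assumptions made purely for illustration, not the authors' method.

```python
from itertools import combinations


def average_linkage(dist, n_clusters):
    """Agglomerative clustering over a precomputed symmetric distance matrix.

    Average linkage is an assumption; the abstract does not name a criterion.
    """
    # Start with every speech unit in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Mean pairwise distance between members of the two clusters.
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; a < b, so deleting b leaves a in place.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical perceptual distances between four speech units.
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = average_linkage(toy_dist, 2)
print(groups)
```

In the pipeline the abstract describes, cluster assignments of this kind would then serve as labels for building the decision tree and for training the target cost weights with LDA; those later stages are not shown here.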