Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction.

Yibin Zheng,Jianhua Tao,Zhengqi Wen,Ya Li,Bin Liu
DOI: https://doi.org/10.21437/interspeech.2017-1086
2017-01-01
Abstract:Accurate modeling and prediction of speech-sound durations are important in generating natural synthetic speech. This paper focuses on both feature and training objective aspects to improve the performance of the phone duration model for speech synthesis system. In feature aspect, we combine the feature representation from gradient boosting decision tree (GBDT) and phoneme identity embedding model (which is realized by the jointly training of phoneme embedded vector (PEV) and word embedded vector (WEV)) for BLSTM to predict the phone duration. The PEV is used to replace the one-hot phoneme identity, and GBDT is utilized to transform the traditional contextual features. In the training objective aspect, a new training objective function which taking into account of the correlation and consistency between the predicted utterance and the natural utterance is proposed. Perceptual tests indicate the proposed methods could improve the naturalness of the synthetic speech, which benefits from the proposed feature representation methods could capture more precise contextual features, and the proposed training objective function could tackle the over-averaged problem for the generated phone durations.
What problem does this paper attempt to address?