Discrete Duration Model for Speech Synthesis.

Bo Chen,Tianling Bian,Kai Yu
DOI: https://doi.org/10.21437/interspeech.2017-1144
2017-01-01
Abstract:The acoustic model and the duration model are the two major components in statistical parametric speech synthesis (SPSS) systems. The neural network based acoustic model makes it possible to model phoneme duration at phone-level instead of state-level in conventional hidden Markov model (HMM) based SPSS systems. Since the duration of phonemes is countable value, the distribution of the phone-level duration is discrete given the linguistic features, which means the Gaussian hypothesis is no longer necessary. This paper provides an investigation on the performance of LSTM-RNN duration model that directly models the probability of the countable duration values given linguistic features using cross entropy as criteria. The multitask learning is also experimented at the same time, with a comparison to the standard LSTM-RNN duration model in objective and subjective measures. The result shows that directly modeling the discrete distribution has its benefit and multi-task model achieves better performance in phone-level duration modeling.
What problem does this paper attempt to address?