Abstract: This paper proposes to select frame-sized speech segments for waveform concatenation speech synthesis using neural network based acoustic models. First, a deep neural network (DNN) based frame selection method is presented. In this method, three DNNs are adopted to calculate the target costs and concatenation costs used for selecting candidate frames of 5 ms length. One DNN is built in the same way as in DNN-based statistical parametric speech synthesis and predicts target acoustic features given linguistic context inputs. The distance between the acoustic features of a candidate unit and the predicted features for a target unit is calculated as the target cost. The other two DNNs are constructed to predict the acoustic features at the current frame using its context features and the acoustic features of the preceding frames. At synthesis time, these two DNNs are employed to calculate the concatenation cost for each candidate frame given its preceding ones. Furthermore, recurrent neural networks (RNNs) with long short-term memory (LSTM) cells are adopted to replace the DNNs for acoustic modeling in order to make better use of sequential information. A strategy of using multi-frame segments instead of single frames as the basic unit for selection is also presented to reduce the number of concatenation points within the synthetic speech. Experimental results show that the proposed method achieves better naturalness than the hidden Markov model (HMM)-based frame selection method and the HMM-based parametric speech synthesis method.
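The selection procedure summarized in the abstract can be illustrated with a minimal sketch of a dynamic-programming search over candidate frames. Everything here is an assumption made for illustration and is not taken from the paper: the function names `euclidean` and `select_frames`, the weight `w_concat`, the use of Euclidean distance for both costs, and the callable `predict_next`, which stands in for the neural networks that predict the current frame's acoustic features from its predecessor.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_frames(targets, candidates, predict_next, w_concat=1.0):
    """Pick one candidate per target frame by minimizing the total cost.

    targets:      predicted feature vector for each target frame
                  (stand-in for the output of the target-cost model).
    candidates:   for each target frame, a list of candidate feature vectors.
    predict_next: hypothetical callable mapping the previous candidate's
                  features to predicted features for the current frame.
    w_concat:     hypothetical weight on the concatenation cost.
    Returns (indices of the chosen candidates, total cost).
    """
    # best[t][j]: lowest cumulative cost of a path ending at candidate j of frame t
    best = [[euclidean(c, targets[0]) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for t in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[t]:
            target_cost = euclidean(cand, targets[t])
            # Concatenation cost: distance between this candidate and the
            # features predicted from each possible preceding candidate.
            totals = [
                best[t - 1][i] + w_concat * euclidean(cand, predict_next(prev))
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(totals)), key=totals.__getitem__)
            row.append(target_cost + totals[i_min])
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for t in range(len(targets) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path, total
```

In this sketch the predictor sees only one preceding frame and no context features; a predictor that conditions on several preceding frames, as the abstract describes, would need the search state to carry that history.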

HMM-based Unit Selection Using Frame Sized Speech Segments

HMM-Based Hierarchical Unit Selection Combining Kullback-Leibler Divergence with Likelihood Criterion

Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis

Statistical modeling of syllable-level F0 features for HMM-based unit selection speech synthesis

DNN-based unit selection using frame-sized speech segments

Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models

HMM-based Unit Selection Speech Synthesis Using Log Likelihood Ratios Derived from Perceptual Data

Minimum Unit Selection Error Training for HMM-based Unit Selection Speech Synthesis System

Hierarchical Non-Uniform Unit Selection Based on Prosodic Structure

A Novel Hybrid Approach for Mandarin Speech Synthesis

Selecting optimal non-uniform units for hierarchical unit selection

Trainable Unit Selection Speech Synthesis under Statistical Framework

Stable boundary-based non-uniform unit selection in speech synthesis

A novel unit selection method for concatenation speech system using similarity measure

Building HMM based unit-selection speech synthesis system using synthetic speech naturalness evaluation score

Perceptual Clustering Based Unit Selection Optimization for Concatenative Text-to-speech Synthesis

Optimization Method for Unit Selection Speech Synthesis Based on Synthesis Quality Predictions

Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech