Abstract: A method of learning and modeling unit embeddings using deep neural networks (DNNs) is presented in this article for unit-selection-based Mandarin speech synthesis. Here, a unit embedding is defined as a fixed-length embedding vector for a phone-sized unit candidate in a corpus. Modeling phone-sized embedding vectors instead of frame-sized acoustic features can better measure the long-term dependencies among consecutive units in an utterance. First, a DNN with an embedding layer is built to learn the embedding vectors of all unit candidates in the corpus from scratch. To enable the extracted embedding vectors to carry both acoustic and linguistic information about the unit candidates, a multitarget learning strategy is designed for the DNN. Its optional prediction targets include frame-level acoustic features, unit durations, monophone and tone identifiers, and context classes. Then, two further DNNs are constructed to map linguistic features to the extracted embedding vectors. One of them takes the unit vectors of the preceding phones, in addition to the linguistic features of the current phone, as its input. At synthesis time, the distances between the unit vectors predicted by these two DNNs and the ones derived from unit candidates are used as a part of the target cost and a part of the concatenation cost, respectively. Our experiments on a Mandarin speech synthesis corpus demonstrate that learning and modeling unit embeddings improve the naturalness of hidden Markov model (HMM)-based unit selection speech synthesis. Furthermore, according to our subjective evaluation results, integrating multiple targets for learning unit embeddings achieves better performance than using only acoustic targets.
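
The way the predicted unit vectors enter the two costs at synthesis time can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract says only that "distances" are used and that they form "a part of" each cost, so the Euclidean distance, the weights `w_target` and `w_concat`, the function names, and the `context_pred` callable (standing in for the DNN that also sees preceding unit vectors) are all assumptions made for illustration. The HMM-based cost terms of the full system are omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(candidates, target_pred, context_pred, w_target=1.0, w_concat=1.0):
    """Pick one candidate per phone position by dynamic programming.

    candidates[t]   : list of embedding vectors of the unit candidates at position t
    target_pred[t]  : vector predicted from the linguistic features of phone t
    context_pred    : hypothetical callable (prev_vector, t) -> predicted vector,
                      standing in for the DNN that also uses preceding unit vectors
    Returns the list of chosen candidate indices, one per position.
    """
    n = len(candidates)
    # best[t][j] = (cumulative cost of the best path ending in candidate j, backpointer)
    best = [[(w_target * euclidean(target_pred[0], c), None) for c in candidates[0]]]
    for t in range(1, n):
        row = []
        for cand in candidates[t]:
            # Embedding-based part of the target cost.
            tc = w_target * euclidean(target_pred[t], cand)
            options = []
            for i, prev in enumerate(candidates[t - 1]):
                # Embedding-based part of the concatenation cost.
                cc = w_concat * euclidean(context_pred(prev, t), cand)
                options.append((best[t - 1][i][0] + tc + cc, i))
            row.append(min(options))
        best.append(row)

    # Backtrace from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = best[t][j][1]
        path.append(j)
    path.reverse()
    return path
```

In a complete system these two distance terms would be added to the existing HMM-based target and concatenation costs rather than used alone. The relative weights would need tuning.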

DNN-based unit selection using frame-sized speech segments

Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models

HMM-based Unit Selection Using Frame Sized Speech Segments

Extracting Unit Embeddings Using Sequence-To-Sequence Acoustic Models for Unit Selection Speech Synthesis

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit-Selection-Based Mandarin Speech Synthesis

Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis

Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis

HMM-Based Hierarchical Unit Selection Combining Kullback-Leibler Divergence with Likelihood Criterion

Progressive Neural Networks Based Features Prediction for the Target Cost in Unit-Selection Speech Synthesizer

Unitnet: A Sequence-To-Sequence Acoustic Model For Concatenative Speech Synthesis

Context features based pre-selection and weight prediction in concatenation speech synthesis system

Hierarchical Non-Uniform Unit Selection Based on Prosodic Structure

Deep Metric Learning For The Target Cost In Unit-Selection Speech Synthesizer

Stable boundary-based non-uniform unit selection in speech synthesis

A data driven method for target and concatenation cost calculation with KL-Divergence in Mandarin hybrid speech synthesis

HMM-based Unit Selection Speech Synthesis Using Log Likelihood Ratios Derived from Perceptual Data

Perceptual Clustering Based Unit Selection Optimization for Concatenative Text-to-speech Synthesis

A novel unit selection method for concatenation speech system using similarity measure