Abstract: A method of learning and modeling unit embeddings using deep neural networks (DNNs) is presented in this article for unit-selection-based Mandarin speech synthesis. Here, a unit embedding is defined as a fixed-length embedding vector for a phone-sized unit candidate in a corpus. Modeling phone-sized embedding vectors instead of frame-sized acoustic features can better measure the long-term dependencies among consecutive units in an utterance. First, a DNN with an embedding layer is built to learn the embedding vectors of all unit candidates in the corpus from scratch. To enable the extracted embedding vectors to carry both acoustic and linguistic information about the unit candidates, a multitarget learning strategy is designed for the DNN. Its optional prediction targets include frame-level acoustic features, unit durations, monophone and tone identifiers, and context classes. Then, two further DNNs are constructed to map linguistic features to the extracted embedding vectors. One of them takes the unit vectors of preceding phones, in addition to the linguistic features of the current phone, as its input. At synthesis time, the distances between the unit vectors predicted by these two DNNs and the ones derived from unit candidates are used as a part of the target cost and a part of the concatenation cost, respectively. Our experiments on a Mandarin speech synthesis corpus demonstrate that learning and modeling unit embeddings improve the naturalness of hidden Markov model (HMM)-based unit selection speech synthesis. Furthermore, according to our subjective evaluation results, integrating multiple targets for learning unit embeddings achieves better performance than using only acoustic targets.
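As a rough illustration of the synthesis-time step described above, the following minimal sketch adds an embedding-distance term to a per-candidate target cost. The abstract does not specify the distance measure or how the term is weighted, so the Euclidean distance, the `weight` parameter, the function names, and the toy numbers are all assumptions made for illustration, not the authors' implementation.

```python
import math

def euclidean(u, v):
    # Assumed distance measure; the abstract only says "distances".
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def target_costs_with_embedding(base_costs, predicted, candidates, weight=1.0):
    # base_costs: hypothetical existing target costs, one per candidate unit.
    # predicted:  unit vector predicted by a DNN from linguistic features.
    # candidates: embedding vectors learned for the candidate units.
    # weight:     hypothetical scaling factor for the embedding-distance term.
    return [
        base + weight * euclidean(predicted, cand)
        for base, cand in zip(base_costs, candidates)
    ]

# Toy example with 2-dimensional embeddings and two candidate units.
predicted = [0.0, 0.0]
candidates = [[3.0, 4.0], [0.0, 1.0]]
base_costs = [0.5, 2.0]

costs = target_costs_with_embedding(base_costs, predicted, candidates, weight=0.5)
best = min(range(len(costs)), key=costs.__getitem__)  # index of the cheapest candidate
```

In this toy case the second candidate wins despite its higher base cost, because its embedding lies much closer to the predicted unit vector. The same pattern would apply to the concatenation cost, using the vector predicted by the DNN that also takes preceding unit vectors as input.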

Selection of acoustic modeling unit for Tibetan speech recognition based on deep learning

A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition

Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit-Selection-Based Mandarin Speech Synthesis

A comparable study of modeling units for end-to-end Mandarin speech recognition

Research on Acoustic Model of Putian Dialect Speech Recognition Based on Deep Learning

Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

An Investigation of High-Resolution Modeling Units of Deep Neural Networks for Acoustic Scene Classification

Acoustic Modeling Based On Chinese Phonetics Knowledge

Research on Inter-Syllable Context-Dependent Acoustic Unit for Mandarin Continuous Speech Recognition

An Acoustic Model for English Speech Recognition Based on Deep Learning

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning

Reliable accent specific unit generation with dynamic Gaussian mixture selection for multi-accent speech recognition

Deep neural networks for syllable based acoustic modeling in Chinese speech recognition

Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition

Enhancing CTC-based speech recognition with diverse modeling units