Abstract: Deep learning-based speech synthesis has advanced by adopting a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model pairs an encoder, which delivers the linguistic features, with a decoder, which predicts the mel-spectrogram, and it learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram autoregressively: each step considers the current input and what the model has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, the recursive generation of speech typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce a refined alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves the synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
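
To make the non-autoregressive idea concrete, the following is a minimal NumPy sketch, not the authors' architecture. It assumes a fixed sinusoidal array as a stand-in for the TVMT, a hand-picked frame count in place of the learned length distribution, and plain scaled dot-product attention in place of the convolutional decoder chains. All names (`attend`, `parallel_decode`, `template`, `w_out`) are hypothetical. The point it illustrates is only that every output frame is computed at once from the template and the linguistic features, with attention applied repeatedly to refine the alignment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention; returns context and the alignment matrix."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    alignment = softmax(scores, axis=-1)  # shape: (n_frames, n_tokens)
    return alignment @ values, alignment

def parallel_decode(template, text_feats, w_out, n_iters=2):
    """Refine all frames together; no frame waits on a previously generated frame."""
    hidden = template
    alignment = None
    for _ in range(n_iters):
        # Re-apply attention on the partially refined template (assumed residual update).
        context, alignment = attend(hidden, text_feats, text_feats)
        hidden = hidden + context
    # Project to spectral features (assumed linear output layer).
    return hidden @ w_out, alignment

rng = np.random.default_rng(0)
n_tokens, n_frames, dim, n_mels = 5, 12, 8, 4

# Stand-in for encoder output: one feature vector per linguistic token.
text_feats = rng.normal(size=(n_tokens, dim))

# Stand-in for the template: a fixed time-varying sinusoidal pattern.
# In the paper the template length is predicted; here it is set by hand.
positions = np.arange(n_frames)[:, None] / n_frames
template = np.sin(positions * np.arange(1, dim + 1)[None, :])

w_out = rng.normal(size=(dim, n_mels))
mel, alignment = parallel_decode(template, text_feats, w_out)
print(mel.shape, alignment.shape)
```

An autoregressive decoder would instead loop over frames and feed each predicted frame back as input to the next step, so the frame loop could not be vectorized as it is here.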

Improving Deep Neural Network Based Speech Synthesis Through Contextual Feature Parametrization and Multi-Task Learning

Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis

Multi-task Learning of Structured Output Layer Bidirectional LSTMs for Speech Synthesis

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App

The Parameterized Phoneme Identity Feature As a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Neural Speech Synthesis with Transformer Network

Improved BLSTM RNN Based Accent Speech Recognition Using Multi-task Learning and Accent Embeddings

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

SNR-Based Progressive Learning of Deep Neural Network for Speech Enhancement

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin

A Comparison of Expressive Speech Synthesis Approaches based on Neural Network

DIAN: Duration Informed Auto-Regressive Network for Voice Cloning

Improving Prosody with Linguistic and BERT Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

Deep Feed-Forward Sequential Memory Networks for Speech Synthesis