Abstract: Deep learning-based speech synthesis has evolved by employing a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model consists of an encoder that delivers the linguistic features and a decoder that predicts the mel-spectrogram, and it learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram with an autoregressive flow that considers the current input and what it has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, the recursive generation of speech typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce a refined alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves the synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
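The contrast the abstract draws between an autoregressive flow and parallel decoding from a template can be illustrated with a toy sketch. This is not the authors' architecture: the linear maps `W_prev` and `W_ctx`, the dimensions, and the all-zero `template` standing in for the TVMT are hypothetical choices made only for illustration, and the output length `T` is assumed to be given rather than predicted from a conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS = 4  # toy spectral feature dimension (hypothetical)
T = 6       # number of output frames (assumed known here)

# Hypothetical toy parameters; a real model would learn these.
W_prev = rng.standard_normal((N_MELS, N_MELS)) * 0.1
W_ctx = rng.standard_normal((N_MELS, N_MELS)) * 0.1


def autoregressive_decode(context):
    """Frame t depends on frame t-1, so the time loop is inherently sequential."""
    frames = []
    prev = np.zeros(N_MELS)
    for t in range(context.shape[0]):
        prev = np.tanh(prev @ W_prev + context[t] @ W_ctx)
        frames.append(prev)
    return np.stack(frames)


def parallel_decode(template, context):
    """Every frame is computed at once from a template input of the target length."""
    return np.tanh(template @ W_prev + context @ W_ctx)


# Stand-in for linguistic features already aligned to T frames.
context = rng.standard_normal((T, N_MELS))
# Stand-in for the TVMT: a fixed decoder input with one row per output frame.
template = np.zeros((T, N_MELS))

ar_frames = autoregressive_decode(context)
par_frames = parallel_decode(template, context)
```

Both functions return an array of shape `(T, N_MELS)`, but only the second avoids a dependency between consecutive frames, which is what allows all frames to be generated in a single pass.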

Quasi-Fully Convolutional Neural Network With Variational Inference For Speech Synthesis

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Inference skipping for more efficient real-time speech enhancement with parallel RNNs

Hierarchical RNNs for Waveform-Level Speech Synthesis

Speech Super-Resolution Using Parallel WaveNet

Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS

Parallel Synthesis for Autoregressive Speech Generation

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Improving Deep Neural Network Based Speech Synthesis Through Contextual Feature Parametrization and Multi-Task Learning

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech

SpeedySpeech: Efficient Neural Speech Synthesis

Singing voice synthesis based on convolutional neural networks

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis