Abstract: Deep learning-based speech synthesis has advanced by adopting a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model pairs an encoder that delivers the linguistic features with a decoder that predicts the mel-spectrogram, and it learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram through an autoregressive flow that considers the current input together with what it has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, generating speech recursively typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce a refined alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves the synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
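The idea of decoding from a template rather than from previously generated frames can be illustrated with a toy sketch. It is not the proposed architecture: the sinusoidal template is a hypothetical stand-in for the TVMT, its length is fixed by hand instead of being drawn from a learned conditional distribution, the additive update loosely stands in for the chained decoders, and every function name here is an assumption made for illustration. It shows only that, when each output frame attends to the encoded text from its own template row, no frame depends on another frame, so all frames could be computed in parallel.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, encoded):
    """Scaled dot-product attention; encoded vectors serve as keys and values."""
    dim = len(encoded[0])
    contexts, alignment = [], []
    for query in queries:  # rows are independent, so this loop is parallelizable
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in encoded]
        weights = softmax(scores)
        alignment.append(weights)
        contexts.append([sum(w * key[d] for w, key in zip(weights, encoded))
                         for d in range(dim)])
    return contexts, alignment

def sinusoidal_template(length, dim):
    """Hypothetical placeholder template (assumption, not the paper's TVMT)."""
    return [[math.sin((t + 1) / length * math.pi * (d + 1)) for d in range(dim)]
            for t in range(length)]

def decode(template, encoded, num_decoders=3):
    """Apply attention repeatedly, refining the template rows at each step."""
    hidden = template
    alignment = []
    for _ in range(num_decoders):
        contexts, alignment = attend(hidden, encoded)
        # Additive refinement: an illustrative choice, not the paper's decoder.
        hidden = [[h + c for h, c in zip(h_row, c_row)]
                  for h_row, c_row in zip(hidden, contexts)]
    return hidden, alignment

# Toy usage: two "linguistic feature" vectors and a five-frame template.
encoded = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
template = sinusoidal_template(length=5, dim=3)
frames, alignment = decode(template, encoded)
```

Because `decode` never reads one output row while computing another, decoding a single template row in isolation gives the same result as decoding it as part of the full sequence. An autoregressive decoder does not have this property.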

Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis

Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables

Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

Differentiable All-pole Filters for Time-varying Audio Systems

Symmetric and asymmetric Gaussian weighted linear prediction for voice inverse filtering

Differentiable Signal Processing With Black-Box Audio Effects

Vocal Timbre Effects with Differentiable Digital Signal Processing

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

Biomimetic Frontend for Differentiable Audio Processing

Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

Parallel Synthesis for Autoregressive Speech Generation

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Differentiable Duration Refinement Using Internal Division for Non-Autoregressive Text-to-Speech

High quality, lightweight and adaptable TTS using LPCNet