Abstract: Recently, there has been growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and decoupling TTS into two sequence-to-sequence tasks. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion of existing semantic encoding methods. To address these problems, we propose three progressive methods. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which uses a diffusion model to map the semantic embedding to the mel-spectrogram and thereby achieves higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules, including a duration diffusion model designed to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules, which shows that existing semantic encoding models are not necessary and achieves the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.

Discrete Duration Model for Speech Synthesis

Comparison of Modeling Target in LSTM-RNN Duration Model

Neural Network-Based Modeling of Phonetic Durations

Full Covariance State Duration Modeling for HMM-based Speech Synthesis

Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

Expressive, Variable, and Controllable Duration Modelling in TTS

Duration Refinement by Jointly Optimizing State and Longer Unit Likelihood

Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction

Modeling Duration and Intonation in Mandarin Chinese Synthesis with a Neural Network

State Duration-Based Segmental Probability Model for Chinese Speech

Duration Modeling of Neural TTS for Automatic Dubbing

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis

Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Duration-Distribution-Based HMM for Speech Recognition

End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

State Duration-Based Segmental Probability Model

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition