Abstract: Recently, there has been growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and decoupling TTS into two sequence-to-sequence tasks. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion of existing semantic encoding methods. To address these problems, we propose three progressive methods. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which uses a diffusion model to map semantic embeddings to mel-spectrograms and thereby achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules, which introduces a duration diffusion model to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules, which shows that existing semantic encoding models are unnecessary and achieves the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.

SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control

Multimodal Latent Language Modeling with Next-Token Diffusion

Diffusion-LM Improves Controllable Text Generation

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

Self-conditioned Embedding Diffusion for Text Generation

Diffusion Guided Language Modeling

Promises, Outlooks and Challenges of Diffusion Language Modeling

TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation

Energy-Based Diffusion Language Models for Text Generation

Latent Diffusion for Language Generation

Utilizing Latent Diffusion Model to Accelerate Sampling Speed and Enhance Text Generation Quality

A Cheaper and Better Diffusion Language Model with Soft-Masked Noise

DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

Think While You Generate: Discrete Diffusion with Planned Denoising

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Simple and Effective Masked Diffusion Language Models