Abstract: Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding both the complexities of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which improves convergence speed, training stability, and memory usage, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that simplifies EzAudio by achieving strong prompt alignment while preserving high audio quality at larger CFG scores, removing the need to search for a CFG score that balances this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/
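The abstract does not give the formula behind the CFG rescaling in item (4). As a rough illustration of the general idea only, the sketch below assumes a commonly used form of rescaling: apply standard classifier-free guidance, then scale the guided prediction so its standard deviation matches that of the conditional prediction, and blend the two. The function names, the `rescale` blend factor, and the flat-list representation of a latent are hypothetical and are not taken from the EzAudio code.

```python
from statistics import pstdev


def cfg(cond, uncond, scale):
    """Standard classifier-free guidance on flat lists of floats."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_rescaled(cond, uncond, scale, rescale=0.75, eps=1e-8):
    """Guidance followed by a std-matching rescale (assumed form, not the paper's).

    rescale=0.0 returns plain CFG; rescale=1.0 fully matches the
    standard deviation of the conditional prediction.
    """
    guided = cfg(cond, uncond, scale)
    # Ratio that brings the guided prediction back to the conditional scale.
    ratio = pstdev(cond) / (pstdev(guided) + eps)
    # Blend the std-matched prediction with the unmodified guided one.
    return [rescale * (g * ratio) + (1.0 - rescale) * g for g in guided]
```

Under this assumption, a large guidance scale still pushes the prediction toward the prompt, while the rescale step keeps its magnitude close to that of the conditional prediction, which is one plausible way to limit quality loss at high CFG scores.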

Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation

PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Autoregressive Speech Synthesis without Vector Quantization

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

MelNet: A Generative Model for Audio in the Frequency Domain

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

ETTA: Elucidating the Design Space of Text-to-Audio Models

Efficient Neural Music Generation

TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models