Abstract: Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding both the complexity of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which improves convergence speed, training stability, and memory efficiency, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that simplifies EzAudio by achieving strong prompt alignment while preserving audio quality at larger CFG scores, removing the need to search for a CFG score that balances this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/
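
The abstract does not spell out the CFG rescaling formula, so the following is only a minimal sketch of one widely used form of the idea: rescale the guided prediction so that its per-sample standard deviation matches that of the conditional prediction, then blend it with the unrescaled result. The function name `cfg_with_rescale`, the parameters `guidance_scale` and `rescale`, and the default values are assumptions for illustration, not taken from the EzAudio code.

```python
import numpy as np


def cfg_with_rescale(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Classifier-free guidance followed by a standard-deviation rescale.

    cond, uncond: model predictions with and without the text prompt,
    arrays of shape (batch, ...). All names here are hypothetical.
    """
    # Standard CFG: extrapolate from the unconditional prediction
    # toward the conditional one.
    guided = uncond + guidance_scale * (cond - uncond)

    # Per-sample standard deviation over all non-batch axes.
    axes = tuple(range(1, cond.ndim))
    std_cond = cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)

    # Shrink the guided prediction back to the scale of the conditional one.
    rescaled = guided * (std_cond / (std_guided + 1e-8))

    # Blend: rescale=0 gives plain CFG, rescale=1 gives the fully rescaled output.
    return rescale * rescaled + (1.0 - rescale) * guided


# Usage with random stand-ins for the two model predictions.
rng = np.random.default_rng(0)
cond = rng.normal(size=(2, 8, 16))
uncond = rng.normal(size=(2, 8, 16))
out = cfg_with_rescale(cond, uncond, guidance_scale=7.0, rescale=0.75)
print(out.shape)
```

In this sketch a large `guidance_scale` inflates the magnitude of the guided prediction, and the rescale step counteracts that inflation, which is one way a model could tolerate larger CFG scores without degrading output quality.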

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Audio Generation with Multiple Conditional Diffusion Model

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Video-to-Audio Generation with Fine-grained Temporal Semantics

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Text Diffusion with Reinforced Conditioning

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance

E3 TTS: Easy End-to-End Diffusion-based Text to Speech

Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer