Abstract: Recent advances in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process imposes a significant computational burden, resulting in slow generation and limiting their deployment for text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that remove noise through an iterative process, AudioLCM integrates Consistency Models (CMs) into the generation process, enabling rapid inference by mapping any point at any time step directly to the initial point of its trajectory. To overcome the convergence issue inherent in LDMs when the number of sampling iterations is reduced, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This shortens the time schedule from thousands of steps to dozens while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to improve the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational transformer framework. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM samples 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practical for text-to-audio generation deployment. Our extensive preliminary analysis shows that each design choice in AudioLCM is effective. https://AudioLCM.github.io/
Code is available at https://github.com/Text-to-Audio/AudioLCM
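The abstract describes a consistency model as a map from any point on a trajectory back to that trajectory's initial point, which is what allows sampling in as few as 2 iterations. The sketch below illustrates that idea with the generic multistep consistency sampling loop from the consistency-model literature, not AudioLCM's actual sampler. Everything named here is an assumption for illustration: `toy_consistency_fn` is a hypothetical closed-form stand-in for a trained network (it is exact only for Gaussian data under a variance-exploding noise process), and `timesteps`, `eps`, `mu`, and `sigma_data` are made-up values.

```python
import math
import random


def toy_consistency_fn(x, t, mu=0.0, sigma_data=0.5, eps=0.002):
    """Hypothetical stand-in for a trained consistency model f(x_t, t).

    Assumes data ~ N(mu, sigma_data^2) and x_t = x_0 + t * noise, for which
    the probability-flow ODE trajectory satisfies
    (x_t - mu) proportional to sqrt(sigma_data^2 + t^2).
    Maps a point at time t to the trajectory's point at time eps.
    """
    scale = math.sqrt(sigma_data**2 + eps**2) / math.sqrt(sigma_data**2 + t**2)
    return [mu + (xi - mu) * scale for xi in x]


def multistep_consistency_sample(f, dim, timesteps, eps=0.002, rng=None):
    """Generic multistep consistency sampling (illustrative, not AudioLCM's code).

    timesteps: decreasing noise levels, e.g. [80.0, 1.0]; each entry costs
    one evaluation of f, so two entries correspond to 2 iterations.
    """
    rng = rng or random.Random(0)
    t_max = timesteps[0]
    # Start from pure noise at the largest noise level and jump to the origin.
    x = [t_max * rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = f(x, t_max)
    for t in timesteps[1:]:
        # Re-noise the current estimate to level t, then jump back again.
        noise_scale = math.sqrt(t**2 - eps**2)
        x_t = [xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]
        x = f(x_t, t)
    return x


# Two evaluations of f, mirroring the 2-iteration setting in the abstract.
sample = multistep_consistency_sample(toy_consistency_fn, dim=4, timesteps=[80.0, 1.0])
```

In a real system `f` would be the distilled latent network conditioned on the text prompt, and the returned latent would still need to be decoded to a waveform; both steps are outside this sketch.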

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Multimodal Latent Language Modeling with Next-Token Diffusion

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Video-to-Audio Generation with Fine-grained Temporal Semantics

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

On The Open Prompt Challenge In Conditional Audio Generation

Retrieval-Augmented Text-to-Audio Generation

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

Text-to-Audio Generation Synchronized with Videos