Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process imposes a significant computational burden, resulting in slow generation and limiting their deployment for text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that remove noise iteratively, AudioLCM integrates Consistency Models (CMs) into the generation process, enabling rapid inference by mapping any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sampling iterations, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This shortens the time schedule from thousands of steps to dozens while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to improve the performance of transformer-based neural network architectures, we integrate the techniques pioneered by LLaMA into the transformer backbone. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM samples 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practical for text-to-audio deployment. Our extensive preliminary analysis shows that each design choice in AudioLCM is effective. Project page: https://AudioLCM.github.io/
Code is available at https://github.com/Text-to-Audio/AudioLCM
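
The abstract's central idea, a consistency function that maps any point on a sampling trajectory back to the trajectory's starting point, can be illustrated with a toy sketch. The code below is not AudioLCM's network or training procedure. It assumes a 1-D Gaussian data distribution, for which the probability-flow ODE has a closed-form solution, and it uses the generic multistep consistency sampling loop from the original consistency-model formulation. The constants `MU`, `S`, `EPS`, `T` and the helper names `consistency_fn` and `sample` are hypothetical and chosen only for illustration.

```python
import math
import random

# Toy setup: every value here is an illustrative assumption, not from AudioLCM.
MU, S = 1.5, 0.5        # mean and std of the 1-D Gaussian "data" distribution
EPS, T = 0.002, 80.0    # smallest and largest noise levels on the trajectory


def consistency_fn(x, t):
    """Map a point x at noise level t to its trajectory's start (level EPS).

    For Gaussian data the probability-flow ODE is
        dx/dt = t * (x - MU) / (S^2 + t^2),
    so (x_t - MU) is proportional to sqrt(S^2 + t^2) along a trajectory.
    A real model would replace this closed form with a trained network.
    """
    scale = math.sqrt(S**2 + EPS**2) / math.sqrt(S**2 + t**2)
    return MU + (x - MU) * scale


def sample(taus, rng):
    """Multistep consistency sampling: one function call per level in [T] + taus."""
    # Step 1: draw pure noise at level T and jump straight to the data end.
    x = consistency_fn(rng.gauss(0.0, T), T)
    # Optional refinement steps: re-noise to level tau, then jump back again.
    for tau in taus:
        x_tau = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_tau, tau)
    return x


rng = random.Random(0)
# Two function evaluations per sample, echoing the 2-iteration setting above.
two_step = [sample([0.8], rng) for _ in range(5)]
```

With an empty `taus` list the loop reduces to single-step generation; each extra entry adds one more function evaluation, which is the trade-off between speed and quality that few-step sampling exposes.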

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

Towards audio language modeling -- an overview

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

AudioLM: a Language Modeling Approach to Audio Generation

Learning Source Disentanglement in Neural Audio Codec

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation

Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Generating Stereophonic Music with Single-Stage Language Models

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Audio Language Modeling using Perceptually-Guided Discrete Representations

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec