Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation and limiting their deployment for text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that remove noise through iterative processes, AudioLCM integrates Consistency Models (CMs) into the generation process, enabling rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sampling iterations, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This shortens the time schedule from thousands of steps to dozens while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational transformer framework. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM achieves a sampling speed 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective. Project page: https://AudioLCM.github.io/
Code is available at https://github.com/Text-to-Audio/AudioLCM
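
The abstract's central idea, a mapping from any point on a sampling trajectory back to the trajectory's initial point, can be illustrated with a toy sketch. The code below is not AudioLCM's implementation: it shows the generic multistep consistency sampling loop on a 1-D Gaussian "data" distribution, where the consistency function has a closed form and stands in for the trained network. The names `consistency_fn` and `multistep_sample`, the constants `EPS`, `T_MAX`, and `DATA_STD`, and the variance-exploding noise schedule are all assumptions chosen for illustration; text conditioning, guidance, and the latent decoder are omitted.

```python
import math
import random

# All constants are illustrative assumptions, not values from the paper.
EPS = 0.002      # smallest time on the trajectory (the "initial point")
T_MAX = 80.0     # largest time, where samples are almost pure noise
DATA_STD = 0.5   # standard deviation of the toy 1-D Gaussian data


def consistency_fn(x: float, t: float) -> float:
    """Map a point x at time t to its trajectory's point at time EPS.

    For Gaussian data with noise level t, the probability-flow ODE
    rescales x so that its variance goes from DATA_STD^2 + t^2 down to
    DATA_STD^2 + EPS^2. A trained network would replace this formula.
    """
    return x * math.sqrt((DATA_STD**2 + EPS**2) / (DATA_STD**2 + t**2))


def multistep_sample(time_points, rng):
    """Generic multistep consistency sampling.

    Uses 1 + len(time_points) evaluations of consistency_fn, so a
    single intermediate time point gives a 2-iteration sampler.
    """
    x = consistency_fn(rng.gauss(0.0, T_MAX), T_MAX)  # first jump from noise
    for tau in time_points:  # decreasing times between EPS and T_MAX
        # Re-noise the current estimate to level tau, then jump back again.
        x_tau = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_tau, tau)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [multistep_sample([1.0], rng) for _ in range(20000)]
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    print(f"2-iteration sample std: {std:.3f} (toy data std: {DATA_STD})")
```

Each extra entry in `time_points` adds one function evaluation, which is the sense in which a sampler of this kind trades a small number of iterations for quality.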

Efficient Parallel Audio Generation Using Group Masked Language Modeling

SoundStorm: Efficient Parallel Audio Generation

Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

AudioLM: a Language Modeling Approach to Audio Generation

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

Efficient Neural Music Generation

Parallel Synthesis for Autoregressive Speech Generation

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Whisper-GPT: A Hybrid Representation Audio Large Language Model

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Parallel and High-Fidelity Text-to-Lip Generation