Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
2023-05-29
Abstract: The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt an instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior work on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. Consequently, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas prior methods take a random mix.
Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound
What problem does this paper attempt to address?
The paper addresses the problem of text-to-audio (TTA) generation and proposes a new method that improves on existing approaches. Specifically:

1. **Adopting an Instruction-Tuned Large Language Model (LLM)**: The paper uses Flan-T5 as the text encoder, replacing the joint text-audio encoders used in prior work (such as CLAP). This improves text understanding and overall audio generation quality without fine-tuning the text encoder, which stays frozen.
2. **Data Augmentation Method**: Unlike previous methods that combine audio clips at random, the paper mixes them based on audio pressure levels, so that each source clip is well represented in the mixture.

With these improvements, the proposed model TANGO outperforms the state-of-the-art model AudioLDM on most metrics on the AudioCaps test set and stays comparable on the rest, even though its LDM is trained on a dataset 63 times smaller. Experimental results also show that TANGO performs well in subjective evaluations of audio quality and relevance. Overall, the study demonstrates the potential of instruction-tuned large language models for text-to-audio generation.
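
To make the pressure-level-based mixing idea more concrete, here is a minimal Python sketch. The exact formula the paper uses is not given in this summary, so the sketch assumes a formulation in the style of between-class learning: the mixing weight is derived from the level difference of the two clips in dB, and the result is rescaled to keep its energy comparable. The function names `rms_db` and `mix_by_level` are hypothetical, and RMS level is used here as a simple stand-in for pressure level.

```python
import math


def rms_db(signal, eps=1e-12):
    """Level of a signal in dB from its mean squared amplitude.

    Assumption: RMS level serves as a proxy for audio pressure level.
    """
    mean_sq = sum(s * s for s in signal) / len(signal)
    return 10.0 * math.log10(mean_sq + eps)


def mix_by_level(x1, x2):
    """Mix two clips so that neither one drowns out the other.

    Assumed formulation (not confirmed by the summary above):
        p   = 1 / (1 + 10 ** ((G1 - G2) / 20))
        mix = (p * x1 + (1 - p) * x2) / sqrt(p**2 + (1 - p)**2)
    where G1 and G2 are the levels of x1 and x2 in dB.
    Returns the mixed samples and the weight p given to x1.
    """
    g1, g2 = rms_db(x1), rms_db(x2)
    # The louder clip receives the smaller weight.
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0))
    norm = math.sqrt(p * p + (1.0 - p) ** 2)
    n = min(len(x1), len(x2))  # truncate to the shorter clip
    mixed = [(p * x1[i] + (1.0 - p) * x2[i]) / norm for i in range(n)]
    return mixed, p


# Toy example: a loud tone and a tone that is 20 dB quieter.
loud = [math.sin(2 * math.pi * 5 * i / 1000) for i in range(1000)]
quiet = [0.1 * math.sin(2 * math.pi * 7 * i / 1000) for i in range(1000)]
mixed, weight_on_loud = mix_by_level(loud, quiet)
print(f"weight on the loud clip: {weight_on_loud:.4f}")
```

Under this assumed weighting, the louder clip is scaled down more than the quieter one, so both contribute at roughly the same level. A uniform random mix, by contrast, can leave a quiet source nearly inaudible in the training example.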