Abstract: Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/.
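To make the "energy contour" style condition more concrete, here is a minimal sketch of one common way to derive such a contour: frame-wise root-mean-square (RMS) energy over a mono waveform. The abstract does not specify how the authors extract the contour, so the function name `energy_contour` and the `frame_length` and `hop_length` defaults are assumptions chosen for illustration only.

```python
import math


def energy_contour(samples, frame_length=400, hop_length=160):
    """Return the RMS energy of each frame of a mono waveform.

    Hypothetical helper: frame_length and hop_length are illustrative
    defaults (25 ms and 10 ms at 16 kHz), not values from the paper.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")

    contour = []
    # Slide a window over the signal; a signal shorter than one frame
    # still yields a single, shorter frame.
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        if not frame:
            break  # empty input: no frames to measure
        mean_square = sum(x * x for x in frame) / len(frame)
        contour.append(math.sqrt(mean_square))
    return contour


if __name__ == "__main__":
    # Half a second of a 440 Hz tone followed by half a second of silence.
    sample_rate = 16000
    tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(8000)]
    silence = [0.0] * 8000
    contour = energy_contour(tone + silence)
    print(len(contour), contour[0], contour[-1])
```

A contour like this gives one value per frame, so it can in principle be aligned with the frames of the generated audio; how the paper's condition encoder actually consumes it is not stated in the abstract.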

What problem does this paper attempt to address?

The paper aims to address the limited controllability of existing text-based audio generation models. When audio is generated from text alone, the model cannot fully control fine-grained attributes such as temporal order, pitch, and energy. To solve this problem, the research team proposes a method that enhances the controllability of existing pre-trained text-to-audio (TTA) models by introducing additional conditions: content timestamps and style-related pitch and energy contours. The main contributions of the paper can be summarized as follows:

1. **Introduction of a multi-condition audio generation task**: The authors propose a new task that uses text and other control conditions (such as timestamps, pitch contours, and energy contours) to guide audio generation, thereby achieving fine-grained customization of audio content and style.
2. **Design of a new dataset and evaluation metrics**: Due to the lack of datasets and evaluation metrics suitable for this task, the authors integrated existing datasets to create a new dataset and designed a series of evaluation metrics, which can serve as benchmarks for future related work.
3. **Improved audio generation model**: A new model based on an existing pre-trained TTA model is proposed, which accepts not only text as conditional input but also other control conditions, thereby achieving more precise control over the audio generation process. Experimental results demonstrate the effectiveness of this model in generating audio with higher controllability.

In summary, the goal of this paper is to achieve a higher level of controllability and flexibility in audio generation, especially in application scenarios such as video creation, virtual reality, and interactive systems, where precise control over audio features is needed.
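As an illustration of what a timestamp (content) condition might look like in practice, the sketch below converts a list of sound events with onset and offset times into a frame-level binary activation matrix, one row per sound class. This is an assumed representation for illustration; the summary does not describe the paper's actual encoding, and `timestamps_to_frames`, its parameters, and the example labels are all hypothetical.

```python
import math


def timestamps_to_frames(events, classes, duration, frame_rate=25):
    """Build a class-by-frame binary matrix from timed sound events.

    events:     list of (label, onset_seconds, offset_seconds)
    classes:    ordered list of class labels (one matrix row each)
    duration:   clip length in seconds
    frame_rate: frames per second (illustrative default, not from the paper)
    """
    n_frames = int(round(duration * frame_rate))
    row_of = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * n_frames for _ in classes]

    for label, onset, offset in events:
        if label not in row_of:
            raise KeyError(f"unknown sound class: {label!r}")
        # Mark every frame that overlaps the event, clipped to the clip bounds.
        first = max(0, math.floor(onset * frame_rate))
        last = min(n_frames, math.ceil(offset * frame_rate))
        for t in range(first, last):
            matrix[row_of[label]][t] = 1
    return matrix


if __name__ == "__main__":
    events = [("dog bark", 0.0, 1.0), ("bell", 0.5, 2.0)]
    matrix = timestamps_to_frames(events, ["dog bark", "bell"], duration=2.0, frame_rate=4)
    for label, row in zip(["dog bark", "bell"], matrix):
        print(f"{label:>8}: {row}")
```

A matrix of this kind makes temporal order explicit: overlapping events simply activate the same frames in different rows, which text prompts alone cannot express precisely.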

Audio Generation with Multiple Conditional Diffusion Model

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Controllable Generation with Text-to-Image Diffusion Models: A Survey

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Audio Conditioning for Music Generation via Discrete Bottleneck Features

Text Diffusion with Reinforced Conditioning

Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation