Audio Generation with Multiple Conditional Diffusion Model

Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
2023-12-28
Abstract: Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
Sound, Computation and Language, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limited controllability of existing text-based audio generation models. When audio is generated from text alone, the model cannot fully control fine-grained attributes such as temporal order, pitch, and energy. To solve this problem, the authors enhance the controllability of existing pre-trained text-to-audio (TTA) models by adding further conditions: timestamps for content, and pitch and energy contours for style.

The main contributions of the paper can be summarized as follows:

1. **Introduction of a multi-condition audio generation task**: The authors propose a new task that uses text together with other control conditions (timestamps, pitch contours, and energy contours) to guide audio generation, allowing fine-grained customization of audio content and style.
2. **Design of a new dataset and evaluation metrics**: Because no existing dataset or set of evaluation metrics suits this task, the authors consolidate existing datasets into a new one that pairs audio with the corresponding conditions, and they use a series of metrics to evaluate controllability. Both can serve as benchmarks for future related work.
3. **Improved audio generation model**: The authors propose a model built on an existing pre-trained TTA model that accepts the additional control conditions as input alongside text, giving more precise control over the generation process. Experimental results show that the model generates audio with higher controllability.

In summary, the goal of the paper is more controllable and flexible audio generation, particularly for applications such as video creation, virtual reality, and interactive systems, where precise control over audio features matters.
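To make the control conditions more concrete, the sketch below shows one plausible way to represent two of them as frame-level signals: an energy contour computed as frame-wise RMS, and timestamps expanded into a binary class-by-frame grid. This is an illustration only, not the authors' code. The function names, the frame and hop sizes, the RMS definition of energy, and the grid layout are all assumptions; the summary above does not specify how the paper extracts or encodes these conditions.

```python
import math


def energy_contour(samples, frame_len=1024, hop=512):
    """Frame-wise RMS energy of a mono signal given as a list of floats.

    Assumption: energy is defined as RMS per frame; the paper may differ.
    """
    contour = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_len]
        if not frame:
            break  # empty input: no frames to report
        contour.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return contour


def timestamps_to_frames(events, classes, duration, frame_rate):
    """Expand (label, onset_sec, offset_sec) events into a binary grid.

    Returns a dict mapping each class name to a list of 0/1 flags,
    one per frame. Hypothetical layout chosen for illustration.
    """
    n_frames = int(round(duration * frame_rate))
    grid = {name: [0] * n_frames for name in classes}
    for label, onset, offset in events:
        first = max(0, math.floor(onset * frame_rate))
        last = min(n_frames, math.ceil(offset * frame_rate))
        for i in range(first, last):
            grid[label][i] = 1
    return grid


# A constant-amplitude signal has the same RMS in every frame.
flat = energy_contour([1.0] * 2048)

# A dog bark from 1 s to 2 s in a 4 s clip, at 2 frames per second.
grid = timestamps_to_frames(
    [("dog", 1.0, 2.0)], ["dog", "cat"], duration=4.0, frame_rate=2
)
```

In a setup like the one the paper describes, signals of this kind would be passed to the trainable condition encoder and Fusion-Net while the pre-trained TTA model stays frozen; that wiring is not shown here.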