Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

Tianjiao Du, Jun Chen, Jiasheng Lu, Qinmei Xu, Huan Liao, Yupeng Chen, Zhiyong Wu
DOI: https://doi.org/10.1109/icme57554.2024.10687830
2024-01-01
Abstract: The controllability of text-to-audio (TTA) systems is limited because audio is generated from text alone, which leads to temporal disorganization and semantic omission. Some studies have integrated additional conditions, such as frame-level annotations of sound events, to regulate the generated audio content. However, this requires a substantial amount of paired data and time for fine-tuning or training the model. This paper introduces a novel, training-free approach for controllable TTA generation based on temporal conditions, e.g., the location and duration of the corresponding sound events. By updating latent variables during the inference process, our approach ensures that the content generated by pretrained TTA models adheres to the specified temporal conditions, thereby achieving precise temporal control. Experimental results show that the proposed approach effectively controls the onset and offset of the sound events indicated by the text, while preserving the high-quality and diverse generation capabilities of the diffusion model.
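The abstract does not give the guidance objective, so the following is only a toy sketch of the general idea of nudging a latent toward a temporal condition at inference time, without any training. Everything in it is an assumption made for illustration: the function names (`temporal_mask`, `activation`, `guidance_loss`, `guidance_step`) are hypothetical, the per-frame "event activation" is a stand-in (sigmoid of the channel mean) rather than anything taken from the paper, and a real system would interleave such updates with the denoising steps of a pretrained TTA diffusion model.

```python
import numpy as np

def temporal_mask(num_frames, onset, offset):
    """Binary temporal condition: 1 for frames in [onset, offset), else 0."""
    mask = np.zeros(num_frames)
    mask[onset:offset] = 1.0
    return mask

def activation(latent):
    """Toy per-frame event activation (assumption): sigmoid of the channel mean."""
    return 1.0 / (1.0 + np.exp(-latent.mean(axis=1)))

def guidance_loss(latent, mask):
    """Mean squared error between the toy activation and the target mask."""
    return float(np.mean((activation(latent) - mask) ** 2))

def guidance_step(latent, mask, step_size):
    """One gradient-descent update of the latent; no model weights are changed."""
    num_frames, num_channels = latent.shape
    a = activation(latent)
    # Chain rule: dL/da = 2(a - m)/T, da/ds = a(1 - a), ds/dz = 1/D
    grad_per_frame = 2.0 * (a - mask) / num_frames * a * (1.0 - a) / num_channels
    grad = np.repeat(grad_per_frame[:, None], num_channels, axis=1)
    return latent - step_size * grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(100, 8))      # (frames, channels), hypothetical shape
    mask = temporal_mask(100, 20, 60)       # event should occupy frames 20..59
    print("loss before:", guidance_loss(latent, mask))
    for _ in range(50):
        latent = guidance_step(latent, mask, step_size=400.0)
    print("loss after:", guidance_loss(latent, mask))
```

Only the latent is modified in this sketch, which is what makes the procedure training-free: the generator itself stays frozen and the temporal condition enters solely through the loss.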