AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

Yuhang Jia, Yang Chen, Jinghua Zhao, Shiwan Zhao, Wenjia Zeng, Yong Chen, Yong Qin
2024-09-29
Abstract: Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high-quality, diverse and instruction-relevant audio. However, beyond generation, the task of audio editing remains equally important but has received comparatively little attention. Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited sections. While workflows based on LDMs have effectively addressed these challenges in the field of image processing, similar approaches have scarcely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits. Code and demo can be found at https://github.com/NKU-HLT/AudioEditor.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the two central difficulties of audio editing: making precise edits and leaving the unedited parts intact. Although diffusion-based text-to-audio (TTA) generation has made remarkable progress, audio editing has received far less attention. The two challenges are:

1. **Performing precise editing**: modifying the targeted parts of the audio accurately without affecting the other parts.
2. **Preserving unedited parts**: keeping the unedited parts of the original audio unchanged while the edit is applied.

Workflows based on latent diffusion models (LDMs) have addressed these challenges effectively in image processing, but comparable approaches have rarely been applied to audio editing.

The paper proposes **AudioEditor**, a training-free audio editing framework built on a pretrained diffusion-based TTA model. It combines the **Null-text Inversion** and **EOT-suppression** methods so that the model preserves the original audio features while executing accurate edits.

The main contributions of AudioEditor are:

1. **Flexible editing**: Users only need to provide the target description and specify the text area to be edited; AudioEditor then locates and edits the corresponding audio components automatically.
2. **Competitive performance**: Adapting Null-text Inversion and EOT-suppression, two techniques from image processing, improves both the precision of the edits and the preservation of the original audio features.
3. **No training required**: The framework relies only on a pretrained diffusion model and needs no training or fine-tuning on editing-specific datasets.

The paper validates the effectiveness of AudioEditor through objective and subjective experiments.
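
The summary names Null-text Inversion without describing how it works. In the image-editing literature, that method starts from deterministic DDIM inversion, which maps a clean latent to a noisy one and records the trajectory, and then tunes the unconditional ("null-text") embedding so that guided sampling stays close to that trajectory. The following is a minimal sketch of the inversion and reconstruction steps only, not AudioEditor's implementation: the noise schedule `alpha_bar`, the helper names, and the toy noise predictors are all hypothetical stand-ins for a real pretrained TTA model.

```python
import math

def alpha_bar(t, num_steps):
    """Toy cumulative schedule (assumption): 1.0 at t=0 (clean), 0.1 at t=num_steps."""
    return 1.0 - 0.9 * t / num_steps

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM move from noise level a_from to a_to, given predicted noise eps."""
    # Estimate the clean latent implied by the current sample and the predicted noise.
    x0_pred = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
               for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the target noise level with the same predicted noise.
    return [math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei
            for x0, ei in zip(x0_pred, eps)]

def invert(x0, eps_model, num_steps):
    """Run DDIM steps from t=0 up to t=num_steps and return every intermediate latent."""
    x = list(x0)
    trajectory = [x]
    for t in range(num_steps):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alpha_bar(t, num_steps), alpha_bar(t + 1, num_steps))
        trajectory.append(x)
    return trajectory

def reconstruct(x_T, eps_model, num_steps):
    """Run DDIM steps from t=num_steps back down to t=0."""
    x = list(x_T)
    for t in range(num_steps, 0, -1):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps))
    return x

def constant_eps(x, t):
    """Hypothetical predictor that ignores its input; the round trip is then exact."""
    return [0.5] * len(x)

def input_dependent_eps(x, t):
    """Hypothetical predictor that depends on x; the round trip is only approximate."""
    return [0.1 * xi for xi in x]

if __name__ == "__main__":
    latent = [0.3, -1.2, 0.8]
    for model in (constant_eps, input_dependent_eps):
        noisy = invert(latent, model, num_steps=50)[-1]
        restored = reconstruct(noisy, model, num_steps=50)
        error = max(abs(a - b) for a, b in zip(latent, restored))
        print(model.__name__, "max reconstruction error:", error)
```

With a predictor that depends on its input, the inversion and reconstruction steps evaluate the noise at different points, so the round trip drifts. That drift grows under classifier-free guidance, and it is the gap that optimizing the null-text embedding against the stored trajectory is meant to close.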