Abstract: Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high-quality, diverse, and instruction-relevant audio. However, beyond generation, the task of audio editing remains equally important but has received comparatively little attention. Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited sections. While workflows based on LDMs have effectively addressed these challenges in the field of image processing, similar approaches have scarcely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits. Code and demo can be found at https://github.com/NKU-HLT/AudioEditor.

What problem does this paper attempt to address?

This paper addresses the challenge of editing audio precisely while leaving the rest of the audio unchanged. Although diffusion-based text-to-audio (TTA) generation has made remarkable progress, audio editing still faces two main difficulties:

1. **Performing precise edits**: accurately modifying specific parts of the audio without affecting the other parts.
2. **Preserving unedited parts**: keeping the unedited parts of the original audio unchanged while the edit is applied.

Workflows based on latent diffusion models (LDMs) have effectively addressed these problems in image processing, but they have rarely been applied to audio editing. The paper proposes **AudioEditor**, a training-free audio editing framework built on a pretrained diffusion-based TTA model. It combines the **Null-text Inversion** and **EOT-suppression** methods to make accurate edits while preserving the features of the original audio.

The main contributions of AudioEditor are:

1. **Flexible editing**: Users only need to provide the target description and specify the part of the text to be edited; AudioEditor then locates and edits the corresponding audio components automatically.
2. **Competitive performance**: Adapting image-editing techniques such as Null-text Inversion and EOT-suppression improves both the precision of the edits and the preservation of the original audio features.
3. **No training required**: It relies only on pretrained diffusion models and achieves high-quality audio editing without training or fine-tuning on dedicated editing datasets.

The paper validates AudioEditor through a series of objective and subjective experiments.
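As a rough illustration of why inversion helps preserve unedited content: Null-text Inversion, as originally described for image editing, starts from a DDIM inversion of the input latent. The sketch below shows only that deterministic DDIM inversion and re-sampling round trip on a toy vector with NumPy. The noise predictors, the noise schedule, and all function names are assumptions made for illustration; they are not AudioEditor's code or the API of any TTA model, and the null-text optimization and EOT-suppression steps are not shown.

```python
import numpy as np


def ddim_step(x, a_from, a_to, eps):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    # Predict the clean latent implied by the current latent and noise estimate.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    # Re-noise that prediction to the target level with the same noise estimate.
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(x0, alphas_bar, eps_fn):
    """Walk a clean latent towards noise, keeping every intermediate latent."""
    traj = [x0]
    for i in range(len(alphas_bar) - 1):
        x = traj[-1]
        traj.append(ddim_step(x, alphas_bar[i], alphas_bar[i + 1], eps_fn(x, i)))
    return traj


def ddim_sample(x_T, alphas_bar, eps_fn):
    """Walk a noisy latent back to the clean level with the same update rule."""
    x = x_T
    for i in range(len(alphas_bar) - 1, 0, -1):
        x = ddim_step(x, alphas_bar[i], alphas_bar[i - 1], eps_fn(x, i))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(8)               # stand-in for an audio latent
    alphas_bar = np.linspace(0.9999, 0.05, 51)  # hypothetical noise schedule

    # Hypothetical latent-dependent noise predictor (a real model is a network).
    toy_eps = lambda x, i: 0.3 * np.tanh(x)

    x_T = ddim_invert(x0, alphas_bar, toy_eps)[-1]
    x_rec = ddim_sample(x_T, alphas_bar, toy_eps)
    print("round-trip error:", np.linalg.norm(x_rec - x0))
```

When the noise estimate does not depend on the latent, the round trip is exact; with a latent-dependent predictor it is only approximate. In the original image-editing formulation, Null-text Inversion reduces the remaining gap under classifier-free guidance by optimizing the unconditional ("null-text") embedding at each step.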

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

Prompt-guided Precise Audio Editing with Diffusion Models

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Controllable Text-to-Audio Generation with Training-Free Temporal Guidance Diffusion

E3 TTS: Easy End-to-End Diffusion-based Text to Speech

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

AudioDiffusion: Generating High-Quality Audios from EEG Signals: Reconstructing Audio from EEG Signals

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models