Abstract: In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims to generate new audio-visual content by editing the given sounding event conditioned on language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. First, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual sample. Second, we introduce a cross-modal semantic enhancement approach. We observe that when language is used as content editing guidance, the vision branch may overlook editing requirements. This phenomenon, termed catastrophic neglect, hampers audio-visual alignment during content editing. We therefore enhance semantic consistency between language and vision to mitigate this issue. Extensive experiments validate the effectiveness of our method in language-based audio-visual editing and highlight its superiority over several baseline approaches. We recommend that readers visit our project page for more details: https://liangsusan-git.github.io/project/avedit/.

Soundini: Sound-Guided Diffusion for Natural Video Editing

DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

DNI: Dilutional Noise Initialization for Diffusion Video Editing

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

Soundify: Matching Sound Effects to Video

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Speech driven video editing via an audio-conditioned diffusion model

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

AudioScenic: Audio-Driven Video Scene Editing

DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Self-Supervised Audio-Visual Soundscape Stylization

Visual to Sound: Generating Natural Sound for Videos in the Wild

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion