Pix2Video: Video Editing using Image Diffusion

Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra

2023-03-23

Abstract: Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on ArXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores how to use pre-trained image diffusion models for text-guided video editing. Specifically:

1. **Problem Background**: Large-scale image generation models (such as those based on diffusion processes) have made significant progress and support high-quality image editing, but their application to video editing is still in its infancy. Applying an image generation method to each video frame independently leads to results that are inconsistent across frames.
2. **Main Challenges**: The central difficulty is achieving the target edits while preserving the content of the source video. The edited frames must be visually coherent with one another and consistent with the text prompt.
3. **Solution**: A two-step strategy is proposed:
   - First, use a pre-trained structure-guided (e.g., depth-conditioned) image diffusion model to perform a text-guided edit on an anchor frame.
   - Then, progressively propagate the changes to future frames through self-attention feature injection, and adjust the latent code of each frame to consolidate these changes before continuing.
4. **Advantages**: The method is training-free and generalizes to a wide range of edits. Its effectiveness is validated through extensive experiments, including comparisons against four prior and concurrent methods. The results show that realistic text-guided video editing is possible without compute-intensive preprocessing or video-specific fine-tuning.
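The propagation step can be illustrated with a small sketch. The abstract only states that changes are propagated "via self-attention feature injection", so the details below are assumptions made for illustration: queries are computed from the current frame, while keys and values are computed from the concatenated features of the anchor frame and the previous frame. The function name `injected_self_attention`, the projection matrices, and the toy shapes are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def injected_self_attention(cur, anchor, prev, w_q, w_k, w_v):
    """Attention for the current frame with injected features (illustrative).

    cur, anchor, prev: (tokens, dim) features of the current, anchor, and
    previous frames at one self-attention layer (assumed layout).
    w_q, w_k, w_v: (dim, dim) projection matrices of that layer.
    """
    # Queries come from the frame being edited.
    q = cur @ w_q
    # Assumption: keys and values come from already-edited frames.
    source = np.concatenate([anchor, prev], axis=0)
    k = source @ w_k
    v = source @ w_v
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy example with random features: 16 tokens per frame, 8 channels.
rng = np.random.default_rng(0)
tokens, dim = 16, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
anchor = rng.standard_normal((tokens, dim))
prev = rng.standard_normal((tokens, dim))
cur = rng.standard_normal((tokens, dim))

out, weights = injected_self_attention(cur, anchor, prev, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Under these assumptions, each token of the current frame attends over the tokens of both source frames, so the output keeps the per-token layout of the current frame while its content is a weighted mix of features from frames that have already been edited. This is a plausible reading of why the injection encourages a consistent appearance across frames without any training.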

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Structure and Content-Guided Video Synthesis with Diffusion Models

Dreamix: Video Diffusion Models are General Video Editors

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models

High-Fidelity Diffusion Editor for Zero-Shot Text-Guided Video Editing

Edit Temporal-Consistent Videos with Image Diffusion Model

InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing

Pathways on the Image Manifold: Image Editing via Video Generation

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

PRedItOR: Text Guided Image Editing with Diffusion Prior