Pix2Video: Video Editing using Image Diffusion

Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra
2023-03-23
Abstract: Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on ArXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how to use pre-trained image diffusion models for text-guided video editing. Specifically:

1. **Problem Background**: Although large-scale image generation models (such as those based on diffusion processes) have made significant progress in image generation and can support high-quality image editing applications, their application in the field of video editing is still in its infancy. Directly applying image generation methods to video frame sequences leads to inconsistent results.
2. **Main Challenges**: How to achieve the target edits while maintaining the content of the source video. This requires ensuring that the edited results are visually coherent and consistent with the text prompts.
3. **Solution**: A two-step strategy is proposed:
   - First, use a pre-trained structure-guided image diffusion model to perform text-guided editing on anchor frames.
   - Then, propagate the changes to future frames through self-attention feature injection and adjust the latent codes to consolidate these changes.
4. **Advantages**: This method does not require additional training, is suitable for a wide range of editing tasks, and its effectiveness is validated through extensive experiments, comparing it with four different existing methods. The results show that this method can achieve realistic text-guided video editing without computationally intensive preprocessing or video-specific fine-tuning.
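
The propagation step can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that "self-attention feature injection" means each later frame computes its queries from its own features while taking keys and values from the anchor frame and the previous frame, a detail the summary above does not spell out. The function names, the single attention head, and the choice of frame 0 as the anchor are all hypothetical, and the latent-code adjustment is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_feats, kv_feats, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    # q_feats: (n_query_tokens, dim), kv_feats: (n_source_tokens, dim).
    q = q_feats @ w_q
    k = kv_feats @ w_k
    v = kv_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def edit_frames_with_injection(frame_feats, w_q, w_k, w_v):
    # Assumption: frame 0 is the anchor and uses ordinary self-attention.
    anchor = frame_feats[0]
    outputs = [attention(anchor, anchor, w_q, w_k, w_v)]
    prev = anchor
    for feats in frame_feats[1:]:
        # Assumed injection rule: keys/values come from the anchor and the
        # previous frame rather than from the current frame itself.
        injected = np.concatenate([anchor, prev], axis=0)
        outputs.append(attention(feats, injected, w_q, w_k, w_v))
        prev = feats
    return outputs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, tokens = 4, 6
    frames = [rng.normal(size=(tokens, dim)) for _ in range(3)]
    w = np.eye(dim)  # identity projections keep the toy example readable
    outs = edit_frames_with_injection(frames, w, w, w)
    print([o.shape for o in outs])
```

Under this assumption, every output token of a later frame is a weighted average of anchor and previous-frame values, which is one plausible way the edited appearance could stay consistent across frames without any training.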