I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, Xingang Pan
2024-05-26
Abstract: The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing, which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction to align basic motion patterns with the original video, and Appearance Refinement for precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing, proving its capability to produce high-quality, temporally consistent outputs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper introduces I2VEdit, a first-frame-guided video editing method that extends an edit made on a single frame to the entire video using an image-to-video diffusion model. Image editing tools and techniques have developed rapidly, but video editing faces additional challenges in the time dimension. I2VEdit uses a pre-trained image-to-video model to propagate the user's first-frame edits to the whole video while adaptively preserving the visual appearance and motion consistency of the source video.
The main issues identified in the paper are:
1. Existing video editing methods are often limited to specific types of editing tasks, such as global style transfer, and struggle with fine-grained local edits or structural changes.
2. Although some methods use optical flow or depth maps to maintain temporal consistency, they may fail to generate high-quality videos while preserving spatial appearance consistency with the source video.
To address these issues, I2VEdit proposes two key processes:
- Coarse Motion Extraction: aligns basic motion patterns with the original video by training skip-interval motion LoRAs (low-rank adaptations).
- Appearance Refinement: uses fine-grained attention matching for precise adjustments that accommodate different levels of structural change.
Additionally, the paper introduces a technique called Smoothed Area Random Perturbation (SARP) to improve the reverse sampling of the deterministic diffusion model, and a skip-interval strategy to reduce quality degradation during long video edits. Experimental results demonstrate that I2VEdit performs well in fine-grained video editing, producing high-quality and temporally consistent outputs, which shows its potential for extending existing image editing methods to video.
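For readers unfamiliar with the low-rank adaptation (LoRA) idea behind the motion LoRAs mentioned above, the following is a minimal sketch of the generic LoRA weight update, W' = W + (alpha / r) * B A. It is not the paper's training code: the function names `matmul` and `lora_weight`, the matrix sizes, and the scaling convention are assumptions chosen for illustration, and the sketch does not show which layers of the image-to-video model are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    Shapes (illustrative): W is (d_out, d_in), A is (r, d_in), B is (d_out, r).
    Only A and B would be trained; W stays frozen.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical frozen 3x4 base weight and a rank-1 adapter.
random.seed(0)
d_out, d_in, rank = 3, 4, 1
w = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]
b_zero = [[0.0] * rank for _ in range(d_out)]  # common initialisation: B = 0

# With B = 0 the adapted weight equals the frozen weight exactly.
assert lora_weight(w, a, b_zero, alpha=1.0) == w
```

Because the update is the product of two thin matrices, the number of trainable parameters is r * (d_in + d_out) rather than d_in * d_out, which is what makes fitting a separate adapter to a single short clip practical.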
Compared to existing methods, I2VEdit offers greater flexibility in local edits and improves visual editing quality due to the superiority of its underlying image editing methods.
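The auto-regressive, clip-by-clip generation that the abstract refers to can be pictured with the sketch below, in which each clip is conditioned on the last edited frame of the previous clip. This shows only the plain chaining that the skip-interval strategy is meant to improve on; the skip-interval mechanism itself is not described in enough detail here to reproduce. The names `split_into_clips`, `edit_video`, and the `generate_clip` callback are hypothetical stand-ins, not the paper's API.

```python
def split_into_clips(num_frames, clip_len):
    """Split frame indices into clips that overlap by one frame.

    Every clip after the first starts at the last frame of the previous
    clip, so that frame can serve as the next first-frame condition.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2")
    clips = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
        if end >= num_frames:
            break
        start = end - 1  # reuse the last frame as the next clip's first frame
    return clips


def edit_video(frames, edited_first_frame, clip_len, generate_clip):
    """Edit clips one after another, chaining the last output frame forward.

    generate_clip(condition, source_frames) is a hypothetical stand-in for
    an image-to-video model call; it must return one frame per source frame,
    with the condition frame first.
    """
    output = []
    condition = edited_first_frame
    for clip in split_into_clips(len(frames), clip_len):
        source = [frames[i] for i in clip]
        edited = generate_clip(condition, source)
        # Drop the duplicated first frame for every clip after the first.
        output.extend(edited if not output else edited[1:])
        condition = edited[-1]
    return output


# Toy usage: integers stand in for frames, and a dummy "model" multiplies by 10.
dummy = lambda cond, src: [cond] + [s * 10 for s in src[1:]]
result = edit_video(list(range(7)), -1, 4, dummy)
assert result == [-1, 10, 20, 30, 40, 50, 60]
```

In this naive scheme every clip inherits whatever artifacts accumulated in the frame it is conditioned on, which is the kind of quality degradation over long videos that the paper's skip-interval strategy is intended to mitigate.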