Abstract: Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.

What problem does this paper attempt to address?

This paper addresses two key problems in image editing:

1. **Edit accuracy**: Existing image-editing models often fail to follow complex edit instructions accurately, so the result does not match the user's intent.
2. **Fidelity to the original image**: These models often make unnecessary changes to key elements of the original image during editing, which distorts it.

To address these problems, the authors reformulate image editing as a video-generation task. Specifically, they use a pretrained image-to-video model to create a smooth transition from the original image to the target edit. By traversing the image manifold continuously, this approach keeps edits consistent while retaining the key features of the original image.

### Main contributions

1. **Redefining image editing as a video-generation task**: Introducing temporal coherence creates an editing path on the natural-image manifold, which enables high-fidelity image manipulation while maintaining the characteristics of the source image.
2. **Proposing the Frame2Frame framework**, which consists of three main steps:
   1. **Temporal editing caption**: Use a pretrained vision-language model (VLM) to generate text describing the editing process.
   2. **Video-based editing generation**: Use a pretrained image-to-video model to generate a temporally coherent frame sequence.
   3. **Automatic frame selection**: Select the frame from the generated video that best meets the editing intention as the final result.
3. **Comprehensive evaluation**: Experiments on the TEdBench and PosEdit datasets show that the method achieves state-of-the-art results in both edit accuracy and source-image fidelity.

### Method overview

1. **Temporal editing caption**: Convert the editing instruction into a temporal caption that describes the editing process, for example, "A person slowly makes a heart-shaped gesture with his hands".
2. **Video generation**: Use a pretrained image-to-video model such as CogVideoX to generate a multi-frame video sequence, in which each frame represents an intermediate state between the source image and the target edit.
3. **Frame selection**: Automatically select the frame that best achieves the editing intention, which avoids manual frame-by-frame review.

### Experimental results

- **Quantitative evaluation**: Results on the TEdBench and PosEdit datasets show that the method outperforms existing methods on multiple metrics, such as LPIPS, CLIP-I, and CLIP.
- **Qualitative evaluation**: The generated images reflect the editing intention more accurately and better retain the content and structure of the original image.

In summary, the paper presents a novel image-editing method that addresses the shortcomings of existing techniques in edit accuracy and fidelity.
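The three steps described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: `frame2frame_edit`, `select_frame`, `caption_fn`, `video_fn`, `score_fn`, and `threshold` are all hypothetical names, and the three callables stand in for a VLM captioner, an image-to-video model, and an edit-quality scorer. The selection rule shown here (take the earliest frame whose score reaches a threshold, on the assumption that earlier frames stay closer to the source image) is likewise an assumption made for the sketch.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any frame representation (array, PIL image, ...)


def select_frame(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the earliest frame whose score reaches `threshold`.

    If no frame reaches the threshold, fall back to the highest-scoring frame.
    This rule is an assumption of the sketch, not a documented algorithm.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    for index, score in enumerate(scores):
        if score >= threshold:
            return index
    return max(range(len(scores)), key=lambda i: scores[i])


def frame2frame_edit(
    source_image: Frame,
    instruction: str,
    caption_fn: Callable[[Frame, str], str],        # hypothetical VLM captioner
    video_fn: Callable[[Frame, str], List[Frame]],  # hypothetical image-to-video model
    score_fn: Callable[[Frame, str], float],        # hypothetical edit scorer
    threshold: float = 0.5,
) -> Frame:
    # Step 1: turn the edit instruction into a temporal caption.
    temporal_caption = caption_fn(source_image, instruction)
    # Step 2: generate frames that move from the source toward the edit.
    frames = video_fn(source_image, temporal_caption)
    # Step 3: score every frame against the instruction and pick one.
    scores = [score_fn(frame, instruction) for frame in frames]
    return frames[select_frame(scores, threshold)]


# Toy stand-ins: a "frame" is just a number giving the edit progress in [0, 1].
toy_result = frame2frame_edit(
    source_image=0.0,
    instruction="make a heart-shaped gesture",
    caption_fn=lambda image, text: f"A person slowly starts to {text}",
    video_fn=lambda image, caption: [0.0, 0.25, 0.5, 0.75, 1.0],
    score_fn=lambda frame, text: frame,
)
print(toy_result)  # 0.5
```

Passing the models in as callables keeps the sketch independent of any particular VLM or video model. In a real setup, `video_fn` would wrap an image-to-video model such as CogVideoX and `score_fn` would wrap whatever judge is used to rate how well a frame matches the instruction.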

Pathways on the Image Manifold: Image Editing via Video Generation

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Pix2Video: Video Editing using Image Diffusion

Dreamix: Video Diffusion Models are General Video Editors

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Temporally Consistent Object Editing in Videos using Extended Attention

Imagic: Text-Based Real Image Editing with Diffusion Models

Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model

Diffusion Model-Based Video Editing: A Survey

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models

FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

Structure and Content-Guided Video Synthesis with Diffusion Models

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models

Stitch it in Time: GAN-Based Facial Editing of Real Videos