SeedEdit: Align Image Re-Generation to Image Editing

Yichun Shi, Peng Wang, Weilin Huang
2024-11-11
Abstract: We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining the original image, i.e. image reconstruction, and generating a new image, i.e. image re-generation. To this end, we start from a weak generator (text-to-image model) that creates diverse pairs between such two directions and gradually align it into a strong image editor that well balances between the two tasks. SeedEdit can achieve more diverse and stable editing capability over prior image editing methods, enabling sequential revision over images generated by diffusion models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a central challenge in image editing: balancing **image reconstruction** (preserving the original image) against **image re-generation** (producing new content).

### Specific Problem Description:

1. **Insufficient Controllability in Image Editing**:
   - Current diffusion models can generate realistic and diverse images from text descriptions, but the outputs are difficult to control. Generation is more like "rolling the dice" until a good output appears.
   - Better control requires a way to modify an input image according to text instructions, i.e., **instruction-based image editing**.
2. **Limitations of Existing Methods**:
   - **Training-free Methods**: These methods combine techniques such as DDIM inversion, test-time fine-tuning, and attention control to reconstruct the input image and generate a new one. Because both the reconstruction and the re-generation steps are unstable, errors accumulate in the edited image, and the output can end up inconsistent with the input image or with the target description.
   - **Data-driven Methods**: These methods train instruction-based diffusion models on large-scale paired editing datasets. Building such datasets is very challenging: image editing pairs are rare, so it is almost impossible to collect a high-quality dataset that covers all types of edits.

### Proposed Solution in the Paper:

- **SeedEdit Framework**: The framework turns an image generation diffusion model into an image editing model. The generation model is aligned gradually so that it balances image reconstruction and re-generation.
- **Data Generation and Model Optimization**: A pre-trained text-to-image (T2I) model first generates diverse paired data. The diffusion model is then aligned step by step through iterative data sampling and model optimization.
- **Causal Diffusion Model**: A causal diffusion model is proposed that handles image and text conditions simultaneously, improving editing accuracy and image consistency.

### Experimental Results:

- On the HQ-Edit and Emu Edit benchmark datasets, SeedEdit significantly outperforms existing methods in editing performance. The gain is most visible on HQ-Edit, where SeedEdit reaches higher CLIP image similarity, indicating better retention of the original image content.
- Quality evaluation results show that SeedEdit has a higher success rate on vague instructions and fine-grained edits.

In conclusion, the SeedEdit framework addresses the balance between image reconstruction and re-generation in image editing, improving the controllability and accuracy of edits.
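
The iterative procedure described under Data Generation and Model Optimization can be pictured as a simple loop that alternates sampling and optimization. The following is only a minimal sketch of that control flow, not the authors' implementation: `align_editor`, `sample_pairs`, `optimize`, and the `Pair` layout are hypothetical names introduced here, and the choice to sample each round from the current model is an assumption.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
# Hypothetical pair layout: (source image, instruction, edited image) placeholders.
Pair = Tuple[str, str, str]


def align_editor(
    model: Model,
    sample_pairs: Callable[[Model, int], List[Pair]],
    optimize: Callable[[Model, List[Pair]], Model],
    rounds: int,
    pairs_per_round: int,
) -> Model:
    """Alternate data sampling and model optimization for a fixed number of rounds."""
    for _ in range(rounds):
        # Assumption: each round draws fresh pairs from the current model.
        pairs = sample_pairs(model, pairs_per_round)
        # The optimizer returns the updated model used in the next round.
        model = optimize(model, pairs)
    return model
```

Passing the sampling and optimization steps as callables keeps the sketch independent of any particular diffusion library; a real system would replace them with an actual generator and training step.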
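
The CLIP image similarity mentioned in the experimental results is commonly computed as the cosine similarity between the CLIP image embeddings of the input image and the edited image. A minimal sketch of that final step follows, assuming the embeddings have already been extracted by some CLIP image encoder. The function name `cosine_similarity` is chosen here for illustration, and the exact CLIP variant and preprocessing used by the benchmarks are not stated in this summary.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A value closer to 1 means the edited image's embedding points in nearly the same direction as the input's, which is why a higher score is read as better retention of the original content.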