InsightEdit: Towards Better Instruction Following for Image Editing

Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, Qiang Liu

2024-11-26

Abstract: In this paper, we focus on the task of instruction-based image editing. Previous works such as InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations remain. First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while the rich image information is underexplored, and are therefore inferior at following complex instructions and maintaining background consistency. Targeting these issues, we first curate the AdvancedEdit dataset using a novel data construction pipeline, producing a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism that uses both the textual and visual features reasoned by powerful Multimodal Large Language Models (MLLMs) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper targets two problems:

1. **Lack of high-quality datasets**: Existing image-editing datasets suffer from low resolution, poor background consistency, and overly simple, templated instructions. These problems limit the ability of image-editing models to handle complex instructions and generate high-fidelity target images.
2. **Insufficient use of image conditions**: Current methods mainly rely on text conditions (such as the CLIP text encoder) and ignore the rich visual semantic information in the image. As a result, they perform poorly at handling complex instructions and maintaining background consistency.

To address these challenges, the authors propose the following solutions:

- **Construct an automated data-generation pipeline**: The pipeline generates high-quality image-editing pairs with complex instructions and good background consistency, addressing the problem of low-quality datasets.
- **Introduce a two-stream bridging mechanism**: The mechanism uses the text and visual features extracted by multimodal large language models (MLLMs) to guide the image-editing process more accurately, addressing the problem of insufficient image conditions.

With these contributions, the proposed method, InsightEdit, performs well at following complex instructions and maintaining background consistency, achieving state-of-the-art performance.
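Background consistency comes up repeatedly above, so a concrete measure may help. The sketch below shows one simple way to quantify it, assuming an edit mask is available: the mean absolute pixel difference between the source and edited images, taken only over pixels outside the edited region. This is an illustrative assumption and not the metric used in the paper; the name `background_mae`, the grayscale list-of-lists image format, and the sample values are all hypothetical.

```python
from typing import List

# Hypothetical formats for illustration only (not from the paper):
Image = List[List[float]]  # grayscale image, rows of intensities in [0, 1]
Mask = List[List[bool]]    # True where the edit is allowed to change pixels


def background_mae(source: Image, edited: Image, edit_mask: Mask) -> float:
    """Mean absolute difference over pixels outside the edit mask.

    Lower is better; 0.0 means the background was left untouched.
    """
    total = 0.0
    count = 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for src_px, out_px, in_edit in zip(src_row, out_row, mask_row):
            if not in_edit:  # only background pixels contribute
                total += abs(src_px - out_px)
                count += 1
    if count == 0:
        raise ValueError("edit mask covers the whole image; no background to compare")
    return total / count


# Toy 2x2 example: one pixel is edited, one background pixel drifts by 0.5.
source = [[0.0, 0.5], [1.0, 0.25]]
edited = [[0.0, 0.5], [0.0, 0.75]]
edit_mask = [[False, False], [True, False]]
score = background_mae(source, edited, edit_mask)
```

In the toy example, the large change at the masked pixel is ignored and only the drift in the three background pixels is counted. A real evaluation would more likely operate on color images and may use perceptual measures rather than raw pixel differences.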

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

InstructGIE: Towards Generalizable Image Editing

Guiding Instruction-based Image Editing via Multimodal Large Language Models

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Learning to Follow Object-Centric Image Editing Instructions Faithfully

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

Multi-Reward as Condition for Instruction-based Image Editing

Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping

InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models