Abstract: Image editing aims to modify a given synthetic or real image to meet specific user requirements. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advances in this field build on text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Because training-based methods learn to directly map the source image to the target image under user guidance, we discuss them separately and introduce schemes for injecting the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracking related work at https://github.com/xinchengshuai/Awesome-Image-Editing.

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

Text-Driven Image Editing via Learnable Regions

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

Prompt-to-Prompt Image Editing with Cross Attention Control

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

Exploring Text-Guided Single Image Editing for Remote Sensing Images

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

FEAT: Face Editing with Attention

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Region-Aware Diffusion for Zero-shot Text-driven Image Editing

Image Editing Via Segmentation Guided Self-Attention Network

Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

Editing Text in the Wild

Text Guided Image Editing with Automatic Concept Locating and Forgetting