Abstract: Text-based editing diffusion models exhibit limited performance when the user's input instruction is ambiguous. To solve this problem, we propose $\textit{Specify ANd Edit}$ (SANE), a zero-shot inference pipeline for diffusion-based editing systems. We use a large language model (LLM) to decompose the input instruction into specific instructions, i.e. well-defined interventions to apply to the input image to satisfy the user's request. We benefit from the LLM-derived instructions alongside the original one, thanks to a novel denoising guidance strategy specifically designed for the task. Our experiments with three baselines and on two datasets demonstrate the benefits of SANE in all setups. Moreover, our pipeline improves the interpretability of editing models and boosts the output diversity. We also demonstrate that our approach can be applied to any edit, whether ambiguous or not. Our code is public at https://github.com/fabvio/SANE.

What problem does this paper attempt to address?

This paper addresses the limited performance of text-based image-editing diffusion models when the user's editing instruction is ambiguous. The authors observe that existing text-conditioned editing methods often fail on ambiguous instructions such as "Make the dog look cool". Such an instruction admits several interpretations and several ways of realizing each one, so the model cannot determine which concrete edits to apply.

To solve this problem, the authors propose Specify ANd Edit (SANE), a zero-shot inference pipeline that uses a large language model (LLM) to break an ambiguous input instruction down into specific, well-defined instructions, with the aim of improving both the accuracy and the diversity of the edits. The main contributions of SANE are:

1. **An editing method for ambiguous instructions**: SANE is presented as the first editing method designed specifically to handle ambiguous instructions.
2. **An LLM-based instruction-decomposition pipeline**: The reasoning ability and general knowledge of LLMs are used to break an ambiguous instruction down into a series of specific editing tasks.
3. **A conditioning mechanism that combines ambiguous and specific instructions**: A novel denoising-guidance strategy combines the specific editing instructions with the original ambiguous instruction to guide the editing process.

With this approach, SANE improves the performance of the editing model, makes the editing process more interpretable, and can be applied zero-shot to any pre-trained instruction-driven diffusion model.
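The decomposition step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `specify` function, the `fake_llm` stand-in, and the two example edits it returns are all hypothetical, and `llm` is assumed to be any callable that maps a prompt string to a text reply.

```python
from typing import Callable, List

# Hypothetical prompt; the actual prompt used by SANE is not given in this summary.
PROMPT_TEMPLATE = (
    "The user wants to edit an image with the instruction: '{instruction}'.\n"
    "List up to {k} specific, well-defined edits, one per line."
)


def specify(instruction: str, llm: Callable[[str], str], k: int = 3) -> List[str]:
    """Ask an LLM for specific instructions and parse one instruction per line."""
    reply = llm(PROMPT_TEMPLATE.format(instruction=instruction, k=k))
    cleaned = []
    for line in reply.splitlines():
        # Drop list markers such as "1. ", "- " or "* " at the start of a line.
        text = line.strip().lstrip("-*0123456789. ").strip()
        if text:
            cleaned.append(text)
    return cleaned[:k]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, returning a fixed, made-up reply."""
    return "1. Add sunglasses to the dog\n2. Add a leather jacket to the dog\n"


specific_instructions = specify("Make the dog look cool", fake_llm)
print(specific_instructions)
```

Each returned string would then condition the editing model alongside the original ambiguous instruction.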
### Formula Representation

Some of the formulas involved in the paper are as follows:

- **Denoising Estimation**:

  \[ \epsilon_t = f_\theta(z_t, E(x), c) \]

  where $\epsilon_t$ is the noise estimate at time step $t$, $f_\theta$ is the U-Net model, $z_t$ is the noisy latent at time step $t$, $E(x)$ is the encoding of the input image, and $c$ is the editing instruction.

- **Combined Noise Estimation**:

  \[ \tilde{\epsilon}_t = \epsilon^U_t + \epsilon^I_t + \epsilon^C_t + \epsilon^S_t \]

  where:

  \[ \epsilon^U_t = f_\theta(z_t, \emptyset, \emptyset) \]

  \[ \epsilon^I_t = w_I \cdot \left( f_\theta(z_t, E(x), \emptyset) - f_\theta(z_t, \emptyset, \emptyset) \right) \]

  \[ \epsilon^C_t = w_C \cdot \left( \epsilon_t - f_\theta(z_t, E(x), \emptyset) \right) \]

  \[ \epsilon^S_t = w_S \cdot \left( \bar{\epsilon}^s_t - f_\theta(z_t, E(x), \emptyset) \right) \]

  Here $w_I$, $w_C$, and $w_S$ are the weights that scale the image, original-instruction, and specific-instruction guidance terms.

- **Aggregating Noise for Specific Instructions**:

  \[ \bar{\epsilon}^s_t = \sum_{i} I(M_t = i) \cdot \epsilon^s_{i,t} \]

  where $\epsilon^s_{i,t}$ is the noise estimate for the $i$-th specific instruction and $I(M_t = i)$ is an indicator function, which is 1 when $M_t$ equals $i$ and 0 otherwise.

Through these formulas, SANE turns an ambiguous instruction into specific editing tasks and guides the diffusion model toward more accurate image edits.
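As a rough illustration of how the guidance terms in the formulas above combine, the following NumPy sketch evaluates them on toy arrays. The function names, array shapes, and weight values are assumptions made for illustration, and the arrays stand in for U-Net outputs. $M_t$ is assumed here to be an integer map that assigns every spatial location to exactly one specific instruction; how SANE actually computes it is not described in this summary.

```python
import numpy as np


def aggregate_specific(eps_specific: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Compute sum_i I(M_t = i) * eps^s_{i,t}.

    eps_specific: (K, C, H, W) noise estimates, one per specific instruction.
    mask: (H, W) integer map M_t with values in [0, K) (assumed representation).
    """
    out = np.zeros_like(eps_specific[0])
    for i in range(eps_specific.shape[0]):
        # (H, W) indicator broadcasts over the channel axis of (C, H, W).
        out += (mask == i) * eps_specific[i]
    return out


def combined_noise(eps_uncond, eps_img, eps_full, eps_spec_bar, w_I, w_C, w_S):
    """Combine the four guidance terms into the final noise estimate.

    eps_uncond   = f(z_t, {}, {})        no image, no instruction
    eps_img      = f(z_t, E(x), {})      image only
    eps_full     = f(z_t, E(x), c)       image and original instruction
    eps_spec_bar = aggregated estimate for the specific instructions
    """
    eps_U = eps_uncond
    eps_I = w_I * (eps_img - eps_uncond)
    eps_C = w_C * (eps_full - eps_img)
    eps_S = w_S * (eps_spec_bar - eps_img)
    return eps_U + eps_I + eps_C + eps_S


# Toy 1-channel, 2x2 example with two specific instructions.
shape = (1, 2, 2)
eps_uncond = np.zeros(shape)
eps_img = np.ones(shape)
eps_full = 2.0 * np.ones(shape)
eps_specific = np.stack([3.0 * np.ones(shape), 5.0 * np.ones(shape)])
mask = np.array([[0, 1], [1, 0]])

eps_bar = aggregate_specific(eps_specific, mask)
eps_tilde = combined_noise(eps_uncond, eps_img, eps_full, eps_bar,
                           w_I=1.0, w_C=2.0, w_S=0.5)
print(eps_tilde[0])
```

Setting `w_S=0` removes the specific-instruction term, leaving only the unconditional, image, and original-instruction terms.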

Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
