Abstract: Recent works have explored text-guided image editing using diffusion models, generating edited images from text prompts. However, these models struggle to accurately locate the regions to be edited and to perform precise edits faithfully. In this work, we propose a framework termed InstructEdit that performs fine-grained editing based on user instructions. The framework has three components: a language processor, a segmenter, and an image editor. The first component, the language processor, parses the user instruction with a large language model and outputs a prompt for the segmenter and captions for the image editor. We adopt ChatGPT and optionally BLIP2 for this step. The second component, the segmenter, takes the segmentation prompt provided by the language processor. We employ a state-of-the-art segmentation framework, Grounded Segment Anything, to automatically generate a high-quality mask from the segmentation prompt. The third component, the image editor, uses the captions from the language processor and the mask from the segmenter to compute the edited image. We adopt Stable Diffusion and the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thus improve the quality of the edited images. We also show that our framework can accept multiple forms of user instructions as input. We provide the code at https://github.com/QianWangX/InstructEdit.
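The three-stage data flow described in the abstract can be sketched as follows. This is only an illustrative wiring, not the authors' implementation: the names `ParsedInstruction`, `instruct_edit`, and the component signatures are hypothetical, and the assumption that the language processor emits one caption for the input image and one for the edited image is ours (the abstract only says "captions"). In practice the three callables would wrap ChatGPT/BLIP2, Grounded Segment Anything, and Stable Diffusion with DiffEdit-style mask guidance.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ParsedInstruction:
    """Hypothetical container for the language processor's output."""
    segmentation_prompt: str  # what to segment, e.g. "dog"
    input_caption: str        # assumed: caption describing the original image
    edited_caption: str       # assumed: caption describing the desired result


# Hypothetical component signatures; real backends are passed in as callables.
LanguageProcessor = Callable[[str], ParsedInstruction]
Segmenter = Callable[[Any, str], Any]             # (image, prompt) -> mask
ImageEditor = Callable[[Any, Any, str, str], Any]  # (image, mask, captions) -> image


def instruct_edit(
    image: Any,
    instruction: str,
    language_processor: LanguageProcessor,
    segmenter: Segmenter,
    image_editor: ImageEditor,
) -> Any:
    """Run the three stages in order and return the edited image."""
    # Stage 1: parse the user instruction into a prompt and captions.
    parsed = language_processor(instruction)
    # Stage 2: build a mask for the region named by the segmentation prompt.
    mask = segmenter(image, parsed.segmentation_prompt)
    # Stage 3: edit the image, guided by the mask and the captions.
    return image_editor(image, mask, parsed.input_caption, parsed.edited_caption)
```

Passing the components as callables keeps the sketch independent of any particular model library; each stage can be replaced or stubbed without touching the others.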

TiBERT: A Non-autoregressive Pre-trained Model for Text Editing

CodeEditor: Learning to Edit Source Code with Pre-trained Models

Editing Text in the Wild

EditEval: An Instruction-Based Benchmark for Text Improvements

Recurrent Inference in Text Editing

Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion

Towards a Unified Training for Levenshtein Transformer

Improving Diffusion Models for Scene Text Editing with Dual Encoders

An Imitation Learning Curriculum for Text Editing with Non-Autoregressive Models

TiBERT: Tibetan Pre-trained Language Model

EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Multi-Task Fine-Tuning on BERT Using Spelling Errors Correction for Chinese Text Classification Robustness

A Small and Fast BERT for Chinese Medical Punctuation Restoration

CoEdIT: Text Editing by Task-Specific Instruction Tuning

DiffUTE: Universal Text Editing Diffusion Model

XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Text Generation with Text-Editing Models