Abstract: Image inpainting refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and simultaneously removes it. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts. We set up various GAN- and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods using different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to solve a key problem in the task of image inpainting: how to automatically remove specified objects from an image based on natural language instructions, without requiring the user to provide a binary mask to define the area to be removed.

### Background and Motivation

Traditional image inpainting methods typically rely on binary masks to define the objects to be removed or the missing areas. These masks are usually manually drawn by users, which is not only time-consuming but also prone to errors. In recent years, text-based image generation and editing technologies have made significant progress, especially models trained on large-scale image-text datasets, such as DALL·E2 and Stable Diffusion, which have shown powerful generative capabilities. However, when applied to image inpainting, these methods usually require additional binary masks to guide the inpainting process, which limits their practicality.

### Research Objectives

To overcome the above issues, this paper proposes a new task, instructional image inpainting, which specifies the objects to be removed solely through natural language instructions, without any binary masks. Specifically, the main contributions of this paper include:

1. **Constructing a New Dataset**: Based on the GQA dataset, a new benchmark dataset named GQA-Inpaint is constructed for training and evaluating instructional image inpainting models.
2. **Designing a New Inpainting Framework**: A single-stage deep inpainting network named Inst-Inpaint is proposed, which can automatically remove objects from images based on text instructions without explicitly predicting binary masks.
3. **Experimental Validation**: Extensive experiments demonstrate the effectiveness of this framework, achieving significant improvements on multiple evaluation metrics, especially in text-based image inpainting tasks.

### Method Overview

1. **Dataset Generation**:
   - Select objects from the scene graphs of the GQA dataset.
   - Use instance segmentation methods to extract the segmentation masks of objects.
   - Apply advanced image inpainting methods (e.g., CRFill) to remove objects.
   - Generate text instructions based on the scene graphs.
2. **Model Design**:
   - Construct a conditional diffusion model Inst-Inpaint based on the latent diffusion model.
   - The model inputs include the source image and text instructions, projecting the image into a low-dimensional latent space through an encoder and reconstructing the image through a decoder.
   - During the diffusion process, noise is gradually added, and the denoised latent representation is predicted at each time step.
   - Introduce a cross-attention mechanism to handle text conditions, enabling the model to remove objects based on text instructions.
3. **Experiments and Evaluation**:
   - Conduct experiments on synthetic and real image datasets.
   - Use various evaluation metrics (including CLIP-based inpainting scores) to compare the performance of different methods.
   - Demonstrate the significant advantages of Inst-Inpaint in text-based image inpainting tasks.

### Conclusion

The proposed Inst-Inpaint framework can automatically remove specified objects from images based on natural language instructions without relying on user-provided binary masks, significantly improving the efficiency and accuracy of image inpainting. This research provides a new solution for image inpainting tasks, with important practical application value.
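The dataset-generation steps listed under Method Overview can be made concrete with a small sketch. Everything in it is an assumption made for illustration: `SceneObject` is a simplified stand-in and not the real GQA scene-graph schema, the instruction templates in `build_instruction` are invented, and `mean_fill` is a toy placeholder for a real inpainter such as CRFill. The point it tries to show is that the segmentation mask is used only while building the target image, so the resulting training sample contains a source image, an instruction, and a target image, but no mask.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]  # toy grayscale image as nested lists
Mask = List[List[int]]     # 1 marks pixels of the object to remove


@dataclass
class SceneObject:
    """Hypothetical, simplified scene-graph node (not the real GQA schema)."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Each relation is (predicate, id of the other object), e.g. ("next to", "o3").
    relations: List[Tuple[str, str]] = field(default_factory=list)


def build_instruction(obj_id: str, scene: Dict[str, SceneObject]) -> str:
    """Turn a scene-graph node into a removal instruction (illustrative templates)."""
    obj = scene[obj_id]
    same_name = [o for o in scene.values() if o.name == obj.name]
    if len(same_name) == 1:
        # The object name alone is unambiguous in this scene.
        return f"Remove the {obj.name}"
    if obj.attributes:
        # Disambiguate with the first attribute.
        return f"Remove the {obj.attributes[0]} {obj.name}"
    if obj.relations:
        # Disambiguate with the first relation to another object.
        predicate, other_id = obj.relations[0]
        return f"Remove the {obj.name} {predicate} the {scene[other_id].name}"
    return f"Remove the {obj.name}"


def mean_fill(image: Image, mask: Mask) -> Image:
    """Toy stand-in for a real inpainter: fill masked pixels with the mean of the rest."""
    kept = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if not m]
    fill = sum(kept) / len(kept) if kept else 0.0
    return [[fill if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def make_sample(image: Image, mask: Mask, obj_id: str,
                scene: Dict[str, SceneObject],
                inpaint: Callable[[Image, Mask], Image] = mean_fill) -> dict:
    """Assemble one (source, instruction, target) triplet; the mask is not stored."""
    return {
        "source": image,
        "instruction": build_instruction(obj_id, scene),
        "target": inpaint(image, mask),
    }
```

Calling `make_sample` with a real segmentation mask and a real inpainting function in place of `mean_fill` would follow the same shape; only the placeholder pieces would change.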
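The Model Design step that says noise is gradually added to the latent can be illustrated with the standard closed-form forward process used by DDPM-style models. This is a minimal sketch under stated assumptions, not the Inst-Inpaint implementation: the linear beta schedule, its endpoints, the step count, and all function names are choices made here, and the conditioning on the source image and the text instruction is left out entirely. A latent is represented as a flat list of floats to keep the example dependency-free.

```python
import math
import random
from typing import List, Tuple


def linear_alpha_bars(num_steps: int = 1000,
                      beta_start: float = 1e-4,
                      beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0: List[float], t: int, alpha_bars: List[float],
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Sample z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps; also return eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


def estimate_z0(zt: List[float], eps_hat: List[float], t: int,
                alpha_bars: List[float]) -> List[float]:
    """Invert the forward formula given a noise estimate.

    In a trained model eps_hat would come from the denoising network;
    here the caller supplies it directly.
    """
    a = alpha_bars[t]
    return [(z - math.sqrt(1.0 - a) * e) / math.sqrt(a) for z, e in zip(zt, eps_hat)]
```

If the noise estimate passed to `estimate_z0` equals the noise that `add_noise` actually drew, the original latent is recovered up to floating-point error, which is why predicting the noise and predicting the denoised latent are two views of the same quantity in this formulation.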

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Fill in the ____ (a Diffusion-based Image Inpainting Pipeline)

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

MMGInpainting: Multi-Modality Guided Image Inpainting Based On Diffusion Models

Face Image Inpainting Based on Generative Adversarial Network

MagicEraser: Erasing Any Objects via Semantics-Aware Control

Magicremover: Tuning-free Text-guided Image inpainting with Diffusion Models

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

Outline-Guided Object Inpainting with Diffusion Models

DiffGANPaint: Fast Inpainting Using Denoising Diffusion GANs

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

A Hybrid Inpainting Model Combining Diffusion and Enhanced Exemplar Methods

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models

CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All