Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar
2023-08-10
Abstract: Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to be removed based on natural language input and removes it, simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses a key problem in image inpainting: how to automatically remove specified objects from an image based on natural language instructions, without requiring the user to provide a binary mask that defines the area to be removed.

### Background and Motivation

Traditional image inpainting methods typically rely on binary masks to define the objects to be removed or the missing areas. These masks are usually drawn by hand, which is time-consuming and prone to errors. Text-based image generation and editing have recently made significant progress, especially with models trained on large-scale image-text datasets, such as DALL·E 2 and Stable Diffusion, which show powerful generative capabilities. However, when applied to image inpainting, these methods usually still require a binary mask to guide the process, which limits their practicality.

### Research Objectives

To overcome these issues, the paper proposes a new task, instructional image inpainting, in which the objects to be removed are specified solely through natural language instructions, without any binary masks. The main contributions are:

1. **Constructing a New Dataset**: Based on the GQA dataset, a new benchmark dataset named GQA-Inpaint is constructed for training and evaluating instructional image inpainting models.
2. **Designing a New Inpainting Framework**: A single-stage deep inpainting network named Inst-Inpaint is proposed, which removes objects from images based on text instructions without explicitly predicting binary masks.
3. **Experimental Validation**: Extensive experiments demonstrate the effectiveness of the framework, with significant improvements on multiple evaluation metrics for text-based image inpainting.

### Method Overview

1. **Dataset Generation**:
   - Select objects from the scene graphs of the GQA dataset.
   - Use instance segmentation methods to extract the segmentation masks of the objects.
   - Apply an existing image inpainting method (e.g., CRFill) to remove the objects.
   - Generate text instructions based on the scene graphs.
2. **Model Design**:
   - Construct a conditional diffusion model, Inst-Inpaint, based on the latent diffusion model.
   - The model takes the source image and a text instruction as inputs. An encoder projects the image into a low-dimensional latent space, and a decoder reconstructs the image from that space.
   - During the forward diffusion process, noise is gradually added to the latent, and the model is trained to predict the denoised latent representation at each time step.
   - A cross-attention mechanism handles the text condition, enabling the model to remove objects based on text instructions.
3. **Experiments and Evaluation**:
   - Conduct experiments on synthetic and real image datasets.
   - Use various evaluation metrics (including CLIP-based inpainting scores) to compare the performance of different methods.
   - Demonstrate the advantages of Inst-Inpaint in text-based image inpainting tasks.

### Conclusion

The proposed Inst-Inpaint framework removes specified objects from images based on natural language instructions, without relying on user-provided binary masks, which spares users the time-consuming and error-prone masking step. The research offers a new approach to image inpainting with practical application value.
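
To make the diffusion step mentioned under Model Design more concrete, the following is a minimal NumPy sketch of the standard forward noising process used by diffusion models in general. It is not the authors' implementation. The linear beta schedule, the 1000 steps, the latent shape `(4, 32, 32)`, and the function names `make_alpha_bar`, `add_noise`, and `recover_z0` are all assumptions made for illustration. The trained denoising network is replaced by an oracle that already knows the noise, only to show how a noise estimate maps back to a clean latent.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule.

    The schedule and its endpoints are assumed values, not taken from the paper.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps


def recover_z0(z_t, eps_hat, t, alpha_bar):
    """Invert the noising step given a noise estimate eps_hat."""
    return (z_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for an encoded image latent (hypothetical shape).
z0 = rng.standard_normal((4, 32, 32))

# Noise the latent at an intermediate time step.
z_t, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# With the true noise as an oracle estimate, the clean latent is recovered.
z0_hat = recover_z0(z_t, eps, t=500, alpha_bar=alpha_bar)
```

In a real model, `eps_hat` would come from a network conditioned on the source image and the text instruction (the latter through cross-attention), and the recovered latent would be passed through the decoder to produce the output image.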