Abstract: Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The pursuit of better image editability has sparked significant interest in the downstream task of inpainting a novel object, described by a text prompt, within a designated region of an image. Nevertheless, the problem is nontrivial in two respects: 1) relying solely on a single U-Net to align the text prompt and the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of diffusion models. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of the desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space that pivots on these inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.

Locate, Assign, Refine: Taming Customized Promptable Image Inpainting

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Coherent and Multi-modality Image Inpainting via Latent Space Optimization

PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

MMGInpainting: Multi-Modality Guided Image Inpainting Based On Diffusion Models

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Image Inpainting by End-to-End Cascaded Refinement With Mask Awareness

Adaptive Multi-Modality Prompt Learning

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

PromptFix: You Prompt and We Fix the Photo

Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models