Abstract: Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion models. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
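As a rough illustration of the first stage, the training signal for a semantic inpainter can be sketched as feature regression restricted to the masked region. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss, so the mean-squared-error form, the per-patch feature layout, and the names `masked_distillation_loss`, `teacher`, `student`, and `mask` are all hypothetical.

```python
import numpy as np


def masked_distillation_loss(student_feats, teacher_feats, patch_mask):
    """Mean squared error between predicted and target features on masked patches.

    Assumed layout (hypothetical, not taken from the paper):
      student_feats, teacher_feats: arrays of shape (num_patches, dim)
      patch_mask: boolean array of shape (num_patches,), True where the
                  patch lies inside the region to be inpainted.
    """
    student_feats = np.asarray(student_feats, dtype=np.float64)
    teacher_feats = np.asarray(teacher_feats, dtype=np.float64)
    patch_mask = np.asarray(patch_mask, dtype=bool)
    if not patch_mask.any():
        # Nothing to inpaint, so there is nothing to penalize.
        return 0.0
    diff = student_feats[patch_mask] - teacher_feats[patch_mask]
    return float(np.mean(diff ** 2))


# Toy example: 4 patches with 3-dimensional features; patches 1 and 2 are masked.
teacher = np.arange(12, dtype=np.float64).reshape(4, 3)
student = teacher.copy()
student[1] += 1.0  # error on a masked patch contributes to the loss
student[3] += 5.0  # error on an unmasked patch is ignored
mask = np.array([False, True, True, False])

loss = masked_distillation_loss(student, teacher, mask)
print(loss)  # 0.5
```

Restricting the loss to masked patches reflects the idea that the unmasked context is given as input and only the target object's features need to be inferred.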

What problem does this paper attempt to address?

This paper addresses text-guided object inpainting: generating a new object inside a designated image region so that it is both high quality and semantically consistent with the text prompt. Two challenges stand out:

1. **A single U-Net struggles to align text prompts with visual objects**: Relying on one U-Net across all denoising timesteps is not sufficient to generate the desired objects.
2. **Controllability in a complex sampling space**: In the intricate sampling space of diffusion models, it is hard to control object generation precisely without additional control signals.

To address these problems, the authors propose **CAT-Diffusion (CAscaded Transformer-Diffusion)**, which decomposes the usual single-stage object inpainting process into two cascaded processes:

1. **Semantic pre-inpainting**: Infer the semantic features of the target object in a multi-modal feature space.
2. **High-fidelity object generation**: Use these semantic features as visual prompts to guide the diffusion model toward controllable object generation.

In this way, CAT-Diffusion aims to align text prompts with visual objects more closely and to improve the quality and controllability of object generation.

### Specific improvement methods

1. **Semantic pre-inpainting module**:
   - A Transformer-based semantic inpainter predicts the semantic features of the target object, conditioned on the unmasked context and the text prompt.
   - Knowledge distillation transfers the knowledge of a pre-trained multi-modal model (such as CLIP) to the semantic inpainter, so that the predicted semantic features are aligned with the text prompt.
2. **Reference adapter layer**:
   - The output of the semantic inpainter is fed to the diffusion model as an additional condition through the reference adapter layer.
   - This layer injects the visual prompts into the diffusion model, improving controllability and quality in object inpainting.
3. **Experimental verification**:
   - Extensive experiments on the OpenImages-V6 and MSCOCO datasets support the superiority of CAT-Diffusion in generating high-quality objects.
   - Quantitative metrics include FID, Local FID, and CLIP score; qualitative evaluation shows better semantic alignment and visual consistency.

In conclusion, by introducing semantic pre-inpainting and the reference adapter layer, the paper improves text-guided object inpainting and addresses the alignment and controllability problems of existing methods.
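One plausible way to picture the reference adapter layer is as a gated cross-attention block in which the diffusion model's hidden tokens attend to the semantic inpainter's output. The summary above does not describe the layer's internals, so everything in the following sketch is an assumption made for illustration: the cross-attention form, the `tanh` gate, and the names `reference_adapter`, `w_q`, `w_k`, `w_v`, and `gate` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_adapter(hidden, ref_feats, w_q, w_k, w_v, gate):
    """Gated cross-attention from hidden tokens to reference features (assumed design).

    hidden:    (n, d)      tokens from a diffusion U-Net block
    ref_feats: (m, d_ref)  semantic features used as visual prompts
    gate:      scalar; tanh(0) = 0, so a zero gate leaves `hidden` unchanged
    """
    q = hidden @ w_q      # queries from the diffusion tokens, (n, d_attn)
    k = ref_feats @ w_k   # keys from the reference features, (m, d_attn)
    v = ref_feats @ w_v   # values projected back to width d, (m, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n, m)
    return hidden + np.tanh(gate) * (attn @ v)


rng = np.random.default_rng(0)
n, m, d, d_ref, d_attn = 6, 4, 8, 5, 8
hidden = rng.normal(size=(n, d))
ref = rng.normal(size=(m, d_ref))
w_q = rng.normal(size=(d, d_attn))
w_k = rng.normal(size=(d_ref, d_attn))
w_v = rng.normal(size=(d_ref, d))

out_closed = reference_adapter(hidden, ref, w_q, w_k, w_v, gate=0.0)
out_open = reference_adapter(hidden, ref, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the layer is an identity mapping, which is a common way to add a new conditioning path to a pre-trained network without disturbing its behavior at the start of training; whether CAT-Diffusion uses such a gate is not stated here.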

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Towards Interactive Facial Image Inpainting by Text or Exemplar Image.

MIGT: Multi-modal Image Inpainting Guided with Text.

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

MMGInpainting: Multi-Modality Guided Image Inpainting Based On Diffusion Models

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

Coherent and Multi-modality Image Inpainting via Latent Space Optimization

Text Image Inpainting via Global Structure-Guided Diffusion Models

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Delving Globally into Texture and Structure for Image Inpainting

Text-Guided Texturing by Synchronized Multi-View Diffusion

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Neural Image Inpainting Guided with Descriptive Text

PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

Text-image Alignment for Diffusion-based Perception

Mutual Dual-task Generator with Adaptive Attention Fusion for Image Inpainting