Abstract: Recently, diffusion models have exhibited superior performance in image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically struggle with semantic consistency between images and text and with accommodating the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that closely aligns with the user's input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the mask generation algorithm for the training and testing datasets to simulate how users apply masks, and introduce a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.

What problem does this paper attempt to address?

The main problem this paper addresses is the shortcomings of existing diffusion-model-based methods in semantic consistency, user editing habits, and local detail generation in the image inpainting task. Specifically:

1. **Semantic Consistency**: Existing methods often fail to keep the generated content semantically consistent with the text prompt when filling the masked area, especially in local regions.
2. **User Editing Habits**: Existing methods do not fully simulate how users edit, such as the diverse ways in which users apply masks.
3. **Local Detail Generation**: Most methods rely on global text prompts and lack descriptions of local details, so the generated images may not meet expectations in local areas.

To solve these problems, the authors propose PainterNet, a plug-in framework that can be flexibly embedded into various diffusion models. PainterNet improves image inpainting through the following innovations:

- **Local Text Prompt Input**: It introduces local text prompts, enabling the model to better understand local areas and generate high-quality content for them.
- **Attention Control Points (ACP) and Actual-Token Attention Loss (ATAL)**: These mechanisms strengthen the model's attention to the masked area and keep the generated content consistent with the text prompt.
- **Diverse Mask Generation Strategies**: It redesigns the mask generation algorithms for the training and test datasets to simulate users' real-world usage habits, and introduces a new training dataset, PainterData, and a benchmark dataset, PainterBench.

Through these improvements, PainterNet outperforms existing state-of-the-art models in key metrics such as image quality and global/local text consistency.

### Formula Display

1. **Diffusion Loss**:
   \[ L_{\text{diff}}=\mathbb{E}_{z_0, \epsilon \sim \mathcal{N}(0,1), t, c, h}\left[\left\|\epsilon-\epsilon_\theta(z_t, t, c)\right\|_2^2\right] \]
   where $\epsilon \sim \mathcal{N}(0,1)$ is randomly sampled Gaussian noise, $t\in[1, T]$ is the time step, $T$ is the total number of time steps, $c = \tau(P)$ is the text embedding, $z_0 = E(x_0)$ is the latent representation of $x_0$, and $z_t=\alpha_t z_0+\sigma_t \epsilon$.
2. **Actual-Token Attention Loss**:
   \[ L_{\text{ATAL}}=\frac{1}{N}\sum_{i = 1}^{N}\frac{1}{L_S}\sum_{j\in S}\left\|A_{i,j}-m_i\right\|_2^2 \]
   where $S$ is the set of actual text tokens, $L_S$ is the number of elements in $S$, $N$ is the number of cross-attention layers summed over, $A_{i,j}\in\mathbb{R}^{H\times W\times1}$ is the cross-attention map of the $j$-th actual text token in the $i$-th layer of PainterNet, and $m_i\in\mathbb{R}^{H\times W\times1}$ is the mask resized to that layer's resolution.
3. **Total Loss Function**:
   \[ L = L_{\text{diff}}+\beta L_{\text{ATAL}} \]
   where $\beta$ is a hyperparameter that adjusts the influence of the ATAL loss.

Through these formulas and mechanisms, PainterNet can more accurately capture the detailed information of the masked area and generate high-quality
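
To make the three loss terms above concrete, here is a minimal NumPy sketch of how they could be computed on toy arrays. It is an illustration under stated assumptions, not the authors' implementation: the function names (`add_noise`, `diffusion_loss`, `atal_loss`, `total_loss`), the array layouts, and the value of `beta` are all hypothetical, and the squared L2 norm is taken as a plain sum of squared differences (a real training loop might normalize by the number of pixels instead).

```python
import numpy as np


def add_noise(z0, eps, alpha_t, sigma_t):
    """Forward noising: z_t = alpha_t * z_0 + sigma_t * eps."""
    return alpha_t * z0 + sigma_t * eps


def diffusion_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise, averaged over the batch.

    Both arrays have shape (batch, ...); the batch mean stands in for the expectation.
    """
    diff = (eps - eps_pred).reshape(eps.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def atal_loss(attn_maps, masks, actual_token_ids):
    """ATAL sketch: pull each actual token's attention map toward the resized mask.

    attn_maps: list of N arrays, one per layer, shape (H_i, W_i, num_tokens) (assumed layout)
    masks: list of N arrays, one per layer, shape (H_i, W_i), already resized
    actual_token_ids: indices of the actual text tokens (the set S)
    """
    total = 0.0
    for attn, mask in zip(attn_maps, masks):
        per_token = [np.sum((attn[:, :, j] - mask) ** 2) for j in actual_token_ids]
        total += sum(per_token) / len(actual_token_ids)  # 1 / L_S
    return float(total / len(attn_maps))  # 1 / N


def total_loss(l_diff, l_atal, beta):
    """L = L_diff + beta * L_ATAL."""
    return l_diff + beta * l_atal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.normal(size=(2, 4, 8, 8))       # toy latents
    eps = rng.normal(size=z0.shape)          # sampled Gaussian noise
    zt = add_noise(z0, eps, alpha_t=0.8, sigma_t=0.6)
    eps_pred = np.zeros_like(eps)            # stand-in for eps_theta(z_t, t, c)
    mask = np.zeros((8, 8))
    mask[2:6, 2:6] = 1.0                     # toy square mask
    attn = rng.uniform(size=(8, 8, 5))       # toy one-layer attention maps, 5 tokens
    l_atal = atal_loss([attn], [mask], actual_token_ids=[1, 2])
    print(total_loss(diffusion_loss(eps, eps_pred), l_atal, beta=0.1))
```

In this sketch the ATAL term is zero exactly when every actual token's attention map equals the mask, which matches the stated intent of concentrating attention on the masked area.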

PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control
