Abstract: Recently, diffusion models have exhibited superior performance in image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically struggle with semantic consistency between images and text and with accommodating users' editing habits. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that closely aligns with the user's input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the mask generation algorithm for the training and test datasets to simulate how users apply masks, and introduce a customized training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.
What problem does this paper attempt to address?
The paper addresses three shortcomings of existing diffusion-model-based methods for image inpainting: weak semantic consistency, poor fit to user editing habits, and limited local detail generation. Specifically:
1. **Semantic Consistency**: Existing methods often fail to keep the content generated for the masked area semantically consistent with the text prompt, especially in local areas.
2. **User Editing Habits**: Existing methods do not fully simulate users' editing habits, such as the diverse ways in which users apply masks.
3. **Local Detail Generation**: Most methods rely on global text prompts and lack descriptions of local details, so the generated images may not meet expectations in local areas.
To solve these problems, the authors propose PainterNet, a plug-in framework that can be flexibly embedded into various diffusion models. PainterNet improves image inpainting through the following innovations:
- **Local Text Prompt Input**: It introduces local text prompts, enabling the model to better understand and generate high-quality content related to local areas.
- **Attention Control Points (ACP) and Actual-Token Attention Loss (ATAL)**: These mechanisms enhance the model's attention to the masked area and ensure high consistency between the generated content and the text prompt.
- **Diverse Mask Generation Strategies**: It redesigns the mask generation algorithms in the training and test datasets to simulate users' real-world usage habits and introduces a new training dataset, PainterData, and a benchmark dataset, PainterBench.
Through these improvements, PainterNet outperforms the existing state-of-the-art models in key metrics such as image quality and global/local text consistency.
### Key Formulas
1. **Diffusion Loss**:
\[
L_{\text{diff}}=\mathbb{E}_{z_0, \epsilon \sim \mathcal{N}(0,1), t, c, h}\left[\left\|\epsilon-\epsilon_\theta(z_t, t, c)\right\|_2^2\right]
\]
where $\epsilon \sim \mathcal{N}(0,1)$ is randomly sampled Gaussian noise, $t\in[1, T]$ is the time step, $T$ is the total number of time steps, $c = \tau(P)$ is the text embedding, $z_0 = E(x_0)$ is the latent representation of $x_0$, and $z_t=\alpha_t z_0+\sigma_t \epsilon$.
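The diffusion loss can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: it evaluates a single-sample estimate of the squared error between the sampled noise and a predicted noise. The names `diffusion_loss`, `oracle_predictor`, and `zero_predictor`, and the values chosen for `alpha_t` and `sigma_t`, are assumptions made for illustration; a real model would replace the toy predictors with a trained network $\epsilon_\theta$.

```python
import numpy as np

def diffusion_loss(z0, eps, alpha_t, sigma_t, eps_theta, t, c):
    """Single-sample estimate of ||eps - eps_theta(z_t, t, c)||_2^2."""
    # Forward noising step: z_t = alpha_t * z_0 + sigma_t * eps.
    z_t = alpha_t * z0 + sigma_t * eps
    residual = eps - eps_theta(z_t, t, c)
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # stand-in for a latent z_0 = E(x_0)
eps = rng.standard_normal(z0.shape)   # Gaussian noise
alpha_t, sigma_t = 0.8, 0.6           # illustrative schedule values

def oracle_predictor(z_t, t, c):
    # Hypothetical predictor that knows z0 and inverts the noising step.
    return (z_t - alpha_t * z0) / sigma_t

def zero_predictor(z_t, t, c):
    # Hypothetical predictor that always outputs zero noise.
    return np.zeros_like(z_t)

loss_oracle = diffusion_loss(z0, eps, alpha_t, sigma_t, oracle_predictor, t=10, c=None)
loss_zero = diffusion_loss(z0, eps, alpha_t, sigma_t, zero_predictor, t=10, c=None)
```

The oracle predictor drives the loss to (numerically) zero, while the zero predictor leaves the full squared norm of the noise. Training code commonly averages over elements and over a batch instead of summing; the sum is used here only to mirror the squared norm in the formula.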
2. **Actual-Token Attention Loss**:
\[
L_{\text{ATAL}}=\frac{1}{N}\sum_{i = 1}^{N}\frac{1}{L_S}\sum_{j\in S}\left\|A_{i,j}-m_i\right\|_2^2
\]
where $S$ is the set of actual text tokens and $L_S$ is the number of tokens in it, $A_{i,j}\in\mathbb{R}^{H\times W\times1}$ is the cross-attention map of the $j$-th actual text token in the $i$-th layer of PainterNet, and $m_i\in\mathbb{R}^{H\times W\times1}$ is the mask resized to match that map. The outer sum averages over the $N$ layers indexed by $i$.
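To make the indexing in the ATAL formula concrete, here is a minimal NumPy sketch under stated assumptions: each layer's attention is stored as an array of shape `(H, W, L)` with one channel per text token, the masks are already resized to each layer's resolution, and `token_idx` plays the role of $S$. The function name `atal_loss` and the toy data are hypothetical and do not come from the paper's code.

```python
import numpy as np

def atal_loss(attn_maps, masks, token_idx):
    """Average over layers and actual tokens of ||A_{i,j} - m_i||_2^2.

    attn_maps: list of N arrays, each (H_i, W_i, L), one channel per token.
    masks:     list of N arrays, each (H_i, W_i), already resized per layer.
    token_idx: indices of the actual text tokens (the set S).
    """
    total = 0.0
    for A, m in zip(attn_maps, masks):
        per_token = [np.sum((A[:, :, j] - m) ** 2) for j in token_idx]
        total += sum(per_token) / len(token_idx)
    return float(total / len(attn_maps))

def center_mask(size):
    # Toy square mask covering the central half of the map.
    m = np.zeros((size, size))
    q = size // 4
    m[q:size - q, q:size - q] = 1.0
    return m

num_tokens = 5
token_idx = [1, 2]                          # assumed positions of actual tokens
masks = [center_mask(4), center_mask(8)]    # two layers, two resolutions

# Case 1: actual-token attention coincides with the mask.
focused = []
for m in masks:
    A = np.zeros(m.shape + (num_tokens,))
    for j in token_idx:
        A[:, :, j] = m
    focused.append(A)

# Case 2: attention spread uniformly over the whole map.
diffuse = [np.full(m.shape + (num_tokens,), 0.5) for m in masks]

loss_focused = atal_loss(focused, masks, token_idx)
loss_diffuse = atal_loss(diffuse, masks, token_idx)
```

In this toy setup the loss vanishes when the actual-token attention matches the mask and grows when attention is spread outside it, which is the behavior the loss is meant to encourage.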
3. **Total Loss Function**:
\[
L = L_{\text{diff}}+\beta L_{\text{ATAL}}
\]
where $\beta$ is a hyperparameter used to adjust the influence of the ATAL loss.
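As a small worked illustration of how $\beta$ weights the two terms, the sketch below combines two scalar loss values. The numbers and the value of `beta` are purely illustrative assumptions; the source does not state the value used in training.

```python
def total_loss(l_diff, l_atal, beta):
    """Combine the two terms as L = L_diff + beta * L_ATAL."""
    return l_diff + beta * l_atal

# Illustrative loss values, not measurements from the paper.
l_diff, l_atal = 0.5, 2.0

loss_no_atal = total_loss(l_diff, l_atal, beta=0.0)     # reduces to L_diff
loss_with_atal = total_loss(l_diff, l_atal, beta=0.25)  # adds a weighted ATAL term
```

Setting `beta` to zero recovers plain diffusion training, and larger values put more weight on aligning attention with the mask.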
Through these formulas and mechanisms, PainterNet can more accurately capture the detailed information of the masked area and generate high-quality