Abstract: Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images during test-time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO that estimates the quality of image patches within each generated image and accordingly trains the model. To this end, PatchDPO first leverages the pre-trained vision model with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experiment results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at https://github.com/hqhQAQ/PatchDPO.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses two related weaknesses of finetuning-free personalized image generation: low image quality and inconsistency with the reference images. Existing finetuning-free methods are trained in a single stage on a simple image reconstruction task, so the images they generate at test time often disagree with the reference images in local details. To address this, the paper introduces PatchDPO, an additional training stage applied to pre-trained personalized generation models.
### Main contributions of PatchDPO
1. **Construct a high-quality dataset**: A high-quality dataset is constructed for PatchDPO training to ensure that training is effective.
2. **Propose a self-supervised training method based on pre-trained vision models**: The resulting model estimates the quality of each patch in a generated image, which provides finer-grained feedback for model optimization.
3. **Propose a weighted training method**: The model is optimized according to the estimated patch quality, so that it retains high-quality patches and corrects low-quality patches.
4. **Demonstrate state-of-the-art performance**: Experiments show that PatchDPO achieves state-of-the-art performance on both single-object and multi-object personalized image generation.
### Specific problem description
Current finetuning-free methods have the following problems:
- **Low-quality generated images**: The generated images are inconsistent with the reference images in local details (in Figure 1, for example, the head, back, and legs of the generated object differ from the reference image).
- **Limitations of traditional DPO**: Traditional DPO only judges which of two samples is better overall, so it cannot handle the case where a generated image is inconsistent with the reference image only in some local patches.
### Solutions
To solve the above problems, PatchDPO proposes the following solutions:
- **Data construction**: Construct a training dataset containing multiple pairs of reference images and generated images.
- **Patch quality evaluation**: Extract image features with a pre-trained vision model, improve those features through a self-supervised training method, and then estimate the quality of each patch in the generated image.
- **Model optimization**: Adopt a weighted training method that optimizes the model according to the estimated patch quality, so that it retains high-quality patches and corrects low-quality patches.
Through these improvements, PatchDPO significantly improves the quality of finetuning-free personalized image generation and achieves state-of-the-art performance on both single-object and multi-object generation.
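The weighted training idea in the model optimization step can be illustrated with a minimal NumPy sketch. It assumes the weighting is applied element-wise to the noise-prediction residual before taking the squared L2 norm, which is how the first term of the loss is written in this summary. The function name `weighted_noise_loss`, the argument names, and the `(C, H, W)` layout are hypothetical, and the rescaling from raw patch quality to the weight map is not specified here, so the weights are taken as given.

```python
import numpy as np

def weighted_noise_loss(eps_target, eps_pred, patch_weight):
    """Squared L2 norm of the patch-weighted noise residual.

    eps_target, eps_pred: arrays of shape (C, H, W), the sampled noise and
        the model's noise prediction (illustrative layout).
    patch_weight: array of shape (H, W), one weight per spatial position,
        assumed to be derived from the estimated patch quality.
    """
    # Broadcast the per-position weight over the channel axis.
    residual = (eps_target - eps_pred) * patch_weight[None, :, :]
    return float(np.sum(residual ** 2))

# Toy example: positions with larger weight contribute more to the loss.
eps_target = np.ones((2, 2, 2))
eps_pred = np.zeros((2, 2, 2))
patch_weight = np.array([[1.0, 0.0], [0.5, 2.0]])
loss = weighted_noise_loss(eps_target, eps_pred, patch_weight)
```

In this sketch a position with weight 0 is ignored entirely, while a position with a large weight dominates the loss; how the actual method maps patch quality to weights is left open here.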
### Formula representation
When evaluating the patch quality, the following formula is used:
\[ p(x_{\text{gen}}[h, w]) = \max_{i,j} \frac{f(x_{\text{gen}})[h, w] \cdot f(x_{\text{ref}})[i, j]}{\|f(x_{\text{gen}})[h, w]\| \|f(x_{\text{ref}})[i, j]\|} \]
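where \( x_{\text{gen}} \) and \( x_{\text{ref}} \) denote the generated image and the reference image respectively, \( f \) denotes the pre-trained vision model, and \( p(x_{\text{gen}}[h, w]) \) denotes the quality of the patch in the \( h \)-th row and \( w \)-th column of the generated image. The maximum is taken over all patch positions \( (i, j) \) of the reference image, so a generated patch is scored by its cosine similarity to the best-matching reference patch.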
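The patch quality formula can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: it assumes the feature maps \( f(x_{\text{gen}}) \) and \( f(x_{\text{ref}}) \) have already been extracted and are stored as arrays of shape `(H, W, C)`, and the function name `patch_quality` and the small `eps` added to avoid division by zero are assumptions.

```python
import numpy as np

def patch_quality(feat_gen, feat_ref, eps=1e-8):
    """Max cosine similarity of each generated patch to any reference patch.

    feat_gen: array of shape (H, W, C), patch features of the generated image.
    feat_ref: array of shape (H2, W2, C), patch features of the reference image.
    Returns an array of shape (H, W) holding p(x_gen[h, w]).
    """
    h, w, c = feat_gen.shape
    gen = feat_gen.reshape(-1, c)
    ref = feat_ref.reshape(-1, c)
    # Normalize each patch feature to unit length (eps guards zero vectors).
    gen = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + eps)
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    # Cosine similarity between every generated and every reference patch.
    sim = gen @ ref.T
    # Keep the best-matching reference patch for each generated patch.
    return sim.max(axis=1).reshape(h, w)

# Toy example with 3-dimensional features.
e1, e2, e3 = np.eye(3)
feat_ref = np.stack([e1, 3.0 * e1, e2, 0.5 * e2]).reshape(2, 2, 3)
feat_gen = np.stack([2.0 * e1, e3]).reshape(1, 2, 3)
quality = patch_quality(feat_gen, feat_ref)
```

In the toy example the first generated patch points in the same direction as a reference patch and scores close to 1, while the second is orthogonal to every reference patch and scores 0.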
In the model optimization stage, the loss function \( L_{\text{PatchDPO}} \) is expressed as:
\[ L_{\text{PatchDPO}} = \left\| [\epsilon_{\text{gen}} - \epsilon_\theta(x_{\text{gen}}(t), c_{\text{text}}, x_{\text{ref}}, t)] \odot \tilde{p}(x_{\text{gen}}) \right\|_2^2 + \left\| [\epsilon_{\text{ref}} - \epsilon_\theta(x_{\text{ref}}(