Abstract: Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images that are inconsistent with the reference images at test time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO, which estimates the quality of image patches within each generated image and trains the model accordingly. To this end, PatchDPO first leverages a pre-trained vision model with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experimental results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at https://github.com/hqhQAQ/PatchDPO.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses two problems in finetuning-free personalized image generation: low-quality generated images and inconsistency with the reference images. Existing finetuning-free methods are trained only through a simple image reconstruction task, and the images they generate at test time are usually inconsistent with the reference images in local details, which lowers the quality of the output. To address this, the paper introduces the PatchDPO method.

### Main contributions of PatchDPO

1. **Construct a high-quality dataset**: A high-quality dataset is constructed for PatchDPO training to ensure the effectiveness of model training.
2. **Propose a self-supervised training method based on pre-trained vision models**: It is used to evaluate the quality of each patch in the generated image, providing more accurate feedback for model optimization.
3. **Propose a weighted training method**: The model is optimized according to the estimated patch quality, so that it retains high-quality patches and corrects low-quality patches.
4. **State-of-the-art results**: Experiments show that PatchDPO achieves state-of-the-art performance in single-object and multi-object personalized image generation tasks.

### Specific problem description

Current finetuning-free methods have the following problems:

- **Low-quality generated images**: The generated images are inconsistent with the reference images in local details (as shown in Figure 1, the head, back and legs of the generated image are inconsistent with the reference image).
- **Limitations of the traditional DPO method**: Traditional DPO only judges the overall quality of the entire image and cannot handle the case where the generated image is inconsistent with the reference image only in local areas.

### Solutions

To solve the above problems, PatchDPO proposes the following:

- **Data construction**: Construct a training dataset containing multiple pairs of reference images and generated images.
- **Patch quality evaluation**: Use a pre-trained vision model to extract image features, improve the feature extraction with a self-supervised training method, and then evaluate the quality of each patch in the generated image.
- **Model optimization**: Adopt a weighted training method that optimizes the model according to the estimated patch quality, so that it retains high-quality patches and corrects low-quality patches.

Through these improvements, PatchDPO significantly improves the quality of finetuning-free personalized image generation and achieves state-of-the-art performance on multiple benchmark datasets.

### Formula representation

The patch quality is estimated with the following formula:

\[ p(x_{\text{gen}}[h, w]) = \max_{i,j} \frac{f(x_{\text{gen}})[h, w] \cdot f(x_{\text{ref}})[i, j]}{\|f(x_{\text{gen}})[h, w]\| \|f(x_{\text{ref}})[i, j]\|} \]

where \( x_{\text{gen}} \) and \( x_{\text{ref}} \) are the generated image and the reference image respectively, \( f \) is the pre-trained vision model, and \( p(x_{\text{gen}}[h, w]) \) is the quality of the patch in the \( h \)-th row and \( w \)-th column of the generated image, i.e., the maximum cosine similarity between its feature and the feature of any patch \( [i, j] \) of the reference image.

In the model optimization stage, the loss function \( L_{\text{PatchDPO}} \) is expressed as:

\[ L_{\text{PatchDPO}} = \left\| [\epsilon_{\text{gen}} - \epsilon_\theta(x_{\text{gen}}(t), c_{\text{text}}, x_{\text{ref}}, t)] \odot \tilde{p}(x_{\text{gen}}) \right\|_2^2 + \left\| [\epsilon_{\text{ref}} - \epsilon_\theta(x_{\text{ref}}(
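
The patch-quality formula above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `patch_quality` and `weighted_sq_error` are hypothetical, the feature maps are assumed to be already extracted as arrays of shape (rows, columns, channels), and the weight map passed to `weighted_sq_error` stands in for \( \tilde{p}(x_{\text{gen}}) \), whose derivation from \( p \) is not given here. The second function covers only the first term of the loss, since the rest of the expression is not available.

```python
import numpy as np


def patch_quality(feat_gen: np.ndarray, feat_ref: np.ndarray,
                  eps: float = 1e-8) -> np.ndarray:
    """Max cosine similarity of each generated patch to any reference patch.

    feat_gen: (H, W, C) patch features of the generated image (assumed given).
    feat_ref: (H2, W2, C) patch features of the reference image (assumed given).
    Returns an (H, W) map with values in [-1, 1].
    """
    h, w, c = feat_gen.shape
    gen = feat_gen.reshape(-1, c)
    ref = feat_ref.reshape(-1, feat_ref.shape[-1])
    # L2-normalise so that a dot product equals cosine similarity.
    gen = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + eps)
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    sim = gen @ ref.T                      # (H*W, H2*W2) pairwise similarities
    return sim.max(axis=1).reshape(h, w)   # max over all reference patches


def weighted_sq_error(noise: np.ndarray, noise_pred: np.ndarray,
                      weights: np.ndarray) -> float:
    """Squared L2 norm of the elementwise-weighted noise residual.

    Mirrors only the first term of the loss; `weights` must broadcast
    against the residual, e.g. shape (H, W, 1) for an (H, W, C) residual.
    """
    return float(np.sum(((noise - noise_pred) * weights) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_ref = rng.normal(size=(4, 4, 8))
    feat_gen = rng.normal(size=(4, 4, 8))
    quality = patch_quality(feat_gen, feat_ref)
    print(quality.shape, float(quality.min()), float(quality.max()))
```

Normalising both feature sets first turns the double maximisation over \( i, j \) into a single matrix product followed by a row-wise maximum, which avoids explicit loops over patches.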

PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

Scalable Ranked Preference Optimization for Text-to-Image Generation

Patched Denoising Diffusion Models For High-Resolution Image Synthesis

Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

Patching in Order: Efficient On-Device Model Fine-Tuning for Multi-DNN Vision Applications

Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

Personalized Restoration via Dual-Pivot Tuning

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization

DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model

SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Personalized Image Generation with Large Multimodal Models

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Curriculum Direct Preference Optimization for Diffusion and Consistency Models

PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models