Abstract: Most text-to-image customization techniques fine-tune models on a small set of *personal concept* images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts *solely* within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.
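As a rough illustration of what a "contextually rich set of text prompts" could look like, the sketch below combines a few context fragments around a placeholder token. The token name `<concept>`, the fragment lists, and the `build_prompts` helper are all hypothetical; the paper's actual prompt set and construction procedure are not given in this abstract.

```python
import itertools

# Hypothetical placeholder token standing in for the learned personal concept.
CONCEPT_TOKEN = "<concept>"

# Illustrative context fragments (assumptions, not the paper's prompt set).
SUBJECT_FORMS = ["a photo of {c}", "a painting of {c}", "a close-up of {c}"]
LOCATIONS = ["on a beach", "in a snowy forest", "on a city street"]
MODIFIERS = ["at sunset", "under bright studio lights"]


def build_prompts(concept_token=CONCEPT_TOKEN):
    """Return one prompt per subject/location/modifier combination."""
    prompts = []
    for form, place, mod in itertools.product(SUBJECT_FORMS, LOCATIONS, MODIFIERS):
        prompts.append(f"{form.format(c=concept_token)} {place} {mod}")
    return prompts


prompts = build_prompts()
for p in prompts[:3]:
    print(p)
```

Because the prompts are text only, enlarging the context set this way requires no additional images of the concept.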

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the problem that **text-to-image customization models are prone to overfitting when fine-tuned on a small number of personalized concept images**. Specifically, most existing text-to-image customization techniques fine-tune the model on a few personal concept images captured against specific backgrounds. The model over-adapts to these training images and fails to generalize to new backgrounds requested in later text prompts. As a result, the generated images often closely resemble the training images and are not faithful to the new prompts.

#### Main problems:

1. **Concept overfitting**: Existing methods usually fine-tune on 4 to 5 images containing the personal concept, so the model overfits these images and cannot generalize to new backgrounds.
2. **Lack of context diversity**: Because the fine-tuning dataset is so limited, the model has difficulty learning rich context information, which reduces the diversity and accuracy of the generated images.

#### Solutions:

To solve these problems, the authors propose introducing diverse contexts in the text space. Specifically, they improve the model in the following ways:

- **Construct a rich set of text prompts**: By creating diverse text prompts that contain the personal concept, the model can learn broader context information.
- **Adopt masked language modeling (MLM)**: The MLM objective guides the concept embedding to learn its relationship with the surrounding context, improving the semantic consistency of the text representation.
- **No change to the model architecture**: The method requires no modification to the existing model architecture, so it can be integrated directly with existing text-to-image customization methods.

With these improvements, the authors show that their method improves the prompt fidelity of the generated images and achieves notable CLIP score improvements when combined with multiple baseline methods.

#### Formula summary:

- **MLM loss**:
  \[ L_{\text{MLM}} = \mathbb{E}_{y, P_{\text{masked}}}\left[\text{CrossEntropy}\left(y, \psi(\Gamma(P_{\text{masked}}))\right)\right] \]
- **Total loss function**:
  \[ L_{\text{total}} = L_{\text{Diff}}(z_t, t, e_C) + \lambda L_{\text{MLM}}(C_{\text{masked}}) \]
  where \(L_{\text{Diff}}\) is the denoising loss, \(L_{\text{MLM}}\) is the masked language modeling loss, and \(\lambda\) is a weight parameter.

With this method, the authors address the overfitting problem common in text-to-image customization and improve the quality and diversity of the generated images.
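The two losses in the formula summary can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the text encoder and prediction head (apparently \(\Gamma\) and \(\psi\) in the formula) are replaced here by hand-supplied logits, the denoising loss is passed in as a plain number, and the helper names, the `[MASK]` token, and the choice to leave the concept token unmasked are all assumptions made for illustration.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_prompt(tokens, mask_prob, rng, keep=()):
    """Replace random tokens with MASK; return masked tokens and {position: original}."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in keep and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def mlm_loss(per_position_logits, targets, vocab):
    """Average cross-entropy over the masked positions (the expectation in L_MLM)."""
    if not targets:
        return 0.0
    losses = [
        cross_entropy(per_position_logits[i], vocab.index(tok))
        for i, tok in targets.items()
    ]
    return sum(losses) / len(losses)


def total_loss(diff_loss, mlm, lam):
    """L_total = L_Diff + lambda * L_MLM."""
    return diff_loss + lam * mlm


tokens = "a photo of <concept> on a beach".split()
vocab = sorted(set(tokens))
masked, targets = mask_prompt(tokens, 0.3, random.Random(0), keep=("<concept>",))
# Stand-in for model predictions: uniform logits at every position.
logits = [[0.0] * len(vocab) for _ in tokens]
print(masked, total_loss(diff_loss=0.5, mlm=mlm_loss(logits, targets, vocab), lam=0.1))
```

In a real training loop the logits would come from the text encoder followed by a prediction head, and \(\lambda\) would trade off prompt alignment against the denoising objective.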

Learning to Customize Text-to-Image Diffusion In Diverse Context

AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

Multi-Concept Customization of Text-to-Image Diffusion

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Customizing Text-to-Image Diffusion with Object Viewpoint Control

Customization Assistant for Text-to-image Generation

Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

CustomText: Customized Textual Image Generation using Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

Customized Generation Reimagined: Fidelity and Editability Harmonized

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

Tuning-Free Image Customization with Image and Text Guidance

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization