Learning to Customize Text-to-Image Diffusion In Diverse Context

Taewook Kim, Wei Chen, Qiang Qiu
2024-10-14
Abstract: Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts solely within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses the tendency of **text-to-image customization models to overfit when fine-tuned on a small number of personal concept images**. Most existing customization techniques fine-tune the model on a handful of personal concept images captured in minimal contexts. The model then over-adapts to these training images and fails to generalize to new contexts in later text prompts. As a result, generated images often closely resemble the training images and are not faithful to the new prompt.

#### Main problems:

1. **Concept overfitting**: Existing methods usually fine-tune on 4-5 images of the personal concept, so the model overfits to these images and cannot generalize to new contexts.
2. **Lack of context diversity**: Because the fine-tuning dataset is so limited, the model has difficulty learning rich context information, which hurts the diversity and accuracy of the generated images.

#### Solutions:

To address these problems, the authors propose introducing diverse contexts in the text space. Specifically, they improve the model in the following ways:

- **Construct a rich set of text prompts**: Creating diverse text prompts that contain the personal concept lets the model learn a wider range of context information.
- **Adopt masked language modeling (MLM)**: The MLM objective guides the concept embedding to learn its relationship with the surrounding context, improving the semantic consistency of the text representation.
- **No change to the model architecture**: The method requires no modification to the existing model architecture, so it can be combined with existing text-to-image customization methods.
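As a rough illustration of the prompt-construction and masking steps listed above, the following minimal sketch builds a small set of context-rich prompts around a concept placeholder and randomly masks context words to produce MLM training pairs. The templates, the `<concept>` and `[MASK]` token names, the 25% masking rate, and the choice never to mask the placeholder are all assumptions made for illustration; they are not taken from the paper.

```python
import random

PLACEHOLDER = "<concept>"  # hypothetical placeholder token for the personal concept
MASK = "[MASK]"            # hypothetical mask token

# Hypothetical context templates; the paper's actual prompt set is not reproduced here.
TEMPLATES = [
    "a photo of {c} on a sandy beach",
    "{c} sitting next to a red bicycle",
    "an oil painting of {c} in a snowy forest",
    "{c} floating in a swimming pool at night",
]


def build_prompts(templates=TEMPLATES, concept=PLACEHOLDER):
    """Fill every template with the concept placeholder."""
    return [t.format(c=concept) for t in templates]


def mask_prompt(prompt, rng, mask_prob=0.25, concept=PLACEHOLDER):
    """Randomly replace context words with MASK.

    Returns (masked_tokens, targets), where targets[i] holds the original
    word at masked positions and None elsewhere. The concept placeholder is
    never masked, which is an assumption of this sketch.
    """
    masked, targets = [], []
    for tok in prompt.split():
        if tok != concept and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for prompt in build_prompts():
        masked, _ = mask_prompt(prompt, rng)
        print(" ".join(masked))
```

In a real pipeline the whitespace split would be replaced by the text encoder's own tokenizer, and the targets would be token ids rather than words.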
Through these improvements, the authors show that their method not only significantly improves the prompt fidelity of generated images but also achieves notable CLIP score improvements when combined with multiple baseline methods.

#### Formula summary:

- **MLM loss**:
\[ L_{\text{MLM}} = \mathbb{E}_{y, P_{\text{masked}}}\left[\text{CrossEntropy}\left(y, \psi(\Gamma(P_{\text{masked}}))\right)\right] \]
- **Total loss function**:
\[ L_{\text{total}} = L_{\text{Diff}}(z_t, t, e_C) + \lambda L_{\text{MLM}}(C_{\text{masked}}) \]

where \(L_{\text{Diff}}\) is the denoising loss, \(L_{\text{MLM}}\) is the masked language modeling loss, and \(\lambda\) is a weight parameter.

With this method, the authors address the overfitting problem common in text-to-image customization and improve the quality and diversity of the generated images.
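To make the total loss concrete, here is a minimal sketch of how a denoising loss and an MLM loss could be combined as \(L_{\text{Diff}} + \lambda L_{\text{MLM}}\). It is plain Python for readability: the function names, the averaging over masked positions, and the default `lam=0.1` are assumptions for illustration, and the denoising loss is passed in as a precomputed number rather than derived from a diffusion model.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one softmax prediction against a target class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def mlm_loss(logits_per_position, target_ids):
    """Mean cross-entropy over masked positions only.

    target_ids[i] is the vocabulary index of the original token at a masked
    position and None at unmasked positions, which are skipped.
    """
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(logits_per_position, target_ids)
        if target is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(diff_loss, logits_per_position, target_ids, lam=0.1):
    """L_total = L_Diff + lambda * L_MLM; lam=0.1 is an illustrative value."""
    return diff_loss + lam * mlm_loss(logits_per_position, target_ids)


if __name__ == "__main__":
    # Two token positions over a toy 4-word vocabulary; only the first is masked.
    logits = [[0.2, 1.5, -0.3, 0.0], [1.0, 2.0, 3.0, 4.0]]
    targets = [1, None]
    print(total_loss(diff_loss=0.5, logits_per_position=logits, target_ids=targets))
```

In practice both terms would be computed as tensors in a deep learning framework so that gradients from each reach the concept embedding; the arithmetic of the combination is the same.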