Prior Preserved Text-to-Image Personalization Without Image Regularization

Zhicai Wang, Ouxiang Li, Tan Wang, Longhui Wei, Yanbin Hao, Xiang Wang, Qi Tian
DOI: https://doi.org/10.1109/tcsvt.2024.3485236
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: The current state-of-the-art text-to-image (T2I) models have found numerous applications, driven by their ability to produce photorealistic images. Concept learning, one notable application, aims to enable T2I models to generate personalized content so that users can create images according to their interests. However, concept learning often involves model fine-tuning, which brings the risk of overfitting. Such overfitting reduces the output diversity of the T2I model and results in poor editability. To mitigate the overfitting problem, we introduce two simple yet effective designs, namely masked textual inversion (MaskTI) and text regularization (TextReg). MaskTI is a variant of vanilla textual inversion that forces the learnable identifier to attend only to the class descriptor. This modification effectively reduces overfitting to irrelevant backgrounds. TextReg regularizes the fine-tuning of cross-attention modules with simple text prompts that contain no identifiers, which avoids the use of real images as the regularization prior. Our extensive experiments demonstrate that our approach not only effectively protects prior knowledge but also gives the personalized model high editability.