ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Yunchao Wei
2024-05-28
Abstract: Recent text-to-image customization works have been proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, these methods tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (e.g., the headphone is missing when generating "a <sks> dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. Inspired by this observation, we present ClassDiffusion, a simple technique that leverages a semantic preservation loss to explicitly regulate the concept space when learning the new concept. Despite its simplicity, this helps avoid semantic drift when fine-tuning on the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of the semantic preservation loss effectively improves the compositional abilities of the fine-tuned models. In response to the ineffective evaluation of the CLIP-T metric, we introduce the BLIP2-T metric, a more equitable and effective evaluation metric for this particular domain. We also provide an in-depth empirical study and theoretical analysis to better understand the role of the proposed loss. Lastly, we extend ClassDiffusion to personalized video generation, demonstrating its flexibility.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem addressed in this paper is that text-to-image generation models (such as Stable Diffusion) lose compositional ability during personalized fine-tuning. Specifically, when the model is fine-tuned on a few images of a specific concept (e.g., a particular dog), it tends to overfit that concept and can no longer combine it correctly with other elements at generation time (e.g., when asked for the dog wearing headphones, the dog is generated but the headphones are missing). Through experimental observations and theoretical analysis, the authors find that this decline in compositional ability stems from the target concept drifting away from its superclass semantics during fine-tuning. To address this issue, the authors propose ClassDiffusion, which explicitly regulates the concept space by introducing a Semantic Preservation Loss (SPL) to avoid semantic drift during fine-tuning. Additionally, the paper introduces a new evaluation metric, BLIP2-T, to assess text-image alignment more fairly and effectively, and extends ClassDiffusion to personalized video generation to demonstrate its flexibility.
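To make the idea of a semantic preservation term more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that semantic drift can be measured as the cosine distance between a text embedding of the personalized phrase and a text embedding of its superclass phrase, and that this term is added to the usual fine-tuning loss with a weight. The function names, the `spl_weight` parameter, and the toy two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def total_loss(denoising_loss, concept_emb, class_emb, spl_weight=1.0):
    """Combine a fine-tuning loss with a semantic preservation penalty.

    Assumption: the penalty is the cosine distance between the embedding of
    the personalized phrase and the embedding of its superclass phrase, so it
    grows as the learned concept drifts away from its class.
    """
    spl = cosine_distance(concept_emb, class_emb)
    return denoising_loss + spl_weight * spl


# Toy embeddings (hypothetical values, not produced by a real text encoder).
class_emb = [1.0, 0.0]    # stands in for the superclass phrase, e.g. "dog"
aligned_emb = [2.0, 0.0]  # personalized phrase still pointing along the class
drifted_emb = [0.0, 1.0]  # personalized phrase that has drifted away

loss_aligned = total_loss(0.5, aligned_emb, class_emb)
loss_drifted = total_loss(0.5, drifted_emb, class_emb)
print(loss_aligned, loss_drifted)
```

In this toy setting the drifted embedding incurs a larger total loss than the aligned one, which is the behavior a preservation term of this kind is meant to provide: the optimizer is discouraged from moving the personalized concept away from its superclass.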