ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Yunchao Wei

2024-05-28

Abstract: Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, these methods tend to overfit the concepts, resulting in a failure to create the concept under multiple conditions (e.g., the headphone is missing when generating 'a <sks> dog wearing a headphone'). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., 'a dog wearing a headphone'), implying that the compositional ability only disappears after personalization tuning. Inspired by this observation, we present ClassDiffusion, a simple technique that leverages a semantic preservation loss to explicitly regulate the concept space when learning the new concept. Despite its simplicity, this helps avoid semantic drift when fine-tuning on the target concepts. Extensive qualitative and quantitative experiments demonstrate that the semantic preservation loss effectively improves the compositional abilities of the fine-tuned models. In response to the ineffectiveness of the CLIP-T metric for evaluation, we introduce the BLIP2-T metric, a more equitable and effective evaluation metric for this particular domain. We also provide an in-depth empirical study and theoretical analysis to better understand the role of the proposed loss. Lastly, we extend ClassDiffusion to personalized video generation, demonstrating its flexibility.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The main problem addressed in this paper is the loss of compositional ability that text-to-image generation models (such as Stable Diffusion) suffer during personalized fine-tuning. Specifically, when a model is fine-tuned on a few images of a specific concept (e.g., a given dog), it tends to overfit that concept and can no longer combine it correctly with other elements at generation time (e.g., when prompted for the dog wearing headphones, the dog is generated but the headphones are missing). Through experimental observations and theoretical analysis, the authors find that this decline in compositional ability stems from the target concept drifting away from the semantics of its superclass during fine-tuning. To address this issue, the authors propose ClassDiffusion, which introduces a Semantic Preservation Loss (SPL) that explicitly regulates the concept space and so avoids semantic drift during fine-tuning. Additionally, the paper introduces a new evaluation metric, BLIP2-T, to assess text-image alignment more fairly and effectively, and demonstrates the flexibility of ClassDiffusion by extending it to personalized video generation.
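The idea of keeping a personalized concept close to its superclass can be illustrated with a small sketch. The page does not give the exact form of the Semantic Preservation Loss, so the code below assumes one plausible form: one minus the cosine similarity between an embedding of the personalized phrase and an embedding of the class phrase, added to the usual fine-tuning loss with a weight. The function names, the `spl_weight` parameter, and the toy three-dimensional vectors are all hypothetical; in practice the embeddings would come from the model's text encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def semantic_preservation_loss(concept_emb, class_emb):
    """Assumed form of the loss: 1 - cos(concept, class).

    It is 0 when the personalized concept points in the same direction
    as its superclass and grows as the concept drifts away.
    """
    return 1.0 - cosine_similarity(concept_emb, class_emb)


def total_loss(recon_loss, concept_emb, class_emb, spl_weight=1.0):
    """Fine-tuning loss plus the weighted preservation term.

    `spl_weight` is a hypothetical hyperparameter for this sketch.
    """
    return recon_loss + spl_weight * semantic_preservation_loss(
        concept_emb, class_emb
    )


# Toy embeddings standing in for text-encoder outputs (illustrative only).
class_emb = [1.0, 0.0, 0.0]    # e.g. the phrase "dog"
aligned_emb = [0.9, 0.1, 0.0]  # "<sks> dog" still close to "dog"
drifted_emb = [0.1, 0.9, 0.3]  # "<sks> dog" after semantic drift

aligned_penalty = semantic_preservation_loss(aligned_emb, class_emb)
drifted_penalty = semantic_preservation_loss(drifted_emb, class_emb)
```

Under this assumed form, the drifted embedding receives a larger penalty than the aligned one, so minimizing the combined loss pushes the learned concept back toward its class while the reconstruction term still fits the example images.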

Non-confusing Generation of Customized Concepts in Diffusion Models

Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Multi-Concept Customization of Text-to-Image Diffusion

Learning to Customize Text-to-Image Diffusion In Diverse Context

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Concept-centric Personalization with Large-scale Diffusion Priors

TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Training Class-Imbalanced Diffusion Model Via Overlap Optimization

Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

An Improved Method for Personalizing Diffusion Models

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

Attention Calibration for Disentangled Text-to-Image Personalization

LCM-Lookahead for Encoder-based Text-to-Image Personalization

MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration

ChatDiff: A ChatGPT-based diffusion model for long-tailed classification

Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models