Multi-Concept Customization of Text-to-Image Diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu
2023-06-21
Abstract: While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms or performs on par with several baselines and concurrent works in both qualitative and quantitative evaluations while being memory and computationally efficient.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to let an existing text-to-image generation model quickly learn new concepts and seamlessly combine them with concepts it already knows when generating new images. Specifically, it focuses on three challenges:

1. **Model Forgetting**: Adding a new concept should not cause the model to forget or change the meaning of concepts it has already learned. For example, adding the concept "moon gate" should not cause the model to lose the concept "moon".
2. **Overfitting**: Because only a few training samples are available for a new concept, the model is prone to overfitting to them, which reduces the variety of generated images.
3. **Multi-Concept Combination**: The model should learn new concepts individually and also combine several of them to generate complex scenes, for example an image of a pet dog wearing sunglasses standing in front of a moon gate.

To address these challenges, the paper proposes the **Custom Diffusion** method. It optimizes only a small subset of the parameters of the text-to-image model (mainly the key and value mappings in the cross-attention layers), which lets the model learn new concepts efficiently and combine them with existing ones. According to the paper, the method outperforms or performs on par with several baselines in both single-concept learning and multi-concept composition.
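The idea of updating only the key and value mappings in cross-attention while keeping everything else frozen can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the parameter names (`attn1` for self-attention, `attn2` for cross-attention, `to_k`/`to_v` for the key and value projections), the toy weights, and the helpers `is_trainable` and `sgd_step` are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed naming convention (hypothetical, not taken from the paper):
# "attn2" marks a cross-attention layer, and "to_k" / "to_v" are its
# key and value projection weights.
CROSS_ATTN_TAG = "attn2"
KV_SUFFIXES = ("to_k.weight", "to_v.weight")


def is_trainable(name: str) -> bool:
    """Return True only for key/value projection weights in cross-attention."""
    return CROSS_ATTN_TAG in name and name.endswith(KV_SUFFIXES)


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    lr: float,
) -> Dict[str, List[float]]:
    """Apply one gradient step to the selected parameters; copy the rest unchanged."""
    updated = {}
    for name, values in params.items():
        if is_trainable(name):
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen parameter
    return updated


# Toy parameter table with flattened weights, purely illustrative.
params = {
    "block.attn1.to_k.weight": [1.0, 1.0],  # self-attention key: frozen
    "block.attn2.to_q.weight": [1.0, 1.0],  # cross-attention query: frozen
    "block.attn2.to_k.weight": [1.0, 1.0],  # cross-attention key: trained
    "block.attn2.to_v.weight": [1.0, 1.0],  # cross-attention value: trained
    "block.ff.weight": [1.0, 1.0],          # feed-forward: frozen
}
grads = {name: [0.5, 0.5] for name in params}

trainable = [name for name in params if is_trainable(name)]
updated = sgd_step(params, grads, lr=0.1)

print("trainable:", trainable)
for name, values in updated.items():
    print(name, values)
```

In this toy setup only two of the five parameter groups change after the step. Restricting updates to such a small subset is what the summary above credits for fast tuning and for limiting how much existing concepts drift; in a real model the same selection would typically be expressed by freezing parameters in the training framework rather than by a hand-written update loop.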