MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Donghao Zhou, Jiancheng Huang, Jinbin Bai, Jiaze Wang, Hao Chen, Guangyong Chen, Xiaowei Hu, Pheng-Ann Heng
2024-10-17
Abstract: Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of the individual component within the concept. In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure specific components when personalizing visual concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of the concept and component. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for desired visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is **how to achieve precise, controllable personalization of specific components of a visual concept in text-to-image (T2I) diffusion models**. Although existing T2I models can generate high-quality images from text prompts, they still offer little fine-grained control over the individual components of a concept when generating images that contain it.

### Main problems and challenges

1. **Semantic Pollution**:
   - When the model learns visual semantics from the reference images, it may inadvertently pick up irrelevant visual elements, which "pollute" the personalized concept. For example, when generating an image of a person, features that do not belong to that person may be mixed in.
   - Formula representation (schematic):
     \[ \text{Generated image} = f(\text{Reference image}) + \epsilon \]
     where \(\epsilon\) stands for the irrelevant visual elements. It is not the diffusion noise \(\epsilon\) used in the loss under Semantic Imbalance.
2. **Semantic Imbalance**:
   - The model may over-focus on certain aspects during training, so the concept and the component are learned unevenly. For example, the model may be more inclined to learn a complex roof than a simple tower.
   - Formula representation (the masked diffusion loss):
     \[ L_{\text{diff}} = \sum_{n,k} \left\| \epsilon \odot M'_{nk} - \epsilon_\theta(z^{(t)}_{nk}, t, e_n) \odot M'_{nk} \right\|^2_2 \]
     where \(L_{\text{diff}}\) is the loss that measures how well the model has learned, \(\epsilon\) is the unscaled noise, \(z^{(t)}_{nk}\) is the noisy latent image at a random time step \(t\), \(e_n\) is the text embedding of the corresponding text prompt, and \(M'_{nk}\) is the mask obtained by down-sampling the segmentation mask \(M_{nk}\).

### Solutions

To solve the above problems, the paper proposes the **MagicTailor** framework, which includes two key techniques:

1. 
   **Dynamic Masked Degradation (DM-Deg)**:
   - Dynamically introduces Gaussian noise in the reference image to suppress irrelevant visual semantics while maintaining the overall visual context.
   - Dynamic intensity formula:
     \[ \alpha_d = \alpha_{\text{init}} \left( 1 - \left( \frac{d}{D} \right)^\gamma \right) \]
     where \(\alpha_d\) is the dynamic weight at step \(d\), \(\alpha_{\text{init}}\) is the initial value, \(d\) is the current training step, \(D\) is the total number of training steps, and \(\gamma\) is a factor that adjusts the rate of decline. The weight therefore starts at \(\alpha_{\text{init}}\) when \(d = 0\) and decays to \(0\) when \(d = D\).
2. **Dual-Stream Balancing (DS-Bal)**:
   - Uses a dual-stream learning paradigm, with an online denoising U-Net and a momentum denoising U-Net, to keep the learning of visual semantics balanced between the concept and the component.
   - Max-min optimization formula:
     \[ L_{\text{diff-max}} = \max_n \sum_{k,\epsilon,t,h} \left\| \epsilon \odot M'_{nk} - \epsilon_\theta(z^{(t)}_{nk}, t, e_n) \odot M'_{nk} \right\|^2_2 \]
   - Selective retention regularization formula: \[ L_{\text{