Abstract: Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of the individual component within the concept. In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure specific components when personalizing visual concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of the concept and component. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for desired visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to achieve precise and controllable personalization of specific components of a visual concept in text-to-image (T2I) diffusion models**. Although existing T2I models can generate high-quality images from text prompts, they still struggle to offer fine-grained control over each component of a concept when generating images that contain specific visual concepts.

### Main problems and challenges

1. **Semantic Pollution**:
   - When the model learns the visual semantics of the reference image, it may inadvertently introduce irrelevant visual elements, thus "polluting" the personalized concept. For example, when generating an image of a person, features that do not belong to this person may be mixed in.
   - Formula representation: \[ \text{Generated image} = f(\text{Reference image})+\epsilon \] where \(\epsilon\) represents irrelevant visual elements (in this formula only; the loss below uses \(\epsilon\) for noise).
2. **Semantic Imbalance**:
   - The model may over-focus on certain aspects during learning, resulting in unbalanced learning of the concept and the component. For example, the model may be more inclined to learn a complex roof than a simple tower.
   - Formula representation: \[ L_{\text{diff}}=\sum_{n,k}\left\|\epsilon\odot M'_{nk}-\epsilon_\theta(z^{(t)}_{nk},t,e_n)\odot M'_{nk}\right\|^2_2 \] where \(L_{\text{diff}}\) is the loss function used to measure the model's learning effect, \(\epsilon\) is the unscaled noise, \(z^{(t)}_{nk}\) is the noisy latent image at a random time step \(t\), \(e_n\) is the text embedding of the corresponding text prompt, and \(M'_{nk}\) is the mask obtained by down-sampling the segmentation mask \(M_{nk}\).

### Solutions

To solve the above problems, the paper proposes the **MagicTailor** framework, which includes two key techniques:

1. **Dynamic Masked Degradation (DM-Deg)**:
   - Gaussian noise is dynamically introduced into the reference image to suppress irrelevant visual semantics while maintaining the overall visual context.
   - Dynamic intensity formula: \[ \alpha_d=\alpha_{\text{init}}\left(1-\left(\frac{d}{D}\right)^\gamma\right) \] where \(\alpha_d\) is the dynamic weight, \(\alpha_{\text{init}}\) is the initial value, \(d\) is the current training step, \(D\) is the total number of training steps, and \(\gamma\) is a factor that adjusts the rate of decline.
2. **Dual-Stream Balancing (DS-Bal)**:
   - A dual-stream learning paradigm, built from an online denoising U-Net and a momentum denoising U-Net, keeps the visual semantic learning of concepts and components balanced.
   - Max-min optimization formula: \[ L_{\text{diff-max}}=\max_n\sum_{k,\epsilon,t,h}\left\|\epsilon\odot M'_{nk}-\epsilon_\theta(z^{(t)}_{nk},t,e_n)\odot M'_{nk}\right\|^2_2 \]
   - Selective retention regularization formula: \[ L_{\text{
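
To make two of the formulas above concrete, here is a minimal Python sketch, not the authors' implementation. It evaluates the DM-Deg intensity schedule \(\alpha_d=\alpha_{\text{init}}(1-(d/D)^\gamma)\) and a single masked squared-error term of the form \(\|\epsilon\odot M'-\epsilon_\theta\odot M'\|^2_2\). The function names `dm_deg_weight` and `masked_squared_error`, the flat-list representation of tensors, and the values `alpha_init=0.5`, `gamma=2.0`, and `D=1000` are assumptions made for illustration; the summary does not state the settings used in the paper.

```python
from typing import Sequence


def dm_deg_weight(step: int, total_steps: int, alpha_init: float, gamma: float) -> float:
    """Dynamic intensity alpha_d = alpha_init * (1 - (d / D) ** gamma).

    The weight starts at alpha_init (d = 0) and decays to 0 (d = D);
    gamma controls how quickly it declines.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return alpha_init * (1.0 - (step / total_steps) ** gamma)


def masked_squared_error(
    noise: Sequence[float],
    predicted_noise: Sequence[float],
    mask: Sequence[float],
) -> float:
    """Squared L2 norm of (noise * mask - predicted_noise * mask).

    Inputs are flattened to plain sequences for illustration; positions
    where the mask is 0 contribute nothing to the loss.
    """
    if not (len(noise) == len(predicted_noise) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return sum(
        (e * m - p * m) ** 2 for e, p, m in zip(noise, predicted_noise, mask)
    )


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the schedule.
    total = 1000
    for d in (0, 250, 500, 750, 1000):
        print(d, dm_deg_weight(d, total, alpha_init=0.5, gamma=2.0))

    # Toy masked loss: the middle position is masked out.
    print(masked_squared_error([1.0, 2.0, 3.0], [0.0, 2.0, 1.0], [1.0, 0.0, 1.0]))
```

With \(\gamma > 1\) the weight stays close to \(\alpha_{\text{init}}\) early in training and drops faster toward the end; with \(\gamma < 1\) it falls quickly at first and then levels off.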

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

Magic Clothing: Controllable Garment-Driven Image Synthesis

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

Magic-Me: Identity-Specific Video Customized Diffusion

Attention Calibration for Disentangled Text-to-Image Personalization

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Create Your World: Lifelong Text-to-Image Diffusion

Multi-Concept Customization of Text-to-Image Diffusion

An Improved Method for Personalizing Diffusion Models

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

Learning to Customize Text-to-Image Diffusion In Diverse Context

DreamSteerer: Enhancing Source Image Conditioned Editability using Personalized Diffusion Models

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

Prior Preserved Text-to-Image Personalization Without Image Regularization

Imagic: Text-Based Real Image Editing with Diffusion Models

ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement