Abstract: Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images based on personalized visual concepts using a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes as defined by textual inputs. To address this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined text scenes, effectively fusing visual subject instances with text-specified scenes and producing high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which composites subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images, ensuring they align effectively with both the instance image and the scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.
### What problems does this paper attempt to solve?
This paper aims to solve two key problems in personalized Text-to-Image (T2I) generation: **instance fidelity** and **scene fidelity**. Specifically:
1. **Instance fidelity problem**:
- Existing T2I models often struggle to maintain the visual features of the main instances (such as specific objects, animals, etc.) provided by users when generating new images. For example, when generating a photo containing a specific dog, existing methods may not be able to accurately preserve the unique appearance of this dog.
2. **Scene fidelity problem**:
- When generating new scenes according to text prompts, existing models may ignore or misrepresent scene details. For example, when asked to generate a scene of "a dog in the rain", the model may fail to render the details of the rain correctly, such as raindrops and umbrellas.
To solve these problems, the paper proposes a new method named **ComFusion**. ComFusion improves T2I generation through two streams:
- **Composite Stream**: Introduces the class-scene prior loss to preserve the pretrained model's understanding of classes and scenes, thereby enhancing scene fidelity.
- **Fusion Stream**: Uses the visual-textual matching loss to combine the visual information of the main instance with the text description of the scene, so that the generated image is both faithful to the main instance and consistent with the scene description.
Through these two streams, ComFusion can generate high-quality personalized images in diverse scenes while maintaining high instance and scene fidelity.
### Formula summary
- **Instance Finetune Loss**:
\[
L_{I}^{C}=\mathbb{E}_{z \sim \{z_I\}, \epsilon, t}\left[\left\|\epsilon-\epsilon_{\theta}(z_t, t, \Gamma(T_I))\right\|_2^2\right]
\]
where \(z_t\) is the noisy latent variable at time step \(t\), \(\epsilon\) is the unscaled noise sampled from a Gaussian distribution, \(z_I\) is the latent variable of the instance image, and \(\Gamma(T_I)\) is the text-encoder embedding of the instance text \(T_I\).
- **Class-Scene Prior Loss**:
\[
L_{S}^{C}=\mathbb{E}_{(z, T) \sim \{(z_{CS}, T_{CS})\}, \epsilon, t}\left[\left\|\epsilon-\epsilon_{\theta}(z, t, \Gamma(T))\right\|_2^2\right]
\]
where \((z_{CS}, T_{CS})\) is a latent-text pair formed from a class-scene prior image and its text.
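
The instance finetune loss and the class-scene prior loss share the same noise-prediction form and differ only in which (latent, text) pairs they average over. The following is a minimal NumPy sketch of that shared form, not the paper's implementation. The forward-noising rule in `add_noise` is the standard DDPM form and is an assumption here, since the summary does not state how \(z_t\) is produced. `toy_denoiser`, the tensor shapes, and the random latents and embeddings are hypothetical stand-ins for the U-Net \(\epsilon_\theta\), the encoded images, and \(\Gamma(T)\).

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    """Forward-diffuse a clean latent (standard DDPM form; an assumption here)."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def noise_prediction_loss(eps_theta, latents, text_embs, alpha_bars, rng):
    """Monte Carlo estimate of E[ ||eps - eps_theta(z_t, t, Gamma(T))||_2^2 ]."""
    losses = []
    for z0, cond in zip(latents, text_embs):
        t = rng.integers(len(alpha_bars))       # random time-step index
        eps = rng.standard_normal(z0.shape)     # Gaussian noise target
        z_t = add_noise(z0, eps, alpha_bars[t])
        pred = eps_theta(z_t, t, cond)          # predicted noise
        losses.append(np.sum((eps - pred) ** 2))
    return float(np.mean(losses))

def toy_denoiser(z_t, t, cond):
    # Hypothetical stand-in for the U-Net: always predicts zero noise.
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
alpha_bars = np.linspace(0.99, 0.01, 10)        # hypothetical noise schedule

# Stand-ins for instance pairs (z_I, Gamma(T_I)) and prior pairs (z_CS, Gamma(T_CS)).
instance_latents = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
instance_embs = [rng.standard_normal(16) for _ in range(3)]
prior_latents = [rng.standard_normal((4, 8, 8)) for _ in range(5)]
prior_embs = [rng.standard_normal(16) for _ in range(5)]

loss_instance = noise_prediction_loss(
    toy_denoiser, instance_latents, instance_embs, alpha_bars, rng)
loss_prior = noise_prediction_loss(
    toy_denoiser, prior_latents, prior_embs, alpha_bars, rng)
print(loss_instance, loss_prior)
```

In this sketch the same function serves both losses; only the sampled pairs change, which mirrors how the two expectations above differ only in their sampling sets.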
- **Visual-Textual Matching Loss**:
\[
L_{I}^{F}=\mathbb{E}_{x \sim \{\tilde{x}_{IS}^k\}}\left[-\mathrm{DINO}(x, x_I)\right]
\]
\[
L_{S}^{F}=\mathbb{E}_{(x', T) \sim \{(\tilde{x}_{IS}^k, T_{CS}^k)\}}\left[-\mathrm{CLIP}(x', T)\right]
\]
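
Both matching terms are negated similarity scores averaged over the coarse generated images, so minimizing them pushes the scores up. The sketch below illustrates that structure under a loud assumption: the DINO and CLIP scores are treated as cosine similarities between embedding vectors, which the summary does not confirm. No real DINO or CLIP model is called; the vectors are random stand-ins, and all function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def instance_matching_loss(gen_img_embs, instance_img_emb):
    """Sketch of L_I^F: mean of -sim(generated image, instance image)."""
    sims = [cosine_similarity(g, instance_img_emb) for g in gen_img_embs]
    return -float(np.mean(sims))

def scene_matching_loss(gen_img_embs, scene_text_embs):
    """Sketch of L_S^F: mean of -sim(generated image k, scene text k)."""
    sims = [cosine_similarity(g, t)
            for g, t in zip(gen_img_embs, scene_text_embs)]
    return -float(np.mean(sims))

rng = np.random.default_rng(0)
# Random stand-ins for embeddings of the K coarse generated images,
# the instance image x_I, and the K scene texts T_CS^k.
K, dim = 4, 32
gen_image_feats = [rng.standard_normal(dim) for _ in range(K)]
instance_feat = rng.standard_normal(dim)
scene_text_feats = [rng.standard_normal(dim) for _ in range(K)]

loss_if = instance_matching_loss(gen_image_feats, instance_feat)
loss_sf = scene_matching_loss(gen_image_feats, scene_text_feats)
print(loss_if, loss_sf)
```

Under the cosine-similarity assumption each loss lies in \([-1, 1]\), and a value near \(-1\) means the generated images match the instance image or the scene text closely.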
- **Total Objective Function**