Abstract: Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.

What problem does this paper attempt to address?

This paper addresses how to stay aligned with the input text prompt while preserving subject identity in personalized text-to-image generation. Current methods struggle to balance the two goals: some represent the subject with a single text token, which limits expressiveness, while others use richer representations but disrupt the model's prior, which weakens prompt alignment.

To address this, the paper proposes the **Nested Attention mechanism**. Nested attention layers learn to select the relevant subject features for each region of the generated image, producing query-dependent subject values that give every region a local, highly expressive representation. The method improves identity preservation while keeping the model's prior intact, which makes it possible to combine multiple personalized subjects in a single image.

### Main contributions

1. **Nested Attention mechanism**: Rich, expressive image representations are injected into the model's existing cross-attention layers, improving the trade-off between identity preservation and text-prompt alignment.
2. **Localized expression**: The nested attention layers generate query-dependent subject values, so semantic visual elements (such as the eyes or the mouth) are encoded in a distributed way during generation, instead of the entire subject appearance being compressed into a single token.
3. **Multi-domain applicability**: The method is not limited to human faces; it can also be applied to non-human domains, does not require a specialized dataset, and can be trained on a small one.
### Formula representation

The output of the nested attention layer is:

\[ v^*_{q_{ij}}=\text{softmax}\left(\frac{q_{ij}\tilde{K}^T}{\sqrt{d}}\right)\tilde{V} \]

where:

- \( q_{ij} \) is the query vector at spatial location (i, j) in the outer cross-attention layer.
- \( \tilde{K} \) and \( \tilde{V} \) are the keys and values of the nested attention layer, parameterized by the linear projections \( W_{\tilde{K}} \) and \( W_{\tilde{V}} \).
- \( v^*_{q_{ij}} \) is the query-dependent value of the personalized token \( s^* \) at spatial index (i, j).

These query-dependent values are then used in the outer cross-attention layer:

\[ \phi^\ell_{\text{out}}(z_t)_{ij}=\text{softmax}\left(\frac{q_{ij}K^T}{\sqrt{d}}\right)V_{q_{ij}} \]

where:

\[ V_{q_{ij}}[s]= \begin{cases} v^*_{q_{ij}}, & \text{if } s = s^*\\ V[s], & \text{otherwise} \end{cases} \]

In this way, the model still binds all subject features to a single prompt token while keeping a rich multi-token representation of the subject.

### Summary

The nested attention mechanism proposed in this paper targets the trade-off between identity preservation and text-prompt alignment in personalized text-to-image generation, offering a better balance between the two and a more expressive subject representation.
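The two formulas in the "Formula representation" section can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function name `nested_cross_attention`, the argument `subject_feats` (assumed here to be a set of image-encoder features of the subject that \( W_{\tilde{K}} \) and \( W_{\tilde{V}} \) project), and all shapes are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def nested_cross_attention(Q, K, V, subject_feats, W_k_nested, W_v_nested, s_star):
    """Single-head sketch of cross-attention with a nested attention layer.

    Q:             (N, d) queries, one per spatial location (i, j), flattened
    K, V:          (T, d) keys and values of the T prompt tokens
    subject_feats: (M, c) subject features (assumed input to the nested layer)
    W_k_nested:    (c, d) projection producing the nested keys
    W_v_nested:    (c, d) projection producing the nested values
    s_star:        index of the personalized token s* in the prompt
    """
    N, d = Q.shape

    # Nested keys and values from the (assumed) subject features.
    K_nested = subject_feats @ W_k_nested  # (M, d)
    V_nested = subject_feats @ W_v_nested  # (M, d)

    # Nested attention: one subject value v*_{q_ij} per query.
    nested_weights = softmax(Q @ K_nested.T / np.sqrt(d))  # (N, M)
    v_star = nested_weights @ V_nested  # (N, d)

    # Outer cross-attention weights over the prompt tokens.
    outer_weights = softmax(Q @ K.T / np.sqrt(d))  # (N, T)

    # Per-query value matrix V_q: identical to V except at row s*.
    V_q = np.broadcast_to(V, (N,) + V.shape).copy()  # (N, T, d)
    V_q[:, s_star, :] = v_star

    # Weighted sum over tokens, separately for each query.
    out = np.einsum("nt,ntd->nd", outer_weights, V_q)  # (N, d)
    return out, v_star
```

Materializing `V_q` as an `(N, T, d)` array mirrors the case definition of \( V_{q_{ij}} \) directly; because only the row for \( s^* \) differs per query, the same result can be obtained more cheaply as the standard attention output plus a per-query correction weighted by the attention on \( s^* \).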

Nested Attention: Semantic-aware Attention Values for Concept Personalization