Abstract: Textual Inversion remains a popular method for personalizing diffusion models, in order to teach models new subjects and styles. We note that textual inversion has been underexplored using alternatives to the UNet, and experiment with textual inversion with a vision transformer. We also seek to optimize textual inversion using a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find the use of the bonus token improves adherence to the source images and the use of the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.

What problem does this paper attempt to address?

This paper addresses limitations of Textual Inversion for personalizing diffusion models, in particular its dependence on specific architectures such as the UNet. It focuses on three points:

1. **Expand the application scope of Textual Inversion**
   - **Support for non-UNet architectures**: Most existing Textual Inversion methods rely on the UNet architecture. This paper applies Textual Inversion to other architectures, in particular the Vision Transformer, so that the technique can be used across a wider range of models.
2. **Optimize the effect of Textual Inversion**
   - **Introduce BRAT (Bonus oRthogonAl Token)**: To improve the results of Textual Inversion, the authors introduce a new token strategy, BRAT. It adds auxiliary pseudo-words and forces their embeddings to be orthogonal to the original embeddings, so that the tokens capture different aspects of the subject.
   - **Improve consistency**: BRAT is reported to improve adherence to the source images, and also adherence to the prompt and the human preference score.
3. **Reduce dependence on specific architectures**
   - **Model-independent improvement**: Many improvements to Textual Inversion target the UNet architecture, so the authors look for one that does not depend on a specific denoising model. BRAT is such a method: it can be applied to different types of denoising models and is therefore more broadly applicable.

### Specific contributions of the paper

- **Apply Textual Inversion to non-UNet architectures**: The paper shows how to apply Textual Inversion to a Vision Transformer instead of the traditional UNet.
- **Propose the BRAT method**: A new token strategy that adapts to different denoising models and improves the quality of the embeddings through orthogonality constraints.
- **Verify the effect**: Experiments show that BRAT improves content and style consistency, although in some cases at the cost of some prompt similarity.

### Summary

The main goal of this paper is to broaden the applicability of Textual Inversion: the BRAT method lets it be used with more types of model architectures and improves the quality and consistency of generated images through a better embedding strategy. This reduces the existing method's dependence on specific architectures and suggests a direction for future research.
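The orthogonality constraint described above can be illustrated with a small sketch. It is not the authors' implementation: the squared-cosine form of the penalty, the `weight` value, and all function names are assumptions made here for illustration, and the paper's repository may formulate the constraint differently. The sketch only shows the general idea of adding a term to the training loss that is zero when the learned pseudo-word embeddings are mutually orthogonal.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; 0.0 means the vectors are orthogonal."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def orthogonality_penalty(primary, bonus_tokens):
    """Sum of squared cosine similarities over all pairs of learned embeddings.

    `primary` is the original pseudo-word embedding and `bonus_tokens` is a
    list of auxiliary embeddings (both hypothetical names). The penalty is 0
    when every pair is orthogonal and grows as embeddings align.
    """
    tokens = [primary] + list(bonus_tokens)
    penalty = 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            penalty += cosine(tokens[i], tokens[j]) ** 2
    return penalty


def total_loss(denoising_loss, primary, bonus_tokens, weight=0.1):
    """Denoising loss plus the weighted penalty (weight is an assumed value)."""
    return denoising_loss + weight * orthogonality_penalty(primary, bonus_tokens)
```

Because the penalty depends only on the token embeddings and not on any layer of the denoiser, a term of this kind can be attached to the loss of either a UNet or a Vision Transformer, which matches the architecture-agnostic motivation stated above. In practice the same computation would be written with a tensor library so that gradients flow to the embeddings.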

BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

Controllable Textual Inversion for Personalized Text-to-Image Generation

Gradient-Free Textual Inversion

IterInv: Iterative Inversion for Pixel-Level T2I Models

Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis

P+: Extended Textual Conditioning in Text-to-Image Generation

Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation

Language Model Inversion

Null-text Inversion for Editing Real Images using Guided Diffusion Models

Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Backdooring Textual Inversion for Concept Censorship

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication

Inversion-Based Style Transfer with Diffusion Models

Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method