BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

James Baker
2024-08-09
Abstract: Textual Inversion remains a popular method for personalizing diffusion models, in order to teach models new subjects and styles. We note that textual inversion has been underexplored using alternatives to the UNet, and experiment with textual inversion with a vision transformer. We also seek to optimize textual inversion using a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find the use of the bonus token improves adherence to the source images and the use of the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses the limitations of Textual Inversion in personalized diffusion models, especially its dependence on specific architectures such as the UNet. It focuses on three points:

1. **Expand the application scope of Textual Inversion**
   - **Support for non-UNet architectures**: Textual Inversion has been explored mostly with UNet-based denoisers. The paper applies it to another architecture, the Vision Transformer, so that the technique can be used across a wider range of models.
2. **Optimize the effect of Textual Inversion**
   - **Introduce BRAT (Bonus oRthogonAl Token)**: The author adds auxiliary pseudo-words (bonus tokens) and forces their embeddings to be orthogonal to the original embeddings, so that the tokens capture different aspects of the subject.
   - **Improve consistency**: Adding BRAT improves adherence to the source images and is reported to improve the human preference score; adherence to the prompt improves mainly through the use of the Vision Transformer.
3. **Reduce dependence on specific architectures**
   - **Model-independent improvement**: Many improvements to Textual Inversion target the UNet and its particular layers. The author instead seeks an improvement that does not depend on a specific denoising model. BRAT acts only on the token embeddings, so it can be applied to different types of denoising models.

### Specific contributions of the paper

- **Apply Textual Inversion to non-UNet architectures**: The paper shows how to apply Textual Inversion to a Vision Transformer instead of the traditional UNet.
- **Propose the BRAT method**: A new token strategy that works with different denoising models and improves the quality of the learned embeddings through an orthogonality constraint.
- **Verify the effect**: Experiments show that BRAT improves content and style consistency with the source images, although in some cases it sacrifices some prompt similarity.

### Summary

The main goal of this paper is to extend Textual Inversion to more types of model architectures and to improve the quality and consistency of generated images through a better embedding strategy, BRAT. This reduces the existing methods' dependence on specific architectures and suggests a direction for future research.
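
The orthogonality constraint on the bonus tokens can be illustrated with a small sketch. This page does not give the paper's exact loss, so everything here is an assumption made for illustration: the penalty form (sum of squared cosine similarities between every pair of token embeddings), the function names, and the `weight` parameter are all hypothetical. The penalty is zero when the original pseudo-word embedding and the bonus embeddings are mutually orthogonal, and it grows as they align.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; both must be non-zero."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(u, v) / norm


def orthogonality_penalty(tokens: Sequence[Vector]) -> float:
    """Assumed penalty: sum of squared cosine similarities over all pairs.

    `tokens` holds the original pseudo-word embedding followed by the
    bonus-token embeddings. The result is 0.0 for mutually orthogonal tokens.
    """
    penalty = 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            penalty += cosine_similarity(tokens[i], tokens[j]) ** 2
    return penalty


def total_loss(denoising_loss: float, tokens: Sequence[Vector],
               weight: float = 0.1) -> float:
    """Hypothetical combined objective: denoising loss plus weighted penalty."""
    return denoising_loss + weight * orthogonality_penalty(tokens)


if __name__ == "__main__":
    original = [1.0, 0.0, 0.0]   # embedding of the original pseudo-word
    bonus = [0.6, 0.8, 0.0]      # a bonus token, partly aligned with it
    print(orthogonality_penalty([original, bonus]))
    print(total_loss(0.5, [original, bonus]))
```

Because such a penalty depends only on the token embeddings, it needs no access to the denoiser's internal layers, which matches the architecture-agnostic goal described above. In a real training loop the same term would be computed on differentiable tensors and added to the diffusion loss.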