InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung
2023-04-07
Abstract: Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require heavy test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning. We achieve this with several major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn rich visual feature representation by introducing a few adapter layers to the pre-trained model. We train our components only on text-image pairs without using paired images of the same concept. Compared to test-time finetuning-based methods like DreamBooth and Textual-Inversion, our model can generate competitive results on unseen concepts concerning language-image alignment, image fidelity, and identity preservation while being 100 times faster.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses personalized text-to-image generation without test-time finetuning. It proposes InstantBooth, a method built on a pre-trained text-to-image model that instantly generates personalized, high-quality image variations from different text prompts.

The paper mainly addresses the following issues:

1. **Avoiding test-time finetuning**: Existing personalized image generation methods typically require extensive test-time finetuning for each new concept, which is both time-consuming and difficult to scale. InstantBooth aims to personalize generation without any test-time finetuning.
2. **Improving efficiency and scalability**: Eliminating test-time finetuning speeds up generation (the abstract reports a 100x speedup over finetuning-based methods) and reduces storage requirements, which improves the method's efficiency and scalability.
3. **Maintaining high-quality generation results**: Despite avoiding test-time finetuning, InstantBooth generates results competitive with test-time finetuning methods such as DreamBooth and Textual-Inversion in language-image alignment, image fidelity, and identity preservation.

To achieve these goals, the paper employs several key components and techniques:

- **Concept embedding learning**: A trainable image encoder converts the input images into a compact textual embedding that represents their general concept.
- **Introducing adapter layers**: To retain finer detail, a few trainable adapter layers extract rich visual feature representations from the input images and inject them into the pre-trained model, preserving identity details without losing language controllability.
- **Efficient training strategy**: The components are trained only on text-image pairs, with no paired images of the same concept, which allows the model to generalize to unseen concepts.

In summary, InstantBooth improves the efficiency of personalized text-to-image generation by replacing per-concept finetuning with a learned image encoder and adapter layers that are applied directly at inference time.
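The two components described above can be illustrated with a toy sketch in plain Python. It is not the paper's implementation: the summary says only that an image encoder produces a textual token and that adapter layers inject visual features. The linear encoder, the `<v>` placeholder token, the attention over visual tokens, and the `tanh` gate are all assumptions chosen to keep the example small, and every function name is hypothetical.

```python
import math


def encode_concept(image_features, weights):
    """Hypothetical stand-in for the learnable image encoder:
    a single linear map from a pooled image feature to one token embedding."""
    return [sum(w * f for w, f in zip(row, image_features)) for row in weights]


def inject_concept_token(prompt_tokens, token_embeddings, concept_embedding,
                         placeholder="<v>"):
    """Swap the placeholder token's embedding for the concept embedding.
    The "<v>" placeholder is an assumption made for this sketch."""
    return [
        concept_embedding if tok == placeholder else emb
        for tok, emb in zip(prompt_tokens, token_embeddings)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_adapter(hidden, visual_tokens, gate):
    """Assumed adapter form: each hidden vector attends over the visual tokens,
    and the attended context is added back through a tanh gate.
    With gate == 0.0 the pre-trained model's hidden states pass through unchanged."""
    dim = len(hidden[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for h in hidden:
        scores = [scale * sum(a * b for a, b in zip(h, v)) for v in visual_tokens]
        attn = softmax(scores)
        context = [sum(a * v[i] for a, v in zip(attn, visual_tokens))
                   for i in range(dim)]
        out.append([hi + math.tanh(gate) * ci for hi, ci in zip(h, context)])
    return out


if __name__ == "__main__":
    # Toy 2-dimensional example with made-up numbers.
    concept = encode_concept([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
    prompt = ["a", "<v>", "dog"]
    embeddings = inject_concept_token(prompt, [[0.0, 0.0]] * 3, concept)
    features = gated_adapter(embeddings, [[0.5, 0.5], [2.0, -1.0]], gate=1.0)
    print(embeddings)
    print(features)
```

The gate is what makes this an adapter in the sense used here: initializing it at zero leaves the frozen model's behavior intact, so only the new layers and the encoder would need training on text-image pairs.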