Abstract: Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require heavy test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning. We achieve this with several major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn rich visual feature representation by introducing a few adapter layers to the pre-trained model. We train our components only on text-image pairs without using paired images of the same concept. Compared to test-time finetuning-based methods like DreamBooth and Textual-Inversion, our model can generate competitive results on unseen concepts concerning language-image alignment, image fidelity, and identity preservation while being 100 times faster.

What problem does this paper attempt to address?

The paper addresses personalized text-to-image generation without test-time finetuning. It proposes InstantBooth, a method built on a pre-trained text-to-image model that instantly generates personalized, high-quality image variations from different text prompts. The paper mainly addresses the following issues:

1. **Avoiding test-time finetuning**: Existing personalized image generation methods typically require heavy test-time finetuning for each new concept, which is both time-consuming and difficult to scale. InstantBooth aims to personalize generation without any test-time finetuning.
2. **Improving efficiency and scalability**: Eliminating test-time finetuning speeds up generation (the abstract reports a 100-fold speedup over finetuning-based methods) and reduces storage requirements, which makes the method easier to scale.
3. **Maintaining high-quality generation results**: Despite avoiding test-time finetuning, InstantBooth generates results that are competitive with finetuning-based methods such as DreamBooth and Textual-Inversion in language-image alignment, image fidelity, and identity preservation.

To achieve these goals, the paper employs several key components and techniques:

- **Concept embedding learning**: A trainable image encoder converts the input images into a compact textual token that represents their general concept.
- **Introducing adapter layers**: To retain fine details, a few trainable adapter layers extract rich visual feature representations from the input images and inject them into the pre-trained model, preserving identity details without losing language controllability.
- **Efficient training strategy**: The components are trained only on text-image pairs, without paired images of the same concept, which allows the model to generalize to unseen concepts.

In summary, InstantBooth aims to make personalized text-to-image generation faster and more scalable by replacing per-concept test-time finetuning with a learned image encoder and adapter layers.
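The two learned components described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the summary does not specify layer shapes or the exact adapter form, so the zero-initialized `tanh` gate, the single-head attention, and every name here (`inject_concept_token`, `gated_adapter`, `W_proj`, the dimensions) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_concept_token(prompt_emb, placeholder_idx, image_feat, W_proj):
    """Replace one prompt token's embedding with a projected image feature.

    Hypothetical stand-in for "converting images to a textual token":
    a linear projection plays the role of the learnable image encoder head.
    """
    out = prompt_emb.copy()
    out[placeholder_idx] = image_feat @ W_proj
    return out

def gated_adapter(hidden, patch_feats, Wq, Wk, Wv, gate):
    """Assumed adapter form: hidden + tanh(gate) * Attention(hidden, patch_feats).

    With gate = 0 the layer is an identity, so a frozen pre-trained model
    behaves exactly as before until the gate is trained away from zero.
    """
    q = hidden @ Wq                      # queries from the model's hidden states
    k = patch_feats @ Wk                 # keys from image patch features
    v = patch_feats @ Wv                 # values from image patch features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return hidden + np.tanh(gate) * (attn @ v)

# Toy dimensions, all hypothetical.
d_img, d_txt, d_model = 16, 8, 8

# 1) Concept token: overwrite the placeholder at position 2 of a 5-token prompt.
prompt_emb = rng.normal(size=(5, d_txt))
image_feat = rng.normal(size=(d_img,))
W_proj = rng.normal(size=(d_img, d_txt))
cond = inject_concept_token(prompt_emb, 2, image_feat, W_proj)

# 2) Adapter: inject 4 patch features into 10 hidden positions.
hidden = rng.normal(size=(10, d_model))
patch_feats = rng.normal(size=(4, d_img))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_img, d_model))
Wv = rng.normal(size=(d_img, d_model))

out_closed = gated_adapter(hidden, patch_feats, Wq, Wk, Wv, gate=0.0)
out_open = gated_adapter(hidden, patch_feats, Wq, Wk, Wv, gate=1.0)
```

In this sketch, `out_closed` equals `hidden` because the gate is zero, while `out_open` mixes in the image patch features. Under this assumed design, only the projection, the adapter weights, and the gate would be trained, with the pre-trained model left frozen.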

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
