Abstract: Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs for all training iterations. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality.

What problem does this paper attempt to address?

The paper aims to address the issue of computational inefficiency in the adaptation of large text-to-image models to smaller datasets or new visual concepts. Specifically, it focuses on speeding up the training process of popular text-to-image personalization methods like Textual Inversion and DreamBooth, which can currently take a prohibitively long time.

### Problem Statement

The main problem addressed in the paper is the long training time required for efficient personalization (or adaptation) of text-to-image models. Despite the success of these models in generating high-quality and diverse images corresponding to user prompts, the adaptation process to new concepts or datasets is computationally expensive and time-consuming. This limits their practical applications and slows down experimentation.

### Key Findings

- **Training Dynamics:** The paper observes that most visual concepts are learned at early stages of training, and the quality does not significantly improve later. However, standard training convergence metrics fail to indicate this saturation point.
- **Deterministic Loss:** The authors propose a deterministic variant of the training loss (`Ldet`), computed on a fixed set of inputs throughout training, which better reflects the convergence of concept embeddings.
- **Early Stopping Criterion:** Based on the insights gained from analyzing the training dynamics, the authors propose a simple early stopping criterion called Deterministic Variance Evaluation.
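The early stopping idea summarized above can be illustrated with a small sketch. The summary only states that the criterion tracks the training objective on a fixed set of inputs, so the details here are assumptions made for illustration: the rule compares the variance of the most recent `window` values of the deterministic loss with the variance over the whole run and stops once the ratio falls below `threshold`. The function names, the default `window` and `threshold` values, and the toy loss trace are all hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def dvar_should_stop(det_losses, window=50, threshold=0.15):
    """Variance-ratio stopping rule (illustrative; defaults are assumptions).

    det_losses: values of the training objective evaluated on the SAME fixed
    inputs at successive training steps.
    Returns True when the variance of the last `window` values is small
    relative to the variance of all values recorded so far.
    """
    if len(det_losses) < window:
        return False  # not enough history to judge convergence
    global_var = pvariance(det_losses)
    if global_var == 0.0:
        return True  # the loss has never changed, nothing left to track
    local_var = pvariance(det_losses[-window:])
    return local_var / global_var < threshold


def first_stopping_step(det_losses, window=50, threshold=0.15):
    """Replay a recorded loss trace and return the first step that would stop."""
    for step in range(1, len(det_losses) + 1):
        if dvar_should_stop(det_losses[:step], window, threshold):
            return step
    return None  # the rule never fired on this trace


# Toy trace: a loss that decays quickly and then flattens out.
toy_trace = [0.9 ** i for i in range(200)]
stop_step = first_stopping_step(toy_trace)
```

In a real training loop, each entry of `det_losses` would come from evaluating the model on one fixed batch with fixed noise and fixed timesteps, so that changes in the value reflect changes in the model rather than sampling randomness.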

Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics

Emage: Non-Autoregressive Text-to-Image Generation

Key-Locked Rank One Editing for Text-to-Image Personalization

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models

Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Controllable Textual Inversion for Personalized Text-to-Image Generation

Multi-Concept Customization of Text-to-Image Diffusion

TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

Gradient-Free Textual Inversion

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Fast Personalized Text to Image Synthesis with Attention Injection

Prior Preserved Text-to-Image Personalization Without Image Regularization

Customization Assistant for Text-to-image Generation

TextCraftor: Your Text Encoder Can be Image Quality Controller

LCM-Lookahead for Encoder-based Text-to-Image Personalization

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights

ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance