Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics

Anton Voronov, Mikhail Khoroshikh, Artem Babenko, Max Ryabinin
2023-11-02
Abstract: Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs for all training iterations. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper aims to address the issue of computational inefficiency in the adaptation of large text-to-image models to smaller datasets or new visual concepts. Specifically, it focuses on speeding up the training process of popular text-to-image personalization methods like Textual Inversion and DreamBooth, which can currently take a prohibitively long time.

### Problem Statement

The main problem addressed in the paper is the long training time required for efficient personalization (or adaptation) of text-to-image models. Despite the success of these models in generating high-quality and diverse images corresponding to user prompts, the adaptation process to new concepts or datasets is computationally expensive and time-consuming. This limits their practical applications and slows down experimentation.

### Key Findings

- **Training Dynamics:** The paper observes that most visual concepts are learned at early stages of training, and the quality does not significantly improve later. However, standard training convergence metrics fail to indicate this saturation point.
- **Deterministic Loss:** The authors propose a deterministic variant of the training loss (`Ldet`), computed on a fixed set of inputs throughout training, which better reflects the convergence of concept embeddings.
- **Early Stopping Criterion:** Based on the insights gained from analyzing the training dynamics, the authors propose a simple early stopping criterion called Deterministic Variance Evaluation.
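
The idea behind a variance-based stopping rule on a fixed-input loss can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `VarianceEarlyStopper`, the helper `first_stop_step`, the `window` and `ratio` parameters, and the exact rule (stop once the variance of the most recent values is small relative to the variance over the whole run so far) are all assumptions made for illustration. The sketch also assumes that each value passed to `update` is the regular training loss evaluated on the same fixed inputs at every evaluation step.

```python
from collections.abc import Iterable
from statistics import pvariance


class VarianceEarlyStopper:
    """Hypothetical helper, not the paper's code.

    Tracks a loss computed on a fixed set of inputs and signals a stop
    when its recent variance is small relative to its overall variance.
    """

    def __init__(self, window: int = 5, ratio: float = 0.1) -> None:
        self.window = window    # assumed: number of recent values to inspect
        self.ratio = ratio      # assumed: threshold on recent / overall variance
        self.history: list[float] = []

    def update(self, det_loss: float) -> bool:
        """Record one fixed-input loss value; return True if training should stop."""
        self.history.append(float(det_loss))
        # Wait until there is enough history for both variance estimates.
        if len(self.history) < 2 * self.window:
            return False
        recent = pvariance(self.history[-self.window:])
        overall = pvariance(self.history)
        return recent <= self.ratio * overall


def first_stop_step(losses: Iterable[float], stopper: VarianceEarlyStopper):
    """Return the 1-based step at which the stopper fires, or None if it never does."""
    for step, loss in enumerate(losses, start=1):
        if stopper.update(loss):
            return step
    return None


if __name__ == "__main__":
    # Synthetic, made-up loss curve that decays quickly and then flattens.
    synthetic = [0.5 ** i for i in range(50)]
    print(first_stop_step(synthetic, VarianceEarlyStopper(window=5, ratio=0.1)))
```

In a real personalization run, the values fed to `update` would come from evaluating the training objective on a held-fixed batch (same images, same sampled noise and timesteps) rather than from a synthetic curve; how that batch is chosen and how often it is evaluated are left open here.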