OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization

Ye Wang, Zili Yi, Rui Ma
2024-03-17
Abstract: Personalized text-to-image (T2I) models not only produce lifelike and varied visuals but also allow users to tailor the images to fit their personal taste. These personalization techniques can grasp the essence of a concept through a collection of images, or adjust a pre-trained text-to-image model with a specific image input for subject-driven or attribute-aware guidance. Yet accurately capturing the distinct visual attributes of an individual image poses a challenge for these methods. To address this issue, we introduce OSTAF, a novel parameter-efficient one-shot fine-tuning method that uses only one reference image for T2I personalization. A novel hypernetwork-powered, attribute-focused fine-tuning mechanism is employed to achieve precise learning of various attribute features (e.g., appearance, shape, or drawing style) from the reference image. Compared with existing image customization methods, our method shows significant superiority in attribute identification and application, and achieves a good balance between efficiency and output quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to accurately capture and apply specific visual attributes (such as appearance, shape, or style) from a single reference image in personalized text-to-image (T2I) generation. Existing personalization techniques can extract the essence of a concept from a set of images, or adjust a pre-trained T2I model with a specific image input for subject-driven or attribute-aware guidance, but they struggle to capture the distinct visual attributes of a single image. To address this, the authors propose OSTAF (One-Shot Tuning for Attribute-Focused T2I Personalization), a parameter-efficient one-shot fine-tuning method that personalizes a T2I model with only one reference image. The method introduces a lightweight hypernetwork that adjusts the weights of the U-Net encoder or decoder, which enables precise learning of individual attribute features. Compared with existing methods, OSTAF shows significant advantages in attribute identification and application, and achieves a good balance between efficiency and output quality.
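
The idea of a lightweight hypernetwork that adjusts frozen U-Net weights can be illustrated with a toy sketch. This is not the authors' implementation: the per-channel scale modulation, the layer names, the embedding sizes, and the helpers `HyperNet`, `modulated_weight`, and `select_layers` are all assumptions made for illustration, and no training loop is shown. The sketch only demonstrates the general pattern in which the pre-trained weights stay frozen, a small network produces the adjustment, and the adjustment is applied to either the encoder or the decoder.

```python
import numpy as np


class HyperNet:
    """Toy hypernetwork (hypothetical): maps a layer embedding to per-channel scale offsets."""

    def __init__(self, embed_dim, hidden_dim, out_channels, rng):
        self.w1 = rng.normal(0.0, 0.1, (embed_dim, hidden_dim))
        # Zero-initialised output layer, so tuning starts from the pre-trained model.
        self.w2 = np.zeros((hidden_dim, out_channels))

    def __call__(self, layer_embedding):
        hidden = np.tanh(layer_embedding @ self.w1)
        return hidden @ self.w2  # shape: (out_channels,)


def modulated_weight(frozen_weight, scale_offset):
    """Scale each output row of a frozen (out, in) weight by (1 + offset)."""
    return frozen_weight * (1.0 + scale_offset[:, None])


def select_layers(layer_names, target):
    """Pick the layers to adjust: either the 'encoder' or the 'decoder' half."""
    if target not in ("encoder", "decoder"):
        raise ValueError("target must be 'encoder' or 'decoder'")
    return [name for name in layer_names if name.startswith(target + ".")]


# Usage with made-up layer names and shapes.
rng = np.random.default_rng(0)
layer_names = ["encoder.block0", "encoder.block1", "decoder.block0"]
frozen = {name: rng.normal(size=(4, 3)) for name in layer_names}
embeddings = {name: rng.normal(size=8) for name in layer_names}
hyper = HyperNet(embed_dim=8, hidden_dim=16, out_channels=4, rng=rng)

tuned = select_layers(layer_names, "decoder")
effective = {
    name: modulated_weight(frozen[name], hyper(embeddings[name]))
    if name in tuned
    else frozen[name]
    for name in layer_names
}
```

In this sketch only `hyper.w1` and `hyper.w2` would be trainable, which is what keeps the parameter count small relative to the frozen U-Net weights.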