U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation

You Wu,Kean Liu,Xiaoyue Mi,Fan Tang,Juan Cao,Jintao Li
2024-03-29
Abstract: Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted on semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the difficulty existing personalization methods have in extracting and decoupling fine-grained visual attributes (such as style, layout, and texture). Specifically, it proposes a new personalization setting, User-Specified Visual Appearance Personalization (U-VAP), which learns specific visual attributes from a limited set of reference images and applies them to the generation of new concepts.

### Main Problem Background

1. **Limitations of Existing Methods**:
   - Existing personalization methods (such as DreamBooth) tend to overfit the entire subject when handling visual attributes and fail to decouple visual features in pixel space.
   - The lack of explicit external supervision makes it difficult to separate specific visual attributes.
   - Recent work has made some progress in decoupling content from style but still cannot separate attributes precisely.
2. **User Needs**:
   - Users want precise control over specific visual attributes, such as color and texture, when generating images.
   - Users need a flexible method that can combine these visual attributes with new concepts to generate high-quality images.

### Solution

1. **User-Specified Visual Appearance Personalization (U-VAP)**:
   - Allows users to select the desired visual attributes through text instructions.
   - Pre-trains an initial personalization model (based on DreamBooth), then generates target-related and non-target samples through a decoupled self-augmentation strategy to learn the user-specified visual attributes.
2. **Decoupled Self-Augmentation Strategy**:
   - Uses a large language model (LLM) to generate descriptions of target and non-target attributes.
   - Generates candidate images from these descriptions and sorts them into target and non-target attribute sets through a data curation module.
   - Optimizes target and non-target identifiers during the training phase to further enhance the decoupling of target attributes.
3. **Semantic Adjustment**:
   - During the inference phase, adjusts semantic embeddings to further eliminate non-target attributes from the generated results.

### Experimental Validation

1. **Quantitative Evaluation**:
   - Uses CLIP-based metrics and Inception Score to evaluate the prompt fidelity, image fidelity, and generation quality of the generated images.
   - Experimental results show that U-VAP achieves the highest CLIP-T and Inception Score when learning colors and patterns.
2. **Qualitative Evaluation**:
   - Compares with existing methods (such as DreamBooth, ProSpect, and GPT-4V) to demonstrate the advantages of U-VAP in generating high-quality images and precisely controlling specific visual attributes.

### Summary

This paper addresses the limitations of existing personalization methods in extracting and decoupling fine-grained visual attributes by proposing the U-VAP method. U-VAP allows users to precisely control specific visual attributes through text instructions and to flexibly combine these attributes with new concepts to generate high-quality images. Experimental results show that U-VAP performs well across a range of visual attributes, with improved controllability and flexibility.
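
The data curation step in the solution above, which sorts generated candidates into a target set and a non-target set, might look roughly like the sketch below. The summary does not say how candidates are scored, so the `Candidate` fields, the `curate` function, and both thresholds are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A generated image with hypothetical attribute scores in [0, 1]."""
    image_id: str
    target_score: float      # how strongly the user-specified attribute appears
    non_target_score: float  # how strongly unrelated attributes appear


def curate(candidates, keep_threshold=0.7, reject_threshold=0.3):
    """Split candidates into (target_set, non_target_set).

    The scoring scheme and thresholds are assumptions, not the paper's rule.
    Candidates that are ambiguous on either score are dropped.
    """
    target_set, non_target_set = [], []
    for c in candidates:
        if c.target_score >= keep_threshold and c.non_target_score <= reject_threshold:
            target_set.append(c.image_id)
        elif c.non_target_score >= keep_threshold and c.target_score <= reject_threshold:
            non_target_set.append(c.image_id)
    return target_set, non_target_set


candidates = [
    Candidate("img_0", target_score=0.9, non_target_score=0.1),
    Candidate("img_1", target_score=0.2, non_target_score=0.8),
    Candidate("img_2", target_score=0.6, non_target_score=0.5),  # ambiguous
]
target_set, non_target_set = curate(candidates)
print(target_set, non_target_set)
```

Dropping ambiguous candidates rather than forcing them into one set keeps the two training sets cleanly separated, which is the point of a decoupled augmentation.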
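
The inference-time semantic adjustment can be illustrated with a small vector example. The summary does not give the exact formula, so the projection-removal rule below is an assumption chosen for clarity, not the paper's method; `adjust_embedding` and `strength` are hypothetical names, and the short lists stand in for real text-encoder embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adjust_embedding(target, non_target, strength=1.0):
    """Remove from `target` its component along the `non_target` direction.

    Assumed rule: target - strength * proj_{non_target}(target).
    With strength=1.0 the result is orthogonal to `non_target`.
    """
    norm_sq = dot(non_target, non_target)
    if norm_sq == 0.0:
        return list(target)  # nothing to remove
    scale = strength * dot(target, non_target) / norm_sq
    return [t - scale * n for t, n in zip(target, non_target)]


target = [1.0, 1.0, 0.0]      # placeholder for the learned target embedding
non_target = [0.0, 2.0, 0.0]  # placeholder for the learned non-target embedding
adjusted = adjust_embedding(target, non_target)
print(adjusted)
```

In this toy case the second coordinate, the only one shared with the non-target direction, is zeroed while the rest of the target embedding is left unchanged.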
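
For the quantitative evaluation, CLIP-T is commonly computed as the cosine similarity between the CLIP embedding of the prompt and the CLIP embedding of each generated image, averaged over the images. The sketch below shows only that arithmetic: the vectors are placeholders rather than real CLIP outputs, and `clip_t` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_t(prompt_embedding, image_embeddings):
    """Mean prompt-to-image cosine similarity over a batch of generated images."""
    scores = [cosine_similarity(prompt_embedding, e) for e in image_embeddings]
    return sum(scores) / len(scores)


prompt_embedding = [1.0, 0.0]                # placeholder text embedding
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # placeholder image embeddings
score = clip_t(prompt_embedding, image_embeddings)
print(score)
```

A higher score means the generated images follow the prompt more closely. The image-fidelity counterpart uses the same similarity between generated and reference image embeddings.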