Abstract: Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted on semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the difficulty that existing personalization methods have in extracting and decoupling fine-grained visual attributes (such as style, layout, and texture). Specifically, it proposes a new personalization setting, User-specified Visual Appearance Personalization (U-VAP), which learns specific visual attributes from a limited set of reference images and applies them when generating new concepts.

### Main Problem Background

1. **Limitations of Existing Methods**:
   - Existing personalization methods (such as DreamBooth) tend to overfit the entire subject when handling visual attributes, failing to decouple visual characteristics in pixel space.
   - The lack of explicit external supervision makes it difficult to separate specific visual attributes.
   - Recent works have made some progress in decoupling content and style, but they still cannot separate individual attributes precisely.
2. **User Needs**:
   - Users want to precisely control specific visual attributes, such as color and texture, when generating images.
   - Users need a flexible method that can combine these specific visual attributes with new concepts to generate high-quality images.

### Solution

1. **User-specified Visual Appearance Personalization (U-VAP)**:
   - Allows users to select the desired visual attributes through a text instruction.
   - First trains an initial personalization model (based on DreamBooth), then generates target-related and non-target samples through a decoupled self-augmentation strategy to learn the user-specified visual attributes.
2. **Decoupled Self-Augmentation Strategy**:
   - Uses a large language model (LLM) to generate descriptions of target and non-target attributes.
   - Generates candidate images from these descriptions and sorts them into target and non-target attribute sets through a data curation module.
   - Optimizes target and non-target identifiers during the training phase to further enhance the decoupling of target attributes.
3. **Semantic Adjustment**:
   - During the inference phase, adjusts the semantic embeddings, using the learned target and non-target embeddings, to further suppress non-target attributes in the generated results.

### Experimental Validation

1. **Quantitative Evaluation**:
   - Uses CLIP-based metrics and Inception Score to evaluate the prompt fidelity, image fidelity, and generation quality of the generated images.
   - Experimental results show that U-VAP achieves the highest CLIP-T and Inception Score when learning colors and patterns.
2. **Qualitative Evaluation**:
   - Compares with existing approaches (such as DreamBooth, ProSpect, and GPT-4V) to demonstrate the advantages of U-VAP in generating high-quality images and precisely controlling specific visual attributes.

### Summary

This paper addresses the limitations of existing personalization methods in extracting and decoupling fine-grained visual attributes by proposing the U-VAP method. U-VAP allows users to select specific visual attributes through text instructions and to combine these attributes flexibly with new concepts to generate high-quality images. Experimental results show that U-VAP transfers a range of visual attributes to new concepts while improving controllability and flexibility.
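The CLIP-T and Inception Score metrics named under Experimental Validation are not defined in this summary. The sketch below follows their commonly used definitions: CLIP-T as the mean cosine similarity between the CLIP embeddings of each generated image and its prompt, and Inception Score as the exponential of the mean KL divergence between per-image class probabilities and their marginal. The function names `clip_t` and `inception_score` are hypothetical, the embeddings and class probabilities are assumed to be precomputed by a CLIP model and an Inception classifier, and this is not the paper's evaluation code.

```python
import numpy as np


def clip_t(image_emb, text_emb):
    """Mean cosine similarity between paired image and prompt embeddings.

    Both inputs are assumed to be (n, d) arrays where row i of image_emb
    is the embedding of the image generated from the prompt in row i of
    text_emb.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # Normalize each row to unit length so the dot product is a cosine.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(image_emb * text_emb, axis=1)))


def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) over rows of class probabilities.

    probs is assumed to be an (n, k) array whose rows sum to 1. eps only
    guards against log(0) and is an implementation detail of this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    # Marginal class distribution p(y), averaged over all images.
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Under these definitions, a higher CLIP-T means the generated images match their prompts more closely, while a higher Inception Score means each image is classified confidently and the set as a whole covers many classes.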

U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
