Abstract: Manipulating the visual attributes of an image through a natural language description, known as text-to-image attributes manipulation (T2AM), is a challenging task. Existing approaches tend to search the whole image for the target instance indicated by the description, so they often fail to locate and manipulate the text-relevant regions accurately, and may even disturb text-irrelevant content such as texture and background. Their efficiency also needs to be improved. To tackle these issues, we introduce a novel yet simple GAN-based approach, Structuring Image for Manipulating (SIMGAN), which narrows down the optimization area from external to internal. It consists of two major components: 1) External Structuring (ExST), a pretrained segmentation network that recognizes and separates the target instances and the background in an image; and 2) Internal Structuring (InST), which seeks out and edits the text-relevant attributes of the target instances based on the given description and the masked hierarchical image representations from ExST. Specifically, InST structures target instances from outline to detail: it first draws the sketch and color underpainting of the instances with Outline-Oriented Structuring (OuST), and then enhances the text-relevant attributes and elaborates on details with Detail-Oriented Structuring (DeST). Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Compared with the state-of-the-art method ManiGAN, our approach reduces training time by 88% and is three times faster at inference. In addition, our approach extends easily to the instance-level image-to-image translation problem, where the results show its versatility and effectiveness. The code is released at https://github.com/qikizh/SIMGAN.
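The external-to-internal idea can be illustrated with a toy sketch. This is not the SIMGAN implementation: the function names, the threshold-based "segmentation", and the per-pixel `edit` callback are all assumptions chosen for illustration. It shows only the general principle that a mask from an external step restricts where an internal editing step may change the image, so that background pixels pass through untouched.

```python
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image: rows of pixel values
Mask = List[List[int]]     # 1 = target instance, 0 = background


def external_structuring(image: Image, threshold: float = 0.5) -> Mask:
    """Hypothetical stand-in for ExST.

    The paper uses a pretrained segmentation network; here a plain
    threshold plays that role so the sketch stays self-contained.
    """
    return [[1 if px > threshold else 0 for px in row] for row in image]


def internal_structuring(
    image: Image, mask: Mask, edit: Callable[[float], float]
) -> Image:
    """Hypothetical stand-in for InST.

    Applies `edit` only to pixels the mask marks as the target instance;
    every background pixel is copied through unchanged.
    """
    return [
        [edit(px) if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def manipulate(image: Image, edit: Callable[[float], float]) -> Image:
    """Run the external step first, then the masked internal step."""
    mask = external_structuring(image)
    return internal_structuring(image, mask, edit)


if __name__ == "__main__":
    toy_image = [[0.1, 0.9], [0.8, 0.2]]
    # Halve the intensity of the "instance" pixels only.
    print(manipulate(toy_image, lambda px: px * 0.5))
```

In the actual method the edit is driven by the text description and operates on masked hierarchical image representations rather than raw pixels; the sketch only mirrors the ordering of the two stages.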

OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

Tuning-Free Image Customization with Image and Text Guidance

Key-Locked Rank One Editing for Text-to-Image Personalization

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation

StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

Prior Preserved Text-to-Image Personalization Without Image Regularization

Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

IntentTuner: An Interactive Framework for Integrating Human Intents in Fine-tuning Text-to-Image Generative Models

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization