Abstract: Manipulating the visual attributes of an image through a natural language description, known as text-to-image attributes manipulation (T2AM), is a challenging task. Existing approaches tend to search the whole image for the target instance indicated by the description, so they often fail to locate and manipulate the text-relevant regions accurately, and may even disturb text-irrelevant content, e.g., texture and background. Their efficiency also needs improvement. To tackle these issues, we introduce a novel yet simple GAN-based approach, Structuring Image for Manipulating (SIMGAN), which narrows down the optimization areas from external to internal. It consists of two major components: 1) External Structuring (ExST), a pretrained segmentation network that recognizes and separates the target instances and the background of an image; and 2) Internal Structuring (InST), which seeks out and edits the text-relevant attributes of the target instances based on the given description and the masked hierarchical image representations from ExST. Specifically, InST structures target instances from outline to detail: it first draws the sketch and color underpainting of the instances with Outline-Oriented Structuring (OuST), and then enhances the text-relevant attributes and elaborates on details with Detail-Oriented Structuring (DeST). Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Compared with the state-of-the-art method ManiGAN, our approach reduces training time by 88%, and inference is three times faster. In addition, our approach is easily extended to the instance-level image-to-image translation problem, and the results demonstrate its versatility and effectiveness. The code is released at https://github.com/qikizh/SIMGAN .
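The idea of restricting edits to the segmented instance while leaving the background untouched can be illustrated with a simple mask-based blend. This is only a minimal sketch under stated assumptions: the function name `composite`, the toy 3x3 single-channel "images", and the per-pixel blending rule are hypothetical and are not taken from the SIMGAN code base, which operates on masked hierarchical feature representations rather than raw pixels.

```python
# Hypothetical illustration: the names and the blending rule below are
# assumptions made for this sketch, not the authors' implementation.

def composite(original, edited, mask):
    """Per-pixel blend: keep `original` where mask is 0, take `edited` where mask is 1."""
    return [
        [m * e + (1 - m) * o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(original, edited, mask)
    ]

# Toy single-channel 3x3 "images"; the center pixel stands in for the target instance.
original = [[10, 10, 10],
            [10, 50, 10],
            [10, 10, 10]]
edited = [[99, 99, 99],
          [99, 200, 99],
          [99, 99, 99]]
# Binary mask such as a segmentation network might produce (1 = target instance).
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]

result = composite(original, edited, mask)
```

In this toy setting only the masked center pixel takes the edited value, while every background pixel keeps its original value, which mirrors the goal of not disturbing text-irrelevant content.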

Text-Guided Human Image Manipulation Via Image-Text Shared Space

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

Text Guided Person Image Synthesis

Action-based image editing guided by human instructions

ManiTrans: Entity-Level Text-Guided Image Manipulation Via Token-wise Semantic Alignment and Generation

Towards Arbitrary Text-driven Image Manipulation Via Space Alignment

Entity-Level Text-Guided Image Manipulation

Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

Text-Guided Mask-free Local Image Retouching

Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

Text Guided Image Editing with Automatic Concept Locating and Forgetting

Text-Driven Image Editing via Learnable Regions

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

Text-guided Eyeglasses Manipulation with Spatial Constraints

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Mask-guided GAN for robust text editing in the scene