Abstract: Manipulating the visual attributes of an image through a natural language description, known as text-to-image attributes manipulation (T2AM), is a challenging task. Existing approaches tend to search the whole image to manipulate the target instance indicated by a description; as a result, they often fail to locate and manipulate the correct text-relevant regions, and can even disturb text-irrelevant content such as texture and background. Model efficiency also needs to be improved. To tackle these issues, we introduce a novel yet simple GAN-based approach, Structuring Image for Manipulating (SIMGAN), which narrows down the optimization area from external to internal. It consists of two major components: 1) External Structuring (ExST), a pretrained segmentation network that recognizes and separates the target instances and the background of an image; and 2) Internal Structuring (InST), which seeks out and edits the text-relevant attributes of the target instances based on the given description and the masked hierarchical image representations from ExST. Specifically, InST structures target instances from outline to detail: it first draws the sketch and color underpainting of the instances with Outline-Oriented Structuring (OuST), and then enhances the text-relevant attributes and elaborates on details with Detail-Oriented Structuring (DeST). Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Compared with the state-of-the-art method ManiGAN, our approach reduces training time by 88%, and inference is three times faster. In addition, our approach is easily extended to the instance-level image-to-image translation problem, and the results exhibit its versatility and effectiveness. The code is released at https://github.com/qikizh/SIMGAN .
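
The idea of restricting edits to a segmented instance while leaving the background untouched can be illustrated with a simple mask-based composite. This is only a minimal sketch under stated assumptions, not the SIMGAN implementation: the function name `composite_with_mask`, the binary mask (standing in for the output of a segmentation network such as ExST), and the toy arrays are all hypothetical and chosen for illustration.

```python
import numpy as np


def composite_with_mask(original, edited, mask):
    """Keep `edited` pixels inside the mask and `original` pixels outside it.

    Hypothetical helper for illustration only. `original` and `edited` are
    H x W x C arrays; `mask` is H x W (or H x W x 1) with values in [0, 1],
    where 1 marks the target instance and 0 marks the background.
    """
    original = np.asarray(original, dtype=np.float32)
    edited = np.asarray(edited, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)
    if mask.ndim == original.ndim - 1:
        mask = mask[..., None]  # add a channel axis so the mask broadcasts
    return mask * edited + (1.0 - mask) * original


# Toy data: a black "original", a white "edited" image, and a 2 x 2 instance mask.
original = np.zeros((4, 4, 3), dtype=np.float32)
edited = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0  # assumed instance region

result = composite_with_mask(original, edited, mask)
```

In this toy setting only the four masked pixels take values from `edited`; every background pixel is copied from `original` unchanged, which is the property the abstract refers to as not disturbing text-irrelevant content.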

Background Layout Generation and Object Knowledge Transfer for Text-to-Image Generation

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Diversified text-to-image generation via deep mutual information estimation

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Improving text-to-image generation with object layout guidance

Layout-Bridging Text-to-Image Synthesis

Object-driven Text-to-Image Synthesis via Adversarial Training

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

DT2I: Dense Text-to-Image Generation from Region Descriptions

A survey of generative adversarial networks and their application in text-to-image synthesis

Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs

DivCon: Divide and Conquer for Progressive Text-to-Image Generation

Text-to-Image Generation for Abstract Concepts

Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation

T2TD: Text-3D Generation Model Based on Prior Knowledge Guidance

Feature-Grounded Single-Stage Text-to-Image Generation

LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions