Abstract: Manipulating visual attributes of an image through a natural language description, known as text-to-image attributes manipulation (T2AM), is a challenging task. Existing approaches tend to search the whole image to manipulate the target instance indicated by a description, so they often fail to locate and manipulate the text-relevant regions accurately, and may even disturb text-irrelevant content, e.g., texture and background. Model efficiency also needs to be improved. To tackle these issues, we introduce a novel yet simple GAN-based approach, namely Structuring Image for Manipulating (SIMGAN), to narrow down the optimization areas from external to internal. It consists of two major components: 1) External Structuring (ExST), a pretrained segmentation network that recognizes and separates the target instances and background in an image; and 2) Internal Structuring (InST), which seeks out and edits the text-relevant attributes of the target instances based on the given description and the masked hierarchical image representations from ExST. Specifically, InST structures target instances from outline to detail by first drawing the sketch and color underpainting of the instances with an Outline-Oriented Structuring (OuST) module, and then enhancing the text-relevant attributes and elaborating on details with a Detail-Oriented Structuring (DeST) module. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Compared with the state-of-the-art method ManiGAN, our approach reduces training time by 88% and runs inference three times faster. In addition, our approach can be easily extended to the instance-level image-to-image translation problem, and the results exhibit its versatility and effectiveness. The code is released at https://github.com/qikizh/SIMGAN .
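The "external to internal" idea in the abstract, restricting edits to the segmented target instance so that the background stays untouched, can be illustrated with a toy sketch. This is not the SIMGAN implementation: the function `edit_inside_mask`, the tuple-based image representation, and the recoloring function standing in for the text-conditioned generator are all hypothetical, and the binary mask is assumed to come from some segmentation step playing the role of ExST.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]
Mask = List[List[int]]


def edit_inside_mask(
    image: Image, mask: Mask, edit_pixel: Callable[[Pixel], Pixel]
) -> Image:
    """Apply edit_pixel only where mask is 1; copy all other pixels unchanged.

    Hypothetical helper: the mask stands in for a segmentation output and
    edit_pixel stands in for a text-conditioned attribute edit.
    """
    return [
        [edit_pixel(px) if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 4x4 gray image whose central 2x2 block is the "target instance".
gray: Pixel = (0.5, 0.5, 0.5)
image: Image = [[gray] * 4 for _ in range(4)]
mask: Mask = [
    [1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)]
    for r in range(4)
]

# Stand-in edit: recolor the instance red, as if the text asked for a red object.
result = edit_inside_mask(image, mask, lambda px: (1.0, 0.0, 0.0))
```

In this sketch the background pixels are returned exactly as given, which is the property the abstract says whole-image approaches tend to violate.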

Fully Functional Image Manipulation Using Scene Graphs in A Bounding-Box Free Way

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

Semantic Image Manipulation Using Scene Graphs

Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches

Complex Scene Image Editing by Scene Graph Comprehension

Learning Object Consistency and Interaction in Image Generation from Scene Graphs

SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

Thinking Outside the BBox: Unconstrained Generative Object Compositing

Box-FaceS: A Bidirectional Method for Box-Guided Face Component Editing

Image Synthesis from Layout with Locality-Aware Mask Adaption

Exploiting Relationship for Complex-scene Image Generation

Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models

Image Generation from Scene Graph with Object Edges

Exemplar-based Generative Facial Editing

Data-Driven Object Manipulation in Images

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Generative Photomontage

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

AnyScene: Customized Image Synthesis with Composited Foreground

DeformSg2im: Scene Graph Based Multi-Instance Image Generation with a Deformable Geometric Layout