Abstract: Conditioning image generation on specific features of the desired output is a key ingredient of modern generative models. However, existing approaches lack a general and unified way of representing structural and semantic conditioning at diverse granularity levels. This paper explores a novel method to condition image generation, based on object-centric relational representations. In particular, we propose a methodology to condition the generation of objects in an image on the attributed graph representing their structure and the associated semantic information. We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process and allow for regularizing the training procedure. The proposed conditioning framework is implemented by means of a neural network that learns to generate a 2D, multi-channel, layout mask of the objects, which can be used as a soft inductive bias in the downstream generative task. To do so, we leverage both 2D and graph convolutional operators. We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation. Empirical results show that the proposed approach compares favorably against relevant baselines.

What problem does this paper attempt to address?

The paper addresses the lack of a general, unified way to represent structural and semantic conditioning information in image generation, particularly across different granularity levels. Existing methods can condition on specific features, such as the objects and relationships in a scene graph, but they usually cannot flexibly handle fine-grained attributes (such as spatial position, arrangement, or visual attributes) and often rely on fixed template structures.

To overcome these limitations, the paper proposes conditioning image generation on object-centric relational representations. An attributed graph represents the structure of the objects in an image together with their semantic information, and a neural network learns to generate a 2D multi-channel layout mask from that graph. The mask serves as a soft inductive bias in downstream generation tasks. This makes the generative process easier to manipulate and condition, and it also acts as a regularizer during training, improving the model's ability to generalize.

### Main Contributions

1. **A new conditioning method**: Image generation is conditioned on object-centric relational representations, which addresses the limitations of existing methods in handling fine-grained attributes and structural information.
2. **Implementation framework**: A concrete framework with pre-training and end-to-end training steps, so that the model can be transferred across tasks.
3. **Benchmark**: A new benchmark dataset, Pose-Representable Objects (PRO), for evaluating generative models on conditional image generation, especially their handling of the fine-grained structure and semantic information of objects.

### Method Overview

- **Input**: An attributed graph \(G\), where nodes represent key points of objects and edges represent relationships between key points.
- **Output**: A layout mask \(L\in[0, 1]^{C\times H\times W}\), which is used to condition the downstream generative model.
- **Core components**:
  - **Encoder \(\Phi_{\phi}\)**: Learns node representations using graph convolution operations.
  - **Mask generator \(\mu_{\theta}\)**: Generates a local mask \(M_{i}\) and a feature vector \(f_{i}\) for each node.
  - **Layout mask generation**: Produces the final layout mask \(L\) by aggregating the local masks and feature vectors of all nodes.

### Experimental Results

- **Synthetic dataset PRO**: Demonstrates the model's advantages in generating images with fine-grained structure and semantic information.
- **Real-world dataset Humans**: Demonstrates the model's performance in generating human body images, especially when conditioned on key points.

### Conclusion

By introducing object-centric relational representations, the proposed method addresses the limitations of existing image generation methods in conditional settings and improves the quality and flexibility of the generated images. According to the reported experiments on the PRO and Humans datasets, the approach compares favorably against relevant baselines.
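The pipeline in the Method Overview above (graph in, per-node masks and feature vectors, aggregated layout mask out) can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: the weights are random rather than learned, the Gaussian-blob mask generator is a stand-in, and the per-channel maximum used for aggregation is an assumption, since the summary does not state the aggregation rule. All function names (`encode`, `node_mask`, `layout_mask`) are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(vec, weight):
    # weight is a list of rows, one row per output unit
    return [sum(v * w for v, w in zip(vec, row)) for row in weight]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def encode(features, adjacency, weight):
    """Toy stand-in for the encoder: one mean-aggregation graph convolution with self-loops."""
    n, dim = len(features), len(features[0])
    embeddings = []
    for i in range(n):
        neigh = [j for j in range(n) if j == i or adjacency[i][j]]
        mean = [sum(features[j][k] for j in neigh) / len(neigh) for k in range(dim)]
        embeddings.append([math.tanh(z) for z in linear(mean, weight)])
    return embeddings

def node_mask(embedding, w_mask, w_feat, height, width):
    """Toy stand-in for the mask generator: a Gaussian blob M_i and a channel vector f_i."""
    cx, cy, s = (sigmoid(z) for z in linear(embedding, w_mask))
    cx, cy, sigma = cx * (width - 1), cy * (height - 1), 1.0 + 3.0 * s
    mask = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]
    feat = [sigmoid(z) for z in linear(embedding, w_feat)]
    return mask, feat

def layout_mask(features, adjacency, channels=3, height=8, width=8, hidden=4, seed=0):
    rng = random.Random(seed)  # random weights stand in for trained parameters
    w_enc = rand_matrix(hidden, len(features[0]), rng)
    w_mask = rand_matrix(3, hidden, rng)
    w_feat = rand_matrix(channels, hidden, rng)
    parts = [node_mask(h, w_mask, w_feat, height, width)
             for h in encode(features, adjacency, w_enc)]
    # Assumed aggregation: per-channel maximum of f_i[c] * M_i over all nodes,
    # which keeps every entry of L inside [0, 1].
    return [[[max(f[c] * m[y][x] for m, f in parts) for x in range(width)]
             for y in range(height)] for c in range(channels)]

# Three key points in a chain 0 - 1 - 2, each with a 2-dimensional attribute vector.
features = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
L = layout_mask(features, adjacency)
print(len(L), len(L[0]), len(L[0][0]))  # channels, height, width
```

Because each local mask and each feature entry lies in \([0, 1]\), their product and the maximum over nodes do too, so the result matches the stated range \(L\in[0, 1]^{C\times H\times W}\). A soft-OR or a normalized sum would be equally plausible aggregation choices for a sketch like this.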

Object-Centric Relational Representations for Image Generation

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

Learning Structured Output Representations from Attributes using Deep Conditional Generative Models

From Rule-Based to Learning-Based Image-Conditional Image Generation

Spatially Constrained Generative Adversarial Networks for Conditional Image Generation

Return of Unconditional Generation: A Self-supervised Representation Generation Method

3D-aware Image Generation and Editing with Multi-modal Conditions

Learning Object Consistency and Interaction in Image Generation from Scene Graphs

Affect-Conditioned Image Generation

Conditional Generation from Unconditional Diffusion Models using Denoiser Representations

Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs

A Simple Approach to Unifying Diffusion-based Conditional Generation

Ways of Conditioning Generative Adversarial Networks

Conditional Image Generation Using Feature-Matching Gan

GANs Conditioning Methods: A Survey

Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation

Multilinear Latent Conditioning for Generating Unseen Attribute Combinations

Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts

Image Generators with Conditionally-Independent Pixel Synthesis

Controllable Image Generation via Collage Representations