Transformer-based Image Generation from Scene Graphs

Renato Sortino, Simone Palazzo, Concetto Spampinato

2023-03-08

Abstract: Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image. Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation, respectively. In this work, we show how employing multi-head attention to encode the graph information, as well as using a transformer-based model in the latent space for image generation can improve the quality of the sampled data, without the need to employ adversarial models with the subsequent advantage in terms of training stability. The proposed approach, specifically, is entirely based on transformer architectures both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower dimensional space learned by a vector-quantized variational autoencoder. Our approach shows an improved image quality with respect to state-of-the-art methods as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im
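The abstract mentions that images pass through a lower dimensional space learned by a vector-quantized variational autoencoder. As a rough illustration of the quantization step only, the sketch below maps each latent vector to the index of its nearest codebook entry, which is the general idea behind vector quantization. The function names (`quantize`, `dequantize`), the toy codebook, and the two-dimensional latents are all assumptions made for this example; they are not taken from the paper or its repository.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Replace each latent vector with the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [squared_distance(z, entry) for entry in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices: List[int],
               codebook: List[Sequence[float]]) -> List[List[float]]:
    """Look the discrete indices back up in the codebook."""
    return [list(codebook[i]) for i in indices]


# Toy data (hypothetical): a 3-entry codebook and three 2-D latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

tokens = quantize(latents, codebook)
reconstructed = dequantize(tokens, codebook)
```

In a setup like the one the abstract describes, a sequence model would operate on discrete indices such as `tokens` rather than on pixels; how the authors actually train and size their codebook is not stated here.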

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the problem of generating images based on scene graphs. Specifically, the researchers propose a method entirely based on the Transformer architecture to achieve the generation of images from scene graphs. Compared to previous methods based on Graph Convolutional Networks (GCNs) and adversarial training, this method utilizes multi-head attention mechanisms to encode scene graph information and employs a Transformer-based model in the latent space to generate images. This approach improves the quality and diversity of the generated images while avoiding the instability issues associated with adversarial model training.

### Specific Goals

1. **Improve Image Quality**: Enhance the quality of generated images by using multi-head attention mechanisms and Transformer models.
2. **Enhance Diversity**: Generate diverse images given the same scene graph.
3. **Increase Flexibility**: Better handle variations in the input scene graph.
4. **Improve Training Stability**: Avoid the training instability issues brought by adversarial models.
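To make the input side of this problem concrete, the sketch below shows one plausible way to represent a scene graph and flatten it into a token sequence that an attention-based encoder could consume. This is an illustrative assumption only: the `SceneGraph` class, the `graph_to_triples` and `encode_triples` helpers, and the first-appearance vocabulary are hypothetical, and the summary above does not say how the paper itself tokenizes graphs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    """Objects are nodes; relations are (subject index, predicate, object index)."""
    objects: List[str]
    relations: List[Tuple[int, str, int]]


def graph_to_triples(graph: SceneGraph) -> List[Tuple[str, str, str]]:
    """Resolve node indices into (subject, predicate, object) name triples."""
    return [(graph.objects[s], predicate, graph.objects[o])
            for s, predicate, o in graph.relations]


def encode_triples(triples: List[Tuple[str, str, str]],
                   vocab: Dict[str, int]) -> List[int]:
    """Flatten triples into integer ids, growing the vocabulary on first use."""
    ids = []
    for triple in triples:
        for token in triple:
            # setdefault assigns the next free id only if the token is new.
            ids.append(vocab.setdefault(token, len(vocab)))
    return ids


# Toy scene (hypothetical): a sheep standing on grass, with sky above the grass.
graph = SceneGraph(
    objects=["sheep", "grass", "sky"],
    relations=[(0, "standing on", 1), (2, "above", 1)],
)

vocab: Dict[str, int] = {}
triples = graph_to_triples(graph)
token_ids = encode_triples(triples, vocab)
```

A flat sequence like `token_ids` is the kind of input a multi-head attention encoder could attend over, with each token able to look at every other object and predicate in the graph.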

High-Quality Image Generation from Scene Graphs with Transformer

Iterative Scene Graph Generation with Generative Transformers

A Novel End-to-End Transformer for Scene Graph Generation

Attention Redirection Transformer with Semantic Oriented Learning for Unbiased Scene Graph Generation

Compositional transformers for scene generation

Learning Canonical Representations for Scene Graph to Image Generation

DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation

SGTR+: End-to-end Scene Graph Generation with Transformer

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

End-to-End Video Scene Graph Generation with Temporal Propagation Transformer

BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation

Learning Visual Commonsense for Robust Scene Graph Generation

Using Scene Graph Context to Improve Image Generation

Scene Graph Generation for Better Image Captioning?

Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

Generating Triples with Adversarial Networks for Scene Graph Construction

Rethinking Image Generation from Scene Graphs with Attention Mechanism

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

One-shot Scene Graph Generation