Abstract: While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.

What problem does this paper attempt to address?

The main problem this paper addresses is factual consistency in generating natural-scene images, especially the accurate modeling of multiple objects and their relationships in complex scenes. Specifically, the paper points out the following gaps in current work on image generation:

1. **Limitations of Existing Evaluation Metrics**: Traditional evaluation metrics such as Fréchet Inception Distance (FID) and CLIPScore mainly focus on image quality, but perform poorly at capturing factual consistency in complex scenes (such as the presence of objects and the relationships between them). For example, CLIPScore may give a high score to an image that contains all relevant objects but has incorrect relationships, while SGScore can more effectively identify these nuances.
2. **Lack of Large-Scale Labeled Datasets**: Existing scene-graph datasets are small in scale and unevenly distributed, so they cannot fairly evaluate the performance of generation models on diverse and complex scenes. This limits comprehensive and fair comparison of models.

To solve these problems, the paper makes the following key contributions:

- **Scene-Bench Benchmarking Framework**: It includes a large-scale dataset named MegaSG, which contains 1 million images with scene-graph annotations, and a new evaluation metric, SGScore. SGScore uses the chain-of-thought reasoning of multimodal large language models (LLMs) to evaluate whether the objects and relationships in the scene graph appear correctly in the generated image, thereby providing a more effective measure of factual consistency.
- **Scene-Graph Feedback Mechanism**: Based on the scene-graph evaluation results, the paper designs a feedback pipeline that iteratively refines the generated image. By identifying and correcting the differences between the generated image and the input scene graph, it significantly improves the factual consistency of the generated image.

### Formula Summary

- **Scene Complexity Calculation Formula**:
  \[ C(G)=\gamma\cdot|V|+(1.0 - \gamma)\cdot|E| \]
  where \(V\) is the set of nodes (objects), \(E\) is the set of edges (relationships), and \(\gamma\in[0, 1]\) is a weighting factor.
- **Object Recall Calculation Formula**:
  \[ \text{ObjectRecall}(G, I)=\frac{|V_{\text{pred}}\cap V_{\text{gt}}|}{|V_{\text{gt}}|} \]
  where \(V_{\text{pred}}\) is the set of objects identified by the LLM, and \(V_{\text{gt}}\) is the set of ground-truth objects in the original scene graph.
- **Relation Recall Calculation Formula**:
  \[ \text{RelationRecall}(G, I)=\frac{|E_{\text{pred}}\cap E_{\text{gt}}|}{|E_{\text{gt}}|} \]
  where \(E_{\text{pred}}\) is the set of relationships predicted for the generated image, and \(E_{\text{gt}}\) is the set of ground-truth relationships in the original scene graph.
- **SGScore Comprehensive Evaluation Formula**:
  \[ \text{SGScore}(G, I)=\alpha\cdot\text{ObjectRecall}(G, I)+(1.0 - \alpha)\cdot\text{RelationRecall}(G, I) \]
  where \(\alpha\in[0, 1]\) is a hyperparameter that controls the relative importance of object recall and relation recall.

Through these methods and formulas, the paper provides a more comprehensive and effective evaluation framework, especially for complex-scene generation, and its feedback mechanism significantly improves the factual consistency of generated images.
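The formulas above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the `Triple` representation of a relationship, the default values `gamma=0.5` and `alpha=0.5`, and the handling of an empty ground-truth set are all assumptions made for illustration. In the paper, \(V_{\text{pred}}\) and \(E_{\text{pred}}\) come from a multimodal LLM; here they are simply hand-written sets.

```python
from typing import Set, Tuple

# Assumed representation: a relationship is a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


def scene_complexity(objects: Set[str], relations: Set[Triple], gamma: float = 0.5) -> float:
    """C(G) = gamma * |V| + (1 - gamma) * |E|. gamma=0.5 is an illustrative default."""
    return gamma * len(objects) + (1.0 - gamma) * len(relations)


def recall(pred: set, gt: set) -> float:
    """|pred ∩ gt| / |gt|, used for both object recall and relation recall."""
    if not gt:
        # Assumption: with nothing to recall, treat recall as perfect.
        return 1.0
    return len(pred & gt) / len(gt)


def sg_score(
    v_pred: Set[str],
    v_gt: Set[str],
    e_pred: Set[Triple],
    e_gt: Set[Triple],
    alpha: float = 0.5,
) -> float:
    """SGScore = alpha * ObjectRecall + (1 - alpha) * RelationRecall."""
    return alpha * recall(v_pred, v_gt) + (1.0 - alpha) * recall(e_pred, e_gt)


# Hypothetical example: all objects are present, but one relationship is missing.
v_gt = {"dog", "frisbee", "grass"}
e_gt = {("dog", "catching", "frisbee"), ("dog", "on", "grass")}
v_pred = {"dog", "frisbee", "grass"}
e_pred = {("dog", "on", "grass")}

print(scene_complexity(v_gt, e_gt))          # 0.5 * 3 + 0.5 * 2 = 2.5
print(recall(v_pred, v_gt))                  # 3 / 3 = 1.0
print(recall(e_pred, e_gt))                  # 1 / 2 = 0.5
print(sg_score(v_pred, v_gt, e_pred, e_gt))  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

In this example the image contains every object, so object recall is perfect, while the missing "catching" relationship halves relation recall and pulls the combined score down to 0.75. This is the kind of object-complete but relation-incorrect case that the summary says CLIPScore tends to miss.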

What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation
