Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang

2024-03-14

Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A framework. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the reliability of evaluation for text-to-image (T2I) generation models. Specifically, existing evaluation methods based on question generation and answering (QG/A) have several major problems:

1. **Question Generation (QG) stage**:
   - **Hallucination**: The generated questions may contain information that does not appear in the prompt.
   - **Repetition**: The generated questions may be duplicated, which biases the evaluation results.
   - **Omission**: The generated questions may leave out key information from the prompt.
   - **Non-atomic questions**: A generated question may bundle multiple details, making its answer difficult to interpret.
2. **Visual Question Answering (VQA) stage**:
   - **Inconsistent answers**: For the same object, the VQA model may give contradictory answers to different questions. For example, a model may first say "There is no motorcycle" and then say "The motorcycle is blue".

To address these problems, the paper proposes the Davidsonian Scene Graph (DSG) framework, which improves QG/A evaluation in the following ways:

- **Atomic questions**: Each question covers only the smallest semantic unit, so that each answer is clear and interpretable.
- **Comprehensive coverage without hallucination**: The generated questions cover all the content in the prompt and are limited to that content.
- **Unique questions**: No question is generated more than once.
- **Valid dependencies**: Questions are organized in a dependency graph, so a follow-up question is skipped when the answer to the question it depends on makes it invalid (for example, the color of the motorcycle is not asked about once the model has answered that there is no motorcycle).

Through these improvements, DSG aims to make QG/A evaluation more reliable and accurate, and thereby to better evaluate the performance of text-to-image generation models.
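The dependency idea described above can be illustrated with a small Python sketch. This is not the authors' implementation: the `Question` class, the `score_with_dependencies` function, and the rule that a skipped question counts as unsatisfied are assumptions made for illustration, and the VQA answers are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One atomic yes/no question (hypothetical structure, for illustration)."""
    qid: int
    text: str
    parents: tuple = ()  # ids of questions this one depends on


def score_with_dependencies(questions, vqa_answers):
    """Score an image from raw VQA yes/no answers.

    A question counts as satisfied only if the VQA answer is "yes" AND every
    question it depends on is satisfied. Assumes the dependencies form a
    directed acyclic graph (a cycle would recurse forever).
    """
    by_id = {q.qid: q for q in questions}
    resolved = {}

    def resolve(qid):
        if qid not in resolved:
            parents_ok = all(resolve(p) for p in by_id[qid].parents)
            # Assumption: a question whose parent failed is skipped and
            # treated as unsatisfied, whatever the VQA model said about it.
            resolved[qid] = bool(parents_ok and vqa_answers[qid])
        return resolved[qid]

    for q in questions:
        resolve(q.qid)
    return sum(resolved.values()) / len(resolved), resolved


# The motorcycle example: the VQA model contradicts itself by saying there is
# no motorcycle (question 1) but that the motorcycle is blue (question 2).
questions = [
    Question(1, "Is there a motorcycle?"),
    Question(2, "Is the motorcycle blue?", parents=(1,)),
    Question(3, "Is there a person?"),
]
vqa_answers = {1: False, 2: True, 3: True}

score, resolved = score_with_dependencies(questions, vqa_answers)
print(resolved)  # question 2 is overridden to False because question 1 failed
print(score)     # 1 of 3 questions satisfied
```

Without the dependency check, the same raw answers would give a score of 2/3, because the contradictory "blue" answer would be counted as correct.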
