Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang
2024-03-14
Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A frameworks. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the reliability of evaluation for text-to-image (T2I) generation models. Specifically, existing evaluation methods based on question generation and answering (QG/A) have several problems:

1. **Question generation (QG) stage**:
   - **Hallucination**: Generated questions may ask about content that is not in the prompt.
   - **Duplication**: Generated questions may repeat one another, which biases the evaluation score.
   - **Omission**: Generated questions may miss key information in the prompt.
   - **Non-atomic questions**: A single question may bundle several details, making its answer hard to interpret.
2. **Visual question answering (VQA) stage**:
   - **Inconsistent answers**: The VQA model may give contradictory answers to different questions about the same object. For example, it may first say "There is no motorcycle" and then say "The motorcycle is blue".

To address these problems, the paper proposes the Davidsonian Scene Graph (DSG) framework, which improves QG/A evaluation in the following ways:

- **Atomic questions**: Each question covers only the smallest semantic unit, so its answer is clear and interpretable.
- **Full coverage without hallucination**: The generated questions cover all the content in the prompt and nothing beyond it.
- **Unique questions**: No question is generated more than once.
- **Valid dependencies**: Questions are organized in a dependency graph, and a question is skipped when the answer to a question it depends on makes it invalid. For example, "Is the motorcycle blue?" is not asked if "Is there a motorcycle?" was answered "no".

Through these improvements, DSG aims to make QG/A evaluation more reliable and therefore a better measure of how faithfully text-to-image models follow their prompts.
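The dependency idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `score_with_dependencies`, the `parents` mapping, and the lookup-table stand-in for a VQA model are all hypothetical. It also assumes that parent questions have smaller IDs than their children, and that a skipped question counts as incorrect in the final score.

```python
from typing import Callable, Dict, List


def score_with_dependencies(
    questions: Dict[int, str],
    parents: Dict[int, List[int]],
    answer_fn: Callable[[str], bool],
) -> Dict[int, bool]:
    """Answer questions in ID order, skipping any whose parent was answered 'no'.

    Assumption: every parent ID is smaller than the IDs of its children,
    so iterating in sorted order visits parents first.
    """
    results: Dict[int, bool] = {}
    for qid in sorted(questions):
        # A question is valid only if all of its parents were answered 'yes'.
        if all(results.get(p, False) for p in parents.get(qid, [])):
            results[qid] = answer_fn(questions[qid])
        else:
            # Skipped: never sent to the VQA model, counted as incorrect here.
            results[qid] = False
    return results


# Hypothetical questions for a prompt such as "a blue motorcycle next to a dog".
questions = {
    1: "Is there a motorcycle?",
    2: "Is the motorcycle blue?",
    3: "Is there a dog?",
}
parents = {2: [1]}  # question 2 depends on question 1

# Stand-in for a VQA model that answers inconsistently:
# it denies the motorcycle exists but would still call it blue.
fake_vqa = {
    "Is there a motorcycle?": False,
    "Is the motorcycle blue?": True,
    "Is there a dog?": True,
}

results = score_with_dependencies(questions, parents, fake_vqa.__getitem__)
score = sum(results.values()) / len(results)
```

Here the contradictory "yes" for question 2 never reaches the score, because question 2 is skipped once question 1 is answered "no". Only the dog question counts as correct, giving a score of 1/3 under this sketch's assumptions.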