Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering

Youngsun Lim, Hojun Choi, Hyunjung Shim
2024-10-15
Abstract: Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing text-to-image models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (rho=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate text-to-image generation models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses "image hallucination": text-to-image (TTI) generation models producing images that fail to convey factual information accurately. Such images can cause misinformation and misunderstanding in fields that require high accuracy, such as education and media. To evaluate the problem, the authors propose a new automated evaluation metric, I-HallA (Image Hallucination evaluation with Question Answering), which measures the factuality of generated images through visual question answering (VQA). They also construct a benchmark dataset, I-HallA v1.0, for evaluating the degree of image hallucination.

### Specific problem description

1. **Image hallucination phenomenon**: Images generated by existing TTI models may not faithfully reflect factual content. For example, a generated image may contain incorrect details or elements that do not conform to reality.
2. **Insufficient evaluation**: Existing evaluation methods rely mainly on the alignment between the text prompt and the generated image, and ignore important factual information in the image that the prompt does not explicitly mention.
3. **Difficulty in visual-semantic recognition**: Existing methods have difficulty determining whether the visual semantics of a generated image are accurate, especially when the text prompt is polysemous.

### Solution

To address these problems, the authors propose an evaluation framework with the following steps:

1. **Dataset construction**:
   - Collected 200 prompts and their corresponding factual images from science and history textbooks.
   - Used five TTI models to generate multiple images for each prompt, and selected the images with the most obvious hallucinations as representative samples.
2. **Dataset enhancement**:
   - Used the knowledge base and visual understanding ability of GPT-4 Omni (GPT-4o) to generate factual reasoning and a difficulty level for each prompt.
   - Ensured the accuracy and consistency of the reasoning through manual review.
3. **Evaluation metric development**:
   - Constructed 1,000 multiple-choice question-answering (QA) sets, each containing a question, five options, and a correct answer.
   - Used a VQA model to measure how accurately the generated images answer these QA sets, and computed the I-HallA score from the result.

### Experimental results

An evaluation of five recent TTI models shows that these models often fail to convey factual information accurately when generating images. The I-HallA score shows a strong Spearman correlation (ρ = 0.95) with human evaluation, which supports the validity and reliability of the metric.

In conclusion, by introducing the I-HallA evaluation metric and the I-HallA v1.0 benchmark dataset, this paper aims to promote the development of TTI models that convey factual information more accurately.