Description Enhancement of Generated Images via Automatic Visual Question Generation and Answering

Mina Huh, Aneesh Shetty
Abstract: Advances in text-to-image generation models let creators generate multiple high-fidelity images from a text description (i.e., a prompt). Yet for people with visual impairments, it is difficult to assess the content and quality of the generated images and to compare them in order to choose one. We propose a pipeline that generates rich descriptions of AI-generated images to help a broader range of users understand them. In our pipeline, we use a large language model (GPT-4) to generate visual questions, a vision-language model (BLIP-2) to extract answers, and a large language model (GPT-4) to summarize the results into a final description. We evaluate the efficacy of our pipeline in comparison with a baseline image-captioning model and human describers. To further improve the visual grounding and accuracy of the answering pipeline, we experiment with using a foundation image segmentation model as an oracle to aid visual question answering.
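The three stages named in the abstract (question generation, answer extraction, summarization) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the function `describe_image` and its three callable parameters are hypothetical names, and the callables merely stand in for whatever GPT-4 and BLIP-2 wrappers a reader supplies.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these names come from the paper.
QuestionGenerator = Callable[[str], List[str]]              # prompt -> visual questions
AnswerExtractor = Callable[[str, str], str]                 # (image path, question) -> answer
Summarizer = Callable[[str, List[Tuple[str, str]]], str]    # (prompt, QA pairs) -> description


def describe_image(
    prompt: str,
    image_path: str,
    generate_questions: QuestionGenerator,
    answer_question: AnswerExtractor,
    summarize: Summarizer,
) -> str:
    """Run the three-stage sketch for one generated image.

    Assumption: each stage is passed in as a plain callable, so the
    language model and vision-language model can be swapped or mocked.
    """
    # Stage 1: a language model proposes visual questions from the prompt.
    questions = generate_questions(prompt)
    # Stage 2: a vision-language model answers each question about the image.
    qa_pairs = [(q, answer_question(image_path, q)) for q in questions]
    # Stage 3: a language model condenses the QA pairs into one description.
    return summarize(prompt, qa_pairs)
```

Passing the models in as callables keeps the sketch independent of any particular API client, so each stage can be tested with simple stand-in functions before real models are wired in.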
Computer Science