GenAssist: Making Image Generation Accessible

Mina Huh, Yi-Hao Peng, Amy Pavel
DOI: https://doi.org/10.48550/arXiv.2307.07589
IF: 6.4588
2023-07-14
Human-Computer Interaction
Abstract: Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.
What problem does this paper attempt to address?
The paper aims to address the issues faced by blind and low vision (BLV) creators when using text-to-image generation tools. Specifically, the paper focuses on the following points:

1. **Existing Challenges**: BLV creators face numerous obstacles when creating or searching for images, including difficulties in using visual editing tools and evaluating image search results. Although existing automated descriptions (such as auto-captions and object detection) can provide some assistance, these descriptions are often insufficient for creators to fully understand the content and quality of the images.
2. **Accessibility of Text-to-Image Generation Tools**: While advanced text-to-image generation models (such as DALL-E and Stable Diffusion) can generate high-quality images based on text descriptions, these tools are not user-friendly for BLV users because they require users to visually inspect the results and iteratively refine input prompts to select satisfactory images.
3. **GenAssist System**: The paper proposes a new system called GenAssist, aimed at improving the accessibility of the text-to-image generation process. This system helps BLV creators in the following ways:
   - **Verifying if the generated images match the prompts**: Allows users to confirm whether the generated images follow the original text descriptions.
   - **Extracting visual details not specified in the prompts**: Provides information about content in the images that was not mentioned in the prompts.
   - **Comparing similarities and differences between different images**: Helps users understand the distinctions between multiple candidate images through comparative descriptions.
   - **Interactive Questioning**: Supports users in asking questions about multiple images to obtain more information.
4. **User Study**: A study involving 12 BLV creators showed that GenAssist significantly enhances their ability to understand and select generated images, and increases their satisfaction with the image generation performance.

In summary, the paper aims to improve the experience of BLV creators when using text-to-image generation tools by developing the GenAssist system, enabling them to more easily understand and select generated images.
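The abstract describes a three-stage pipeline: a large language model generates visual questions from the prompt, vision-language models answer those questions for each candidate image, and a large language model summarizes the results. The sketch below illustrates only that data flow and is not the authors' implementation. All names (`describe_candidates`, the `stub_*` functions, the image IDs) are hypothetical, and the stubs return canned values in place of real model calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures: each model is treated as a plain callable.
QuestionGenerator = Callable[[str], List[str]]          # prompt -> visual questions
AnswerExtractor = Callable[[str, str], str]             # (image_id, question) -> answer
Answers = Dict[str, Dict[str, str]]                     # image_id -> {question: answer}
Summarizer = Callable[[Answers], str]                   # answers -> text summary


def describe_candidates(
    prompt: str,
    image_ids: List[str],
    generate_questions: QuestionGenerator,
    answer_question: AnswerExtractor,
    summarize: Summarizer,
) -> Tuple[Answers, str]:
    """Run the three stages: generate questions, extract answers, summarize."""
    questions = generate_questions(prompt)
    answers: Answers = {
        image_id: {q: answer_question(image_id, q) for q in questions}
        for image_id in image_ids
    }
    return answers, summarize(answers)


# --- Stubs standing in for the language and vision-language models ---

def stub_generate_questions(prompt: str) -> List[str]:
    # Assumption: one prompt-verification question plus one detail question.
    return [f"Does the image show {prompt}?", "What is in the background?"]


_CANNED_ANSWERS = {
    ("img_1", "What is in the background?"): "a beach",
    ("img_2", "What is in the background?"): "a forest",
}


def stub_answer_question(image_id: str, question: str) -> str:
    # Canned answers; any question not listed above is answered "yes".
    return _CANNED_ANSWERS.get((image_id, question), "yes")


def stub_summarize(answers: Answers) -> str:
    # Rule-based stand-in for the summarizing model: group each question
    # by whether all candidate images gave the same answer.
    if not answers:
        return ""
    lines = []
    questions = list(next(iter(answers.values())).keys())
    for q in questions:
        per_image = {image_id: a[q] for image_id, a in answers.items()}
        distinct = set(per_image.values())
        if len(distinct) == 1:
            lines.append(f"Same for all images: {q} -> {distinct.pop()}")
        else:
            details = ", ".join(f"{i}: {v}" for i, v in per_image.items())
            lines.append(f"Differs: {q} -> {details}")
    return "\n".join(lines)


if __name__ == "__main__":
    _, summary = describe_candidates(
        "a dog on a skateboard",
        ["img_1", "img_2"],
        stub_generate_questions,
        stub_answer_question,
        stub_summarize,
    )
    print(summary)
```

Passing the three stages as callables keeps the sketch independent of any particular model or API. In a real system each stub would be replaced by a call to the corresponding model.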