Abstract: Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.

What problem does this paper attempt to address?

The paper addresses the difficulties that blind and low vision (BLV) creators face when using text-to-image generation tools. It focuses on the following points:

1. **Existing Challenges**: BLV creators face obstacles when creating or searching for images, including difficulty using visual editing tools and evaluating image search results. Existing automated descriptions (such as auto-captions and object detection) provide some assistance, but they are often insufficient for creators to fully understand the content and quality of an image.
2. **Accessibility of Text-to-Image Generation Tools**: Advanced text-to-image generation models (such as DALL-E and Stable Diffusion) can generate high-quality images from text descriptions, but they are hard for BLV users to work with because selecting a satisfactory image requires visually inspecting the results and iteratively refining the prompt.
3. **GenAssist System**: The paper proposes a new system called GenAssist, aimed at improving the accessibility of the text-to-image generation process. The system helps BLV creators in the following ways:
   - **Verifying if the generated images match the prompts**: Allows users to confirm whether the generated images follow the original text descriptions.
   - **Extracting visual details not specified in the prompts**: Provides information about content in the images that was not mentioned in the prompts.
   - **Comparing similarities and differences between different images**: Helps users understand the distinctions between multiple candidate images through comparative descriptions.
   - **Interactive Questioning**: Supports users in asking questions about multiple images to obtain more information.
4. **User Study**: A study involving 12 BLV creators showed that GenAssist significantly enhances their ability to understand and select generated images, and increases their satisfaction with the image generation performance.

In summary, the paper aims to improve the experience of BLV creators when using text-to-image generation tools by developing the GenAssist system, enabling them to more easily understand and select generated images.
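
The abstract describes a three-stage pipeline: a large language model generates visual questions, vision-language models answer them for each candidate image, and a large language model summarizes the results. The following is a minimal sketch of that flow, not the authors' implementation. Every name in it (`assess_candidates`, `split_shared_and_differing`, the `toy_*` stand-ins, and the canned answers) is hypothetical, and the models are replaced by plain functions so the example runs without any external service.

```python
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical stand-ins, not GenAssist's actual interfaces.
AnswerTable = Dict[str, Dict[str, str]]  # image id -> {question: answer}


def split_shared_and_differing(table: AnswerTable) -> Tuple[List[str], List[str]]:
    """Partition questions by whether every candidate image gave the same answer."""
    shared, differing = [], []
    questions = next(iter(table.values())).keys() if table else []
    for question in questions:
        answers = {per_image[question] for per_image in table.values()}
        (shared if len(answers) == 1 else differing).append(question)
    return shared, differing


def assess_candidates(
    prompt: str,
    image_ids: List[str],
    generate_questions: Callable[[str], List[str]],
    answer_question: Callable[[str, str], str],
    summarize: Callable[[AnswerTable], str],
) -> Tuple[AnswerTable, str]:
    """Run the three stages: questions -> per-image answers -> summary."""
    questions = generate_questions(prompt)  # stage 1: an LLM in the described system
    table = {
        image_id: {q: answer_question(image_id, q) for q in questions}  # stage 2: a VLM
        for image_id in image_ids
    }
    return table, summarize(table)  # stage 3: an LLM in the described system


# --- Toy stand-ins so the sketch runs without any model (assumptions only) ---
def toy_questions(prompt: str) -> List[str]:
    # A real question generator would derive these from the prompt.
    return ["Is there a dog?", "What color is the background?"]


CANNED_ANSWERS = {
    ("img1", "Is there a dog?"): "yes",
    ("img2", "Is there a dog?"): "yes",
    ("img1", "What color is the background?"): "blue",
    ("img2", "What color is the background?"): "green",
}


def toy_answer(image_id: str, question: str) -> str:
    return CANNED_ANSWERS[(image_id, question)]


def toy_summary(table: AnswerTable) -> str:
    shared, differing = split_shared_and_differing(table)
    return f"Same across images: {shared}. Differs: {differing}."


table, summary = assess_candidates(
    "a dog on a beach", ["img1", "img2"], toy_questions, toy_answer, toy_summary
)
print(summary)
```

Passing the three stages in as callables keeps the sketch independent of any particular model API. The per-image answer table is the piece that supports both prompt verification (answers to prompt-derived questions) and the comparison of similarities and differences between candidates.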

GenAssist: Making Image Generation Accessible

Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Description Enhancement of Generated Images via Automatic Visual Question Generation and Answering

Interactive Visual Assessment for Text-to-Image Generation Models

RetAssist: Facilitating Vocabulary Learners with Generative Images in Story Retelling Practices

Evaluating Text-to-Visual Generation with Image-to-Text Generation

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

RealtimeGen: an Intervenable AI Image Generation System for Commercial Digital Art Asset Creators

Exploring the use of Generative AI to Support Automated Just-in-Time Programming for Visual Scene Displays

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

GenQuery: Supporting Expressive Visual Search with Generative Models

AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People

User-Friendly Customized Generation with Multi-Modal Prompts

Customization Assistant for Text-to-image Generation

Context-Aware Image Descriptions for Web Accessibility

Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation