VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

Bufang Yang, Lixing He, Kaiwei Liu, Zhenyu Yan

2024-04-03

Abstract: Individuals with visual impairments, encompassing both partial and total difficulties in visual perception, are referred to as visually impaired (VI) people. An estimated 2.2 billion individuals worldwide are affected by visual impairments. Recent advancements in multi-modal large language models (MLLMs) have showcased their extraordinary capabilities across various domains. It is desirable to help VI individuals with MLLMs' strong capabilities in visual understanding and reasoning. However, it is challenging for VI people to use MLLMs because they have difficulty capturing images suitable for their daily requests. For example, the target object may be cut off or only partially captured in the image. This paper explores how to leverage MLLMs to provide visual question answering for VI individuals. The proposed system, VIAssist, can identify undesired images and suggest detailed actions for retaking them. Finally, VIAssist can provide reliable answers to users' queries based on the images. Our results show that VIAssist improves BERTScore and ROUGE by 0.21 and 0.31 over the baseline, respectively.
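To make the ROUGE figure above more concrete, here is a minimal sketch of one common variant, ROUGE-L F1, which scores a generated answer by its longest common subsequence with a reference answer. The abstract does not say which ROUGE variant or tokenization the authors used, so both choices below (ROUGE-L, lowercase whitespace tokens) are assumptions, and the function names are illustrative only.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercase whitespace tokens (an assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, `rouge_l_f1("the cat is on the mat", "the cat sat on the mat")` shares a five-token subsequence out of six tokens on each side, so the score is 5/6. Published evaluations usually rely on an established ROUGE package rather than a hand-rolled version like this one.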

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how multi-modal large language models (MLLMs) can provide visual question answering (VQA) for visually impaired (VI) individuals. Because of their limited vision, VI users have difficulty capturing high-quality images for their daily requests, and the responses that existing MLLMs generate from such low-quality images are often unreliable. The paper therefore proposes a system named VIAssist, which aims to make MLLMs more adaptable and practical for VI users in three ways:

1. **Identify low-quality images**: VIAssist checks whether the image uploaded by the user meets the requirements of the query. If the image quality is poor or the target object does not fully appear in the image, the system guides the user to retake the photo.

2. **Provide detailed retaking guidance**: For low-quality images, VIAssist not only points out the problems but also gives specific adjustment suggestions, such as changing the shooting angle, distance, or lighting, to help the user take a better photo.

3. **Generate reliable answers**: When the user uploads a high-quality image, VIAssist provides a reliable and accurate answer to the query based on that image.

Through these functions, VIAssist aims to make MLLMs more useful to VI individuals in daily life. The paper collects an instruction dataset specific to VI users and fine-tunes the model on it, and reports that VIAssist significantly outperforms existing models, particularly in judging image quality and in answer accuracy.
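The three functions above amount to a simple routing step: check the image, then either return retake guidance or answer the question. The following is a minimal sketch of that control flow, not the authors' implementation. VIAssist uses a fine-tuned MLLM, whereas here the model calls are pluggable callables, and every name (`ImageCheck`, `Reply`, `assist`, `toy_check`, `toy_answer`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImageCheck:
    """Result of the (hypothetical) image-quality stage."""
    usable: bool
    guidance: Optional[str] = None  # retake advice when the image is not usable


@dataclass
class Reply:
    kind: str  # "retake" or "answer"
    text: str


def assist(
    image: bytes,
    question: str,
    check_image: Callable[[bytes, str], ImageCheck],
    answer_question: Callable[[bytes, str], str],
) -> Reply:
    """Ask for a retake if the image is unusable; otherwise answer the query."""
    check = check_image(image, question)
    if not check.usable:
        advice = check.guidance or "Please retake the photo."
        return Reply(kind="retake", text=advice)
    return Reply(kind="answer", text=answer_question(image, question))


# Stand-in models for demonstration only; a real system would call an MLLM.
def toy_check(image: bytes, question: str) -> ImageCheck:
    if len(image) < 4:  # pretend a tiny payload is a cropped or blurry photo
        return ImageCheck(False, "Move the camera back so the whole object is in frame.")
    return ImageCheck(True)


def toy_answer(image: bytes, question: str) -> str:
    return f"Answer to: {question}"


if __name__ == "__main__":
    print(assist(b"\x00", "What is on the label?", toy_check, toy_answer))
    print(assist(b"\x00" * 16, "What is on the label?", toy_check, toy_answer))
```

Keeping the quality check and the answering step as separate callables is only a design choice for this sketch; a single fine-tuned model could plausibly produce either the guidance or the answer in one pass.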

VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models

Emerging Practices for Large Multimodal Model (LMM) Assistance for People with Visual Impairments: Implications for Design

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

InfMLLM: A Unified Framework for Visual-Language Tasks

A-VL: Adaptive Attention for Large Vision-Language Models

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models