VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

Bufang Yang, Lixing He, Kaiwei Liu, Zhenyu Yan
2024-04-03
Abstract: Individuals with visual impairments, encompassing both partial and total difficulties in visual perception, are referred to as visually impaired (VI) people. An estimated 2.2 billion individuals worldwide are affected by visual impairments. Recent advancements in multi-modal large language models (MLLMs) have showcased their extraordinary capabilities across various domains. It is desirable to help VI individuals by using MLLMs' strong capabilities in visual understanding and reasoning. However, it is challenging for VI people to use MLLMs because they have difficulty capturing images that are adequate for their daily requests. For example, the target object may be only partially captured in the image, or missing from it entirely. This paper explores how to leverage MLLMs to provide visual question answering for VI individuals. VIAssist can identify undesired images and provide detailed actions for retaking them. Finally, VIAssist can provide reliable answers to users' queries based on the images. Our results show that VIAssist provides +0.21 and +0.31 higher BERTScore and ROUGE scores than the baseline, respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how multi-modal large language models (MLLMs) can provide visual question answering (VQA) for visually impaired (VI) individuals. Because of their limited vision, VI users have difficulty capturing images that are adequate for their daily requests, and existing MLLMs often produce unreliable responses when given such low-quality images. The paper therefore proposes a system named VIAssist, which aims to make MLLMs more adaptable and practical for VI individuals in three ways:

1. **Identify low-quality images**: VIAssist checks whether the image uploaded by the user meets the requirements of the query. If the image quality is poor or the target object does not fully appear in the image, the system guides the user to retake the photo.
2. **Provide detailed retaking guidance**: For low-quality images, VIAssist not only points out the problems but also gives specific adjustment suggestions, such as changing the shooting angle, distance, or lighting, to help the user take a better photo.
3. **Generate reliable answers**: When the user uploads an adequate image, VIAssist provides a reliable answer to the user's query based on that image.

Through these functions, VIAssist aims to make MLLMs more effective for VI individuals, so that they can use these tools to solve problems in daily life. By collecting an instruction dataset specific to VI individuals and fine-tuning the model on it, the paper reports that VIAssist outperforms the baseline, with +0.21 higher BERTScore and +0.31 higher ROUGE.
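
The three functions above can be pictured as a simple decision flow: check the photo, and either return retake guidance or answer the question. The following is a minimal Python sketch of that flow, not the paper's implementation. The names `ImageAssessment`, `RETAKE_ADVICE`, `respond`, and the issue labels are all hypothetical, and the two model calls are replaced with stub functions. In VIAssist itself, a single fine-tuned MLLM may well produce the assessment, the guidance, and the answer in one response; the explicit split here is only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ImageAssessment:
    """Hypothetical result of checking a photo against the user's question."""
    usable: bool
    issues: List[str] = field(default_factory=list)


# Hypothetical issue labels mapped to retake suggestions (assumed, not from the paper).
RETAKE_ADVICE = {
    "target_cropped": "Move the camera back so the whole object is in view.",
    "too_dark": "Turn on a light or move to a brighter spot.",
    "blurry": "Hold the phone steady and retake the photo.",
}

FALLBACK_ADVICE = "Please retake the photo."


def respond(
    image: object,
    question: str,
    assess: Callable[[object, str], ImageAssessment],
    answer: Callable[[object, str], str],
) -> str:
    """Return retake guidance for an unusable photo, otherwise an answer."""
    assessment = assess(image, question)
    if not assessment.usable:
        # Steps 1 and 2: flag the photo and turn each issue into an action.
        tips = [RETAKE_ADVICE.get(issue, FALLBACK_ADVICE) for issue in assessment.issues]
        if not tips:
            tips = [FALLBACK_ADVICE]
        return "I cannot answer reliably from this photo. " + " ".join(tips)
    # Step 3: the photo is adequate, so answer the question from it.
    return answer(image, question)


# Stub "models" standing in for the MLLM calls (assumptions for this demo only).
def stub_assess(image: object, question: str) -> ImageAssessment:
    if image == "dark_photo":
        return ImageAssessment(usable=False, issues=["too_dark"])
    return ImageAssessment(usable=True)


def stub_answer(image: object, question: str) -> str:
    return "It is a can of tomato soup."


if __name__ == "__main__":
    print(respond("dark_photo", "What is this can?", stub_assess, stub_answer))
    print(respond("clear_photo", "What is this can?", stub_assess, stub_answer))
```

Keeping the assessment and answering steps behind plain callables makes it easy to swap the stubs for real model calls, or to collapse them into a single call if one model handles both.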