@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

Xin Jiang, Junwei Zheng, Ruiping Liu, Jiahang Li, Jiaming Zhang, Sven Matthiesen, Rainer Stiefelhagen
2024-09-22
Abstract: As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs a more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses the lack of a standardized benchmark for evaluating Vision-Language Models (VLMs) as multi-task Assistive Technology (AT) for People with Visual Impairments (PVIs). Specifically, the authors point out:

1. **Limitations of existing VLMs**: Although VLMs perform well in multi-task processing, they have rarely been benchmarked for assistive technology. Existing benchmarks for large language models (LLMs) and large vision-language models (LVLMs) focus on general application scenarios and ignore the specific needs of people with visual impairments.
2. **Challenges in multi-task processing**: Existing methods struggle to handle multiple tasks efficiently at the same time, and they fall short in interpreting complex scenes and providing situation-relevant information. This makes it hard to meet the needs of people with visual impairments.
3. **Importance of specific tasks**: People with visual impairments need and use certain tasks (such as text recognition and object recognition) more frequently than others. However, existing benchmarks do not fully cover these tasks.

To address these problems, the authors propose a new vision-language assistive technology benchmark (@Bench) and a multi-task generalist model (@Model) to evaluate and improve the performance of VLMs in assistive technology. By combining a user study with multi-modal tasks, @Bench and @Model aim to better meet the needs of people with visual impairments and to improve the efficiency and performance of multi-task processing.

### Main contributions

1. **User-driven design**: Through human-computer interaction studies with people with visual impairments, the five most important vision-language tasks were identified: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA).
2. **New benchmark**: A vision-language benchmark containing five representative tasks was released, covering common scenarios in the daily lives of people with visual impairments.
3. **Multi-task generalist model**: A new generalist model was proposed that performs multiple tasks with a single set of parameters, significantly reducing the number of parameters and the computational cost.
4. **Balance between efficiency and performance**: Experiments demonstrate that the model is both efficient and accurate in multi-task processing.

### Summary

This paper fills the gap in benchmarking vision-language models for assistive technology and provides a reference point and tools for future research.