Beyond the Hype: A dispassionate look at vision-language models in medical scenario

Yang Nan, Huichi Zhou, Xiaodan Xing, Guang Yang
2024-08-16
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments concentrate on evaluating VLMs with simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristics of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models' ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models' capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models' capabilities against unharmonised and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies, with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be available after the acceptance of this paper.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the inadequate evaluation of Large Vision-Language Models (LVLMs) in the medical field. Although LVLMs have performed impressively on a wide range of tasks and attracted broad attention in the AI community, their performance and reliability in specialized fields such as medicine have not been thoroughly assessed. Specifically, existing evaluations mostly rely on simple visual question answering (VQA) over multi-modality data and neglect the deeper characteristics of LVLMs in medical scenarios.

To this end, the authors propose a new benchmark, RadVUQA (Radiological Visual Understanding and Question Answering), for comprehensive evaluation of existing LVLMs. RadVUQA evaluates models along five dimensions:

1. **Anatomical Understanding**: Evaluates the model's ability to visually identify biological structures.
2. **Multimodal Comprehension**: Assesses the model's ability to interpret linguistic and visual instructions and produce the desired outcomes.
3. **Quantitative and Spatial Reasoning**: Evaluates the model's spatial awareness and its ability to combine quantitative analysis with visual and linguistic information.
4. **Physiological Knowledge**: Assesses the model's understanding of the functions and mechanisms of organs and systems.
5. **Robustness**: Evaluates the model's performance on unharmonised and synthetic data.

The study finds that both general-purpose LVLMs and medical-specific LVLMs exhibit critical deficiencies, particularly in multimodal comprehension and quantitative reasoning. These findings reveal a substantial gap between existing LVLMs and clinicians, highlighting the urgent need to develop more robust and intelligent LVLMs.
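
To make the idea of scoring a model separately on each of the five dimensions concrete, the following is a minimal sketch of a per-dimension accuracy tally. It is not the RadVUQA implementation, since the paper's code is not yet released. The record layout, the function name `per_dimension_accuracy`, the dimension identifiers, and the use of normalised exact-match scoring are all assumptions made for illustration.

```python
from collections import defaultdict

# The five dimension names follow the abstract; the identifiers themselves
# and everything below are hypothetical, not taken from the RadVUQA code.
DIMENSIONS = (
    "anatomical_understanding",
    "multimodal_comprehension",
    "quantitative_spatial_reasoning",
    "physiological_knowledge",
    "robustness",
)


def per_dimension_accuracy(records):
    """Return accuracy per dimension.

    `records` is an iterable of (dimension, predicted, reference) tuples,
    where `predicted` and `reference` are answer strings.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, reference in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        total[dimension] += 1
        # Assumption: an answer counts as correct on a case-insensitive,
        # whitespace-trimmed exact match. The benchmark may score differently.
        if predicted.strip().lower() == reference.strip().lower():
            correct[dimension] += 1
    # Report only the dimensions that actually have questions.
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}
```

Reporting one score per dimension, rather than a single pooled accuracy, is what lets a weakness in one area (for example quantitative reasoning) remain visible even when a model does well elsewhere.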