Beyond the Hype: A dispassionate look at vision-language models in medical scenario

Yang Nan, Huichi Zhou, Xiaodan Xing, Guang Yang

2024-08-16

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments concentrate too heavily on evaluating VLMs with simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristics of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models' ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models' capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models' capabilities against unharmonised and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies, with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be available after the acceptance of this paper.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses the inadequate evaluation of Large Vision-Language Models (LVLMs) in the medical field. Although LVLMs have performed impressively across many tasks and attracted wide attention in the AI community in recent years, their performance and reliability in specialized fields such as medicine have not been thoroughly assessed. Specifically, existing evaluations mostly rely on simple visual question answering (VQA) over multi-modality data and neglect the deeper characteristics of LVLMs.

To this end, the authors propose a new benchmark, RadVUQA (Radiological Visual Understanding and Question Answering), for comprehensive evaluation of existing LVLMs. RadVUQA evaluates models along five dimensions:

1. **Anatomical Understanding**: Evaluates the model's ability to visually identify biological structures.
2. **Multimodal Comprehension**: Assesses the model's ability to interpret linguistic and visual instructions and produce the desired outcome.
3. **Quantitative and Spatial Reasoning**: Evaluates the model's spatial awareness and its ability to combine quantitative analysis with visual and linguistic information.
4. **Physiological Knowledge**: Assesses the model's understanding of the functions and mechanisms of organs and systems.
5. **Robustness**: Evaluates the model's performance on unharmonised and synthetic data.

The study finds that both general-purpose LVLMs and medical-specific LVLMs exhibit critical deficiencies, with weak multimodal comprehension and quantitative reasoning. These findings reveal a large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs.
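To make the five-dimension structure concrete, here is a minimal sketch of how per-dimension results might be tallied. It is not the authors' scoring protocol, which this summary does not describe: the dimension keys, the `accuracy_by_dimension` helper, and the use of case-insensitive exact-match accuracy are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical keys for the five RadVUQA dimensions named in the summary.
DIMENSIONS = (
    "anatomical_understanding",
    "multimodal_comprehension",
    "quantitative_spatial_reasoning",
    "physiological_knowledge",
    "robustness",
)


def accuracy_by_dimension(records):
    """Return exact-match accuracy per dimension.

    `records` is an iterable of (dimension, predicted, reference) tuples.
    Exact match after lowercasing and stripping is an assumed metric,
    chosen only to keep the sketch simple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, reference in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        total[dimension] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[dimension] += 1
    # Report only dimensions that actually received questions.
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}


if __name__ == "__main__":
    sample = [
        ("anatomical_understanding", "Liver", "liver"),
        ("anatomical_understanding", "spleen", "kidney"),
        ("robustness", "A", "a"),
    ]
    print(accuracy_by_dimension(sample))
```

Reporting one score per dimension, rather than a single pooled accuracy, keeps a weakness in one area (for example quantitative reasoning) from being hidden by strength in another.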

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Effectiveness Assessment of Recent Large Vision-Language Models

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

A Survey of Medical Vision-and-Language Applications and Their Techniques

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Training Medical Large Vision-Language Models with Abnormal-Aware Feedback

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
