OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo
2024-04-21
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this paper, we introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets, including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments, we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover, what surprises us is that medical-specialized LVLMs even exhibit inferior performance to those general-domain models, calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLMs in understanding real medical images but also highlight our dataset's significance. Our code and dataset are available at
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to evaluate Large Vision-Language Models (LVLMs) in the medical field. Although existing LVLMs perform well on a range of multimodal tasks, their potential in the medical domain has not been fully explored. The paper points out that no comprehensive and diverse evaluation benchmark currently exists, in particular one covering many medical imaging modalities and human anatomical regions, which limits understanding of how LVLMs perform in practical medical applications. To address this, the authors propose OmniMedVQA, a novel, large-scale, and comprehensive medical visual question answering benchmark. The dataset includes 118,010 images from 73 different medical datasets, covering 12 modalities and more than 20 human anatomical regions. All images are sourced from real medical scenarios, so the dataset matches the needs of the medical field and is suitable for evaluating LVLM performance. In extensive experiments, the study finds that existing LVLMs perform poorly on medical visual question answering, and that medical-specific LVLMs perform even worse than general-domain models. This suggests that current models lack fundamental medical knowledge and that more versatile and robust LVLMs are needed to meet the demands of medical applications.