OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

2024-04-21

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this paper, we introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets, including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments, we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover, what surprises us is that medical-specialized LVLMs even exhibit inferior performance to those general-domain models, calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLM in understanding real medical images but also highlight our dataset's significance. Our code with dataset are available at

Image and Video Processing, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of evaluating Large Vision-Language Models (LVLMs) in the medical domain. Although existing LVLMs perform well on a range of multimodal tasks, their potential in medicine has not been fully explored. The paper points out that there is no comprehensive and diverse evaluation benchmark, in particular one covering many medical imaging modalities and human anatomical regions, and that this gap limits understanding of how LVLMs perform in practical medical applications. To address this, the authors propose OmniMedVQA, a large-scale, comprehensive medical visual question answering (VQA) benchmark. It includes 118,010 images drawn from 73 different medical datasets, covering 12 modalities and more than 20 human anatomical regions. All images come from real medical scenarios, which keeps the benchmark aligned with the needs of the medical field and suitable for evaluating LVLMs. In extensive experiments, the study finds that existing LVLMs perform poorly on medical VQA tasks, and that medical-specific LVLMs perform even worse than general-domain models. This indicates that current models lack fundamental medical knowledge and that more versatile and robust LVLMs need to be developed to meet the demands of medical applications.
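To make the comparison across modalities concrete, the following is a minimal sketch of how per-modality accuracy could be tallied for a VQA benchmark of this kind. It is not the authors' evaluation code: the summary above does not describe the scoring protocol, so case-insensitive exact matching is an assumption, and the record keys (`modality`, `answer`, `prediction`) and the toy records are hypothetical, invented only for illustration.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Return {modality: accuracy} for an iterable of result records.

    Each record is a dict with the hypothetical keys 'modality',
    'answer' (ground truth) and 'prediction' (model output).
    Scoring is case-insensitive exact match, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        modality = record["modality"]
        total[modality] += 1
        truth = record["answer"].strip().lower()
        guess = record["prediction"].strip().lower()
        if guess == truth:
            correct[modality] += 1
    return {modality: correct[modality] / total[modality] for modality in total}


# Toy records invented for illustration; these are not benchmark data.
records = [
    {"modality": "CT", "answer": "Lung", "prediction": "lung"},
    {"modality": "CT", "answer": "Liver", "prediction": "kidney"},
    {"modality": "MRI", "answer": "Brain", "prediction": "Brain "},
]
scores = accuracy_by_modality(records)
```

Grouping by modality rather than reporting a single number matches the benchmark's emphasis on coverage: a model that does well on one imaging modality may still fail on others.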

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Training Medical Large Vision-Language Models with Abnormal-Aware Feedback

Visual Question Answering in the Medical Domain

BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Large Language Model Benchmarks in Medical Tasks

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation