Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shown potential in various medical applications, such as intelligent medical diagnosis. Although impressive results have been achieved, we find that existing benchmarks reflect neither the complexity of real medical reports nor the specialized, in-depth reasoning they require. In this work, we introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization, which poses several challenges: comprehensively interpreting image content across diverse and challenging layouts, applying numerical reasoning to identify abnormal indicators, and demonstrating clinical reasoning to provide statements of disease diagnosis, status, and advice based on medical contexts. We carefully design the data generation pipeline and propose the Efficient Structural Restoration Annotation (ESRA) method, aimed at restoring textual and tabular content in medical report images. This method substantially enhances annotation efficiency, doubling the productivity of each annotator, and yields a 26.8% improvement in accuracy. We conduct extensive evaluations, including few-shot assessments of five LMMs that are capable of solving Chinese medical QA tasks. To further investigate the limitations and potential of current LMMs, we conduct comparative experiments on a set of strong LLMs using image text generated by the ESRA method. We report the performance of baselines and offer several observations: (1) The overall performance of existing LMMs is still limited; however, LMMs are more robust to low-quality and diverse-structured images than LLMs. (2) Reasoning across context and image content presents significant challenges. We hope this benchmark helps the community make progress on these challenging tasks in multi-modal medical document understanding and facilitates its application in healthcare.

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions

FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain

AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

Large Language Models for Multi-Choice Question Classification of Medical Subjects

emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Development of an Extractive Clinical Question Answering Dataset with Multi-Answer and Multi-Focus Questions

Huatuo-26M, a Large-scale Chinese Medical QA Dataset

AI-Powered Test Question Generation in Medical Education: The DailyMed Approach

Towards Expert-Level Medical Question Answering with Large Language Models

Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Survey of Multimodal Medical Question Answering

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

PubMedQA: A Dataset for Biomedical Research Question Answering

A dataset for medical instructional video classification and question answering

SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning