Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shown potential in various medical applications, such as intelligent medical diagnosis. Although impressive results have been achieved, we find that existing benchmarks do not reflect the complexity of real medical reports or the specialized in-depth reasoning they require. In this work, we introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization, which poses several challenges: comprehensively interpreting image content across diverse and challenging layouts, applying numerical reasoning to identify abnormal indicators, and demonstrating clinical reasoning to provide statements of disease diagnosis, status, and advice based on medical contexts. We carefully design the data generation pipeline and propose the Efficient Structural Restoration Annotation (ESRA) method, aimed at restoring textual and tabular content in medical report images. This method substantially enhances annotation efficiency, doubling the productivity of each annotator, and yields a 26.8% improvement in accuracy. We conduct extensive evaluations, including few-shot assessments of five LMMs that are capable of solving Chinese medical QA tasks. To further investigate the limitations and potential of current LMMs, we conduct comparative experiments on a set of strong LLMs using the image text generated by the ESRA method. We report the performance of baselines and offer several observations: (1) the overall performance of existing LMMs is still limited; (2) LMMs are nonetheless more robust to low-quality and diverse-structured images than LLMs; (3) reasoning across context and image content presents significant challenges. We hope this benchmark helps the community make progress on these challenging tasks in multi-modal medical document understanding and facilitates its application in healthcare.

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning
