M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Anand Subramanian, Viktor Schlegel, Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Vijay Prakash Dwivedi, Stefan Winkler

2024-06-06

Abstract: There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how well large language models (LLMs) recall knowledge and comprehend presented information in the clinical and biomedical domain. Although LLMs are increasingly applied in medicine, it is poorly understood to what extent, and because of which factors, they can recall relevant knowledge and combine it with the information they are given. To fill this gap, the authors conduct a large-scale empirical study using multiple-choice question answering (MCQA) and abstractive question answering (AQA), analyzing the performance of 15 LLMs on 22 datasets spanning three generalist and three specialist biomedical sub-domains. The analysis, broken down by sub-domain, source of knowledge, and model architecture, identifies success factors such as instruction tuning. The study finds that recently proposed domain-adapted models may lack adequate knowledge, whereas fine-tuning directly on the collected medical knowledge datasets yields encouraging results and even generalizes to unseen specialist sub-domains. A skill-oriented manual error analysis further reveals a significant gap between the models' ability to simply recall necessary knowledge and their ability to integrate it with the presented context. To promote research and collaboration in this field, the authors share M-QALM, their resources, standardized methodology, and evaluation results to advance clinical knowledge representation learning in language models.
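As an illustration of what breaking MCQA performance down by sub-domain can look like, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `accuracy_by_subdomain`, the record fields (`subdomain`, `gold`, `predicted`), and the toy records are all hypothetical, and exact-match accuracy on the chosen option is assumed as the metric.

```python
from collections import defaultdict


def accuracy_by_subdomain(records):
    """Return exact-match accuracy per sub-domain.

    Each record is a dict with the (hypothetical) keys
    'subdomain', 'gold' and 'predicted', where 'gold' and
    'predicted' are option labels such as "A" or "B".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        name = record["subdomain"]
        total[name] += 1
        if record["predicted"] == record["gold"]:
            correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy, made-up records purely for illustration; not results from the paper.
toy_records = [
    {"subdomain": "generalist_a", "gold": "A", "predicted": "A"},
    {"subdomain": "generalist_a", "gold": "C", "predicted": "B"},
    {"subdomain": "specialist_b", "gold": "D", "predicted": "D"},
]

scores = accuracy_by_subdomain(toy_records)
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.2f}")
```

The same grouping could be keyed on source of knowledge or model architecture instead of sub-domain; AQA outputs would need a different, text-similarity-based metric, which this sketch does not cover.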
