MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

2024-07-29

Abstract: Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks to assess medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLMs' predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by medical doctors, which can be leveraged to establish various gold-based upper bounds for comparison with LLM performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs still has large room for improvement, especially for languages other than English. Furthermore, and despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work may encourage further development to other languages.

Computation and Language

What problem does this paper attempt to address?

The paper aims to address several key issues of large language models (LLMs) in medical question-answering tasks:

1. **Factual Accuracy**: Current LLMs tend to generate content that appears reasonable but is actually inaccurate (hallucination).
2. **Knowledge Update**: The pre-training data of LLMs may not be up to date, so their knowledge lags behind current medical knowledge.
3. **Reasoning Ability Evaluation**: Existing medical question-answering benchmarks lack reference explanations written by professional doctors, making it impossible to evaluate the reasoning process of LLMs when answering medical questions.
4. **Insufficient Multilingual Support**: Most current evaluations are limited to English, with a significant lack of support for other languages.

To address these issues, the authors propose MedExpQA, the first multilingual benchmark based on medical exams, which includes reference explanations for both correct and incorrect options written by professional doctors. With this benchmark, researchers can better evaluate the performance of LLMs in medical question-answering tasks and explore how to leverage external medical knowledge to improve these models' performance. MedExpQA also emphasizes the need to improve LLM performance in languages other than English.

MedExQA: Medical Question Answering Benchmark with Multiple Explanations

A Benchmark for Long-Form Medical Question Answering

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering

Towards Expert-Level Medical Question Answering with Large Language Models

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Large language models encode clinical knowledge

Heterogeneous Knowledge Grounding for Medical Question Answering with Retrieval Augmented Large Language Model

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

MedLM: Exploring Language Models for Medical Question Answering Systems

Towards Evaluating and Building Versatile Large Language Models for Medicine

Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

Integrating UMLS Knowledge into Large Language Models for Medical Question Answering

MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark