A Benchmark for Long-Form Medical Question Answering

Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour

2024-11-20

Abstract: There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the lack of benchmarks for evaluating large language models (LLMs) on long-form medical question answering (QA). Most existing medical QA benchmarks rely on automatic metrics and multiple-choice questions. Although valuable, they do not fully capture the complexity of the real-world clinical applications in which LLMs are deployed. In addition, existing studies of long-form answer generation in medical QA are mostly closed-source and do not release their human medical expert annotations, which makes results hard to reproduce and existing baselines hard to improve. To address these gaps, the authors introduce a publicly available benchmark of real-world consumer medical questions, with long-form answer evaluations annotated by medical doctors. Using this benchmark, they run pairwise comparisons of responses from open and closed-source, medical and general-purpose LLMs, judged on criteria such as correctness, helpfulness, harmfulness, and bias. They also carry out an LLM-as-a-judge analysis to study how well LLM judgments align with human judgments. The preliminary results indicate that open LLMs have strong potential in medical QA compared to leading closed models.
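As a rough illustration of the two analyses described above, the following Python sketch computes per-model win rates from pairwise judgments and measures agreement between a human annotator and an LLM judge with Cohen's kappa. It is not the authors' evaluation code: the record layout `(model_a, model_b, winner)`, the labels `"a"`, `"b"`, and `"tie"`, the model names, and the choice of Cohen's kappa as the agreement statistic are all assumptions made for the example.

```python
from collections import Counter


def win_rates(judgments):
    """Return each model's share of comparisons it won.

    `judgments` is a list of (model_a, model_b, winner) tuples, where
    winner is "a", "b", or "tie" (a hypothetical layout for this sketch).
    Ties count toward a model's total but not toward its wins.
    """
    wins = Counter()
    totals = Counter()
    for model_a, model_b, winner in judgments:
        totals[model_a] += 1
        totals[model_b] += 1
        if winner == "a":
            wins[model_a] += 1
        elif winner == "b":
            wins[model_b] += 1
    return {model: wins[model] / totals[model] for model in totals}


def cohen_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_1)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: four comparisons between two placeholder models.
judgments = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "b"),
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "tie"),
]
human_labels = ["a", "b", "a", "tie"]
judge_labels = ["a", "b", "b", "tie"]

rates = win_rates(judgments)
kappa = cohen_kappa(human_labels, judge_labels)
```

In practice such a computation would be repeated per criterion (correctness, helpfulness, harmfulness, bias), since a model can win on one criterion and lose on another.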
