A Benchmark for Long-Form Medical Question Answering

Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour
2024-11-20
Abstract: There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA benchmarks focus on automatic metrics and multiple-choice questions. Although valuable, these benchmarks cannot fully capture the complexity of the real-world clinical applications where LLMs are being deployed. In addition, existing studies on long-form answer generation are mainly closed-source and do not release their human medical expert annotations, which makes results difficult to reproduce and existing baselines hard to improve. To address these challenges, the authors introduce a new publicly available benchmark of real-world consumer medical questions, with long-form answer evaluations annotated by medical doctors. Using this benchmark, the authors run pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs, judged on criteria such as correctness, helpfulness, harmfulness, and bias. They also carry out a comprehensive LLM-as-a-judge analysis to study how well LLM judgments align with human judgments. Preliminary results show that open LLMs have strong potential in medical QA compared to leading closed models.
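To make the pairwise setup concrete, here is a minimal Python sketch of how win rates and human/LLM-judge agreement could be computed from pairwise labels. It is an illustration only: the label set ("A", "B", "tie"), the example votes, and the choice of Cohen's kappa as the agreement measure are assumptions, not the data format or metrics reported by the authors.

```python
from collections import Counter


def win_rates(labels):
    """Share of comparisons won by response A, response B, or tied."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return {k: counts[k] / len(labels) for k in ("A", "B", "tie")}


def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical votes on six question/answer pairs (not real benchmark data).
human = ["A", "B", "A", "tie", "B", "A"]
judge = ["A", "B", "B", "tie", "B", "A"]

print(win_rates(human))
print(round(cohen_kappa(human, judge), 3))
```

In this toy example the two raters agree on five of six pairs. Kappa discounts the agreement expected by chance, so it comes out lower than the raw agreement rate.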