MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju S. Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott J. Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah
2023-12-24
Abstract: The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses two obstacles to applying large language models (LLMs) in healthcare:

1. **Evaluation Challenges**: Existing question answering datasets for electronic health record (EHR) data do not reflect the real-world complexity clinicians face, especially their information needs and documentation burden. Evaluating LLMs on realistic medical tasks therefore remains difficult.
2. **Task Diversity**: Clinicians must complete a wide variety of information-related tasks, which are a significant burden and can lead to burnout. Current evaluation methods do not capture the diversity and complexity of these tasks.

To address these issues, the paper makes three main contributions:

1. **MedAlign Dataset**: A benchmark dataset of 983 questions and instructions submitted by practicing clinicians, covering 7 medical specialties. Of these, 303 instructions come with clinician-written reference answers and the corresponding EHR data as context, so the quality of answers generated by different LLMs can be compared against them.
2. **Automatic Instruction-EHR Matching**: A demonstration that instructions can be paired with relevant patient EHRs using a simple retrieval method. Compared to random pairing, this approach improves the relevance of the matches and reduces the error rate.
3. **Automatic Evaluation of LLM Responses**: An analysis of the correlation between clinician rankings and automated natural language generation (NLG) metrics, which could enable large-scale evaluation without human review and reduce the need for clinician annotation in the future.
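The idea of checking an automated metric against clinician rankings can be illustrated with a small sketch. The text above does not name the correlation statistic, so Kendall's tau (the simple tau-a variant, with no tie correction) is used here as an assumption. The model names, ranks, and metric scores are invented placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length numeric sequences.

    Counts concordant and discordant pairs; tied pairs count as neither.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: clinician rank (1 = best) and an automated
# metric score (higher = better) for four placeholder models.
clinician_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
metric_score = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.66, "model_d": 0.50}

models = sorted(clinician_rank)
# Negate ranks so that "larger is better" holds for both sequences.
human = [-clinician_rank[m] for m in models]
metric = [metric_score[m] for m in models]

tau = kendall_tau(human, metric)
print(f"Kendall tau between clinician ranking and metric: {tau:.3f}")
```

A value near 1 would suggest the metric orders models much as the clinicians do, while a value near 0 would suggest it carries little ranking signal. In this toy data the metric swaps only one pair of models (`model_b` and `model_c`), so the agreement is positive but imperfect.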