Abstract: Evidence-based medicine (EBM) is a paradigm of patient care grounded in the most current and rigorously evaluated research. Recent advances in large language models (LLMs) offer a potential way to transform EBM by automating labor-intensive tasks and thereby improving the efficiency of clinical decision-making. This study explores integrating LLMs into the key stages of EBM, evaluating their ability across evidence retrieval (PICO extraction, biomedical question answering), synthesis (summarizing randomized controlled trials), and dissemination (medical text simplification). We conducted a comparative analysis of seven LLMs, including both proprietary and open-source models, as well as models fine-tuned on medical corpora. Specifically, we benchmarked each LLM on each EBM task under zero-shot settings as a baseline, and then applied prompting techniques, including in-context learning, chain-of-thought reasoning, and knowledge-guided prompting, to enhance their capabilities. Our extensive experiments revealed the strengths of LLMs, such as remarkable understanding capabilities even in zero-shot settings, strong summarization skills, and effective knowledge transfer via prompting. Prompting strategies such as knowledge-guided prompting proved highly effective (e.g., improving the performance of GPT-4 by 13.10% over zero-shot in PICO extraction). However, the experiments also showed limitations, with LLM performance falling well below state-of-the-art baselines such as PubMedBERT on named entity recognition tasks. Moreover, human evaluation revealed persistent challenges with factual inconsistencies and domain inaccuracies, underscoring the need for rigorous quality control before clinical application. This study provides insights into enhancing EBM using LLMs while highlighting critical areas for further research. The code is publicly available on GitHub.

Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

Towards Reducing Diagnostic Errors with Interpretable Risk Prediction

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Benchmarking Large Language Models in Evidence-Based Medicine

A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation

Large Language Model-Driven Evaluation of Medical Records Using MedCheckLLM

A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs)

LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Unlocking the Potential of Free Text in Electronic Health Records with Large Language Models (LLM): Enhancing Patient Safety and Consultation Interactions

LLMs Accelerate Annotation for Medical Information Extraction

Answering real-world clinical questions using large language model based systems

Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

Data extraction for evidence synthesis using a large language model: A proof‐of‐concept study

SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

LLMs in Biomedicine: A study on clinical Named Entity Recognition

LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Scalable information extraction from free text electronic health records using large language models

Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm

Prompting Large Language Models for Zero-Shot Clinical Prediction with Structured Longitudinal Electronic Health Record Data