Abstract: Importance: The ability of large language models (LLMs) to generate high-quality, human-like text has been accompanied by speculation about their application in healthcare, alongside ethical and safety concerns. Objective: Evaluate LLM performance on medical natural language processing (NLP) tasks, benchmarked against other commercially available tools. Design: Observational study to evaluate and compare model performance. All models were commercially available and were evaluated without modification. Setting: The Text Analysis Coding (TAC) 2017 challenge was used to assess ability to perform medical coding using standard MedDRA preferred terms. Text from 55 publicly available de-identified medical transcription reports was annotated to identify pre-defined medical concepts (age, disease/symptom, body structure, medication name, and medication dosage). Participants: Publicly available, de-identified adverse event and medical transcription reports were used for evaluation. Exposures: For each task, general LLMs (GPT-3.5-turbo, GPT-4) were compared to commercially available healthcare NLP tools (Microsoft Text Analytics for Health, Amazon Comprehend Medical, IQVIA API Marketplace). Main Outcomes and Measures: For each NLP task, sensitivity, positive predictive value (PPV) and F1 score were calculated. Because GPT models had variable outputs, the range of metrics over 5 trials is reported. Results: For MedDRA coding, GPT-4 had similar F1 score performance to healthcare NLP algorithms (GPT-4: 0.67 to 0.73; Microsoft Text Analytics for Health: 0.66; IQVIA API Marketplace: 0.72), while GPT-3.5-turbo had considerably lower performance (0.50 to 0.51).
For medical information extraction, LLM performance varied widely across differing medical concepts; the highest F1 scores were for age (GPT-3.5-turbo: 0.82 to 0.83, GPT-4: 0.84 to 0.87) and medication name (GPT-3.5-turbo: 0.55 to 0.59, GPT-4: 0.70 to 0.76), while F1 scores for disease/symptom, body structure, and medication dosage were lower than those observed for the healthcare NLP tools. GPT-3.5-turbo and GPT-4 generally had lower sensitivity than comparators. Conclusions and Relevance: In the absence of domain-specific fine-tuning, GPT-4 performed similarly to healthcare-specific NLP tools on some tasks and less accurately on others; GPT-3.5-turbo was consistently less accurate than comparators. To maximize benefit and reduce risk of harm, robust quantitative evaluation for specific tasks should be performed prior to implementing LLMs in medical contexts.
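
The outcome measures named in the abstract (sensitivity, PPV, F1, and the range over repeated trials) can be illustrated with a minimal Python sketch. It is not the study's evaluation code: the exact-match set comparison, the function names `prf` and `metric_range`, and the toy terms are all assumptions made for illustration, and the abstract does not state how predicted and reference annotations were matched.

```python
from typing import List, Set, Tuple


def prf(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Return (sensitivity, PPV, F1) for one run.

    Assumption: a prediction counts as correct only if it exactly
    matches a reference term; the study may have used other criteria.
    """
    tp = len(gold & predicted)   # terms found and correct
    fp = len(predicted - gold)   # terms reported but not in the reference
    fn = len(gold - predicted)   # reference terms that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f1


def metric_range(
    gold: Set[str], trials: List[Set[str]]
) -> Tuple[Tuple[float, float], ...]:
    """Return (min, max) of each metric across repeated trials.

    Mirrors the idea of reporting a range when model output varies
    from run to run.
    """
    scores = [prf(gold, trial) for trial in trials]
    return tuple((min(col), max(col)) for col in zip(*scores))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold_terms = {"nausea", "headache", "rash", "dizziness"}
    trial_outputs = [
        {"nausea", "headache", "fatigue"},
        {"nausea", "headache", "rash", "dizziness"},
    ]
    sens_rng, ppv_rng, f1_rng = metric_range(gold_terms, trial_outputs)
    print("sensitivity range:", sens_rng)
    print("PPV range:", ppv_rng)
    print("F1 range:", f1_rng)
```

Reporting the minimum and maximum rather than a mean keeps run-to-run variability visible, which matters when a deployment would see only a single model output per document.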

Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4

Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints

D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

How well it works: Benchmarking performance of GPT models on medical natural language processing tasks

Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches

Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain

SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

Prompt-Efficient Fine-Tuning for GPT-like Deep Models to Reduce Hallucination and to Improve Reproducibility in Scientific Text Generation Using Stochastic Optimisation Techniques

DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training

IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical Trials

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

Distilling large language models for matching patients to clinical trials

Large Language Models aren't all that you need