Abstract:

Importance: The ability of large language models (LLMs) to generate high-quality, human-like text has been accompanied by speculation about their application in healthcare, alongside ethical and safety concerns.

Objective: Evaluate LLM performance on medical natural language processing (NLP) tasks, benchmarked against other commercially available tools.

Design: Observational study to evaluate and compare model performance. All models were commercially available and were evaluated without modification.

Setting: The Text Analysis Conference (TAC) 2017 challenge was used to assess the ability to perform medical coding using standard MedDRA preferred terms. Text from 55 publicly available, de-identified medical transcription reports was annotated to identify pre-defined medical concepts (age, disease/symptom, body structure, medication name, and medication dosage).

Participants: Publicly available, de-identified adverse event and medical transcription reports were used for evaluation.

Exposures: For each task, general LLMs (GPT-3.5-turbo, GPT-4) were compared to commercially available healthcare NLP tools (Microsoft Text Analytics for Health, Amazon Comprehend Medical, IQVIA API Marketplace).

Main Outcomes and Measures: For each NLP task, sensitivity, positive predictive value (PPV), and F1 score were calculated. Because GPT models had variable outputs, the range of each metric over 5 trials is reported.

Results: For MedDRA coding, GPT-4 achieved F1 scores similar to those of the healthcare NLP tools (GPT-4: 0.67 to 0.73; Microsoft Text Analytics for Health: 0.66; IQVIA API Marketplace: 0.72), while GPT-3.5-turbo performed considerably worse (0.50 to 0.51). For medical information extraction, LLM performance varied widely across medical concepts; the highest F1 scores were for age (GPT-3.5-turbo: 0.82 to 0.83; GPT-4: 0.84 to 0.87) and medication name (GPT-3.5-turbo: 0.55 to 0.59; GPT-4: 0.70 to 0.76), while F1 scores for disease/symptom, body structure, and medication dosage were lower than those observed for the healthcare NLP tools. GPT-3.5-turbo and GPT-4 generally had lower sensitivity than comparators.

Conclusions and Relevance: In the absence of domain-specific fine-tuning, GPT-4 performed similarly to healthcare-specific NLP tools on some tasks and less accurately on others; GPT-3.5-turbo was consistently less accurate than comparators. To maximize benefit and reduce risk of harm, robust quantitative evaluation for specific tasks should be performed prior to implementing LLMs in medical contexts.
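The outcome measures named in the abstract (sensitivity, PPV, F1, and the range of each over repeated trials) can be illustrated with a minimal sketch. It uses the standard definitions of these metrics, not the authors' actual evaluation code. The `Counts` type, the helper names, and the exact-match comparison of annotation sets are all assumptions made for illustration; the abstract does not say how predicted and reference annotations were matched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple


@dataclass(frozen=True)
class Counts:
    """True positives, false positives, false negatives for one trial (hypothetical container)."""
    tp: int
    fp: int
    fn: int


def counts_from_sets(predicted: Set[str], gold: Set[str]) -> Counts:
    # Assumption: a prediction counts as correct only on an exact match with a gold annotation.
    return Counts(
        tp=len(predicted & gold),
        fp=len(predicted - gold),
        fn=len(gold - predicted),
    )


def sensitivity(c: Counts) -> float:
    # Sensitivity (recall) = TP / (TP + FN); defined as 0.0 when there are no gold items.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def ppv(c: Counts) -> float:
    # Positive predictive value (precision) = TP / (TP + FP); 0.0 when nothing was predicted.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def f1(c: Counts) -> float:
    # F1 = harmonic mean of sensitivity and PPV = 2TP / (2TP + FP + FN).
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def metric_range(trials: Iterable[Counts], metric: Callable[[Counts], float]) -> Tuple[float, float]:
    # Report the (min, max) of a metric across repeated trials of a non-deterministic model.
    values = [metric(c) for c in trials]
    return min(values), max(values)


if __name__ == "__main__":
    # Made-up counts for two trials, for illustration only; these are not results from the study.
    trials = [Counts(tp=8, fp=2, fn=2), Counts(tp=6, fp=2, fn=4)]
    low, high = metric_range(trials, f1)
    print(f"F1 range over {len(trials)} trials: {low:.2f} to {high:.2f}")
```

Reporting a range rather than a single value reflects the abstract's point that GPT outputs vary between runs, whereas a deterministic tool yields one number per metric.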
