Abstract:
Importance: The ability of large language models (LLMs) to generate high-quality, human-like text has been accompanied by speculation about their application in healthcare, alongside ethical and safety concerns.
Objective: Evaluate LLM performance on medical natural language processing (NLP) tasks, benchmarked against other commercially available tools.
Design: Observational study to evaluate and compare model performance. All models were commercially available and were evaluated without modification.
Setting: The Text Analysis Coding (TAC) 2017 challenge was used to assess ability to perform medical coding using standard MedDRA preferred terms. Text from 55 publicly available de-identified medical transcription reports was annotated to identify pre-defined medical concepts (age, disease/symptom, body structure, medication name, and medication dosage).
Participants: Publicly available, de-identified adverse event and medical transcription reports were used for evaluation.
Exposures: For each task, general LLMs (GPT-3.5-turbo, GPT-4) were compared to commercially available healthcare NLP tools (Microsoft Text Analytics for Health, Amazon Comprehend Medical, IQVIA API Marketplace).
Main Outcomes and Measures: For each NLP task, sensitivity, positive predictive value (PPV), and F1 score were calculated. Because GPT models had variable outputs, the range of each metric over 5 trials is reported.
Results: For MedDRA coding, GPT-4 had similar F1 score performance to healthcare NLP algorithms (GPT-4: 0.67 to 0.73; Microsoft Text Analytics for Health: 0.66; IQVIA API Marketplace: 0.72), while GPT-3.5-turbo had considerably lower performance (0.50 to 0.51). For medical information extraction, LLM performance varied widely across medical concepts; the highest F1 scores were for age (GPT-3.5-turbo: 0.82 to 0.83, GPT-4: 0.84 to 0.87) and medication name (GPT-3.5-turbo: 0.55 to 0.59, GPT-4: 0.70 to 0.76), while F1 scores for disease/symptom, body structure, and medication dosage were lower than those observed for the healthcare NLP tools. GPT-3.5-turbo and GPT-4 generally had lower sensitivity than comparators.
Conclusions and Relevance: In the absence of domain-specific fine-tuning, GPT-4 performed similarly to healthcare-specific NLP tools on some tasks and less accurately on others; GPT-3.5-turbo was consistently less accurate than comparators. To maximize benefit and reduce risk of harm, robust quantitative evaluation for specific tasks should be performed prior to implementing LLMs in medical contexts.
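
The outcome measures named in the abstract (sensitivity, PPV, F1, and the range over repeated trials) can be illustrated with a minimal Python sketch. The abstract does not state how predicted and reference annotations were matched, so exact set matching is assumed here; the function names and the example terms are hypothetical and are not taken from the study.

```python
def score(gold: set, predicted: set) -> dict:
    """Sensitivity, PPV and F1 for one trial (assumes exact-match scoring)."""
    tp = len(gold & predicted)   # terms found and correct
    fp = len(predicted - gold)   # terms reported but not in the reference
    fn = len(gold - predicted)   # reference terms that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


def metric_range(gold: set, trials: list, metric: str = "f1") -> tuple:
    """Minimum and maximum of one metric across repeated trials."""
    values = [score(gold, predicted)[metric] for predicted in trials]
    return min(values), max(values)


# Hypothetical reference annotations and two hypothetical model runs.
gold_terms = {"nausea", "headache", "rash", "dizziness"}
runs = [
    {"nausea", "headache", "fatigue"},
    {"nausea", "headache", "rash"},
]

low, high = metric_range(gold_terms, runs, metric="f1")
print(f"F1 range over {len(runs)} trials: {low:.2f} to {high:.2f}")
```

Reporting the minimum and maximum, rather than a single score, reflects that repeated runs of the same prompt can return different outputs.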

Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses

A framework for human evaluation of large language models in healthcare derived from literature review

Ascle-A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation

How well it works: Benchmarking performance of GPT models on medical natural language processing tasks

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI

Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study

An Investigation of Evaluation Metrics for Automated Medical Note Generation

Generation and evaluation of artificial mental health records for Natural Language Processing

Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Extension of AC–DC Transfer Standards From 100 Down to 2 mV Using RVDs

COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain