How well it works: Benchmarking performance of GPT models on medical natural language processing tasks

Hui Feng, Kathryn Rough, Paul B Milligan, Francesco Tombini, Tom Kwon, Khaldoun Zine El Abidine, Christina D Mack, Benjamin Hughes
DOI: https://doi.org/10.1101/2024.06.10.24308699
2024-06-12
Abstract: Importance: The ability of large language models (LLMs) to generate high-quality, human-like text has been accompanied by speculation about their application in healthcare, alongside ethical and safety concerns. Objective: Evaluate LLM performance on medical natural language processing (NLP) tasks, benchmarked against other commercially available tools. Design: Observational study to evaluate and compare model performance. All models were commercially available and were evaluated without modification. Setting: The Text Analysis Coding (TAC) 2017 challenge was used to assess ability to perform medical coding using standard MedDRA preferred terms. Text from 55 publicly available de-identified medical transcription reports was annotated to identify pre-defined medical concepts (age, disease/symptom, body structure, medication name, and medication dosage). Participants: Publicly available, de-identified adverse event and medical transcription reports were used for evaluation. Exposures: For each task, general LLMs (GPT-3.5-turbo, GPT-4) were compared to commercially available healthcare NLP tools (Microsoft Text Analytics for Health, Amazon Comprehend Medical, IQVIA API Marketplace). Main Outcomes and Measures: For each NLP task, sensitivity, positive predictive value (PPV) and F1 score were calculated. Because GPT models had variable outputs, the range of metrics over 5 trials is reported. Results: For MedDRA coding, GPT-4 had similar F1 score performance to healthcare NLP algorithms (GPT-4: 0.67 to 0.73; Microsoft Text Analytics for Health: 0.66, IQVIA API Marketplace: 0.72), while GPT-3.5-turbo had considerably lower performance (0.50 to 0.51).
For medical information extraction, LLM performance varied widely across differing medical concepts; the highest F1 scores were for age (GPT-3.5-turbo: 0.82 to 0.83, GPT-4: 0.84 to 0.87) and medication name (GPT-3.5-turbo: 0.55 to 0.59, GPT-4: 0.70 to 0.76), while F1 scores for disease/symptom, body structure, and medication dosage were lower than those observed for the healthcare NLP tools. GPT-3.5-turbo and GPT-4 generally had lower sensitivity than comparators. Conclusions and Relevance: In the absence of domain-specific fine-tuning, GPT-4 performed similarly to healthcare-specific NLP tools on some tasks and less accurately on others; GPT-3.5-turbo was consistently less accurate than comparators. To maximize benefit and reduce risk of harm, robust quantitative evaluation for specific tasks should be performed prior to implementing LLMs in medical contexts.
Health Informatics
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform on medical natural language processing (NLP) tasks and benchmarks them against commercially available healthcare NLP tools. Specifically, the study addresses three questions:

1. **How do LLMs perform on medical NLP tasks?** In particular, can these models perform tasks such as medical coding and information extraction effectively without domain-specific fine-tuning?
2. **How do LLMs compare with healthcare-specific NLP tools?** The study compares the general-purpose LLMs GPT-3.5-turbo and GPT-4 with Microsoft Text Analytics for Health, Amazon Comprehend Medical, and IQVIA API Marketplace to assess where LLMs perform comparably and where they fall short.
3. **How does LLM performance vary across medical concepts?** The study measures extraction performance for age, disease/symptom, body structure, medication name, and medication dosage to identify where the models work well and where they are limited.

These evaluations are intended to provide empirical evidence to guide the use of LLMs in medical settings and to expose the risks that should be quantified for each task before LLMs are implemented.
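
The abstract reports sensitivity, positive predictive value (PPV), and F1 score for each task, and gives the range of each metric over 5 trials because GPT outputs varied between runs. The sketch below illustrates one way such numbers could be computed. It is not the authors' evaluation code: the exact-match rule, the `(document id, concept)` annotation format, the function names `score` and `metric_range`, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical annotation format: (document id, extracted concept).
Annotation = Tuple[str, str]


def score(gold: Set[Annotation], predicted: Set[Annotation]) -> Dict[str, float]:
    """Sensitivity, PPV, and F1 under an assumed exact-match rule."""
    tp = len(gold & predicted)   # annotations found by both
    fn = len(gold - predicted)   # gold annotations the model missed
    fp = len(predicted - gold)   # model outputs not in the gold set
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


def metric_range(
    gold: Set[Annotation], trials: List[Set[Annotation]]
) -> Dict[str, Tuple[float, float]]:
    """Minimum and maximum of each metric across repeated trials."""
    scores = [score(gold, trial) for trial in trials]
    return {
        name: (min(s[name] for s in scores), max(s[name] for s in scores))
        for name in ("sensitivity", "ppv", "f1")
    }


# Toy data (invented for illustration, not taken from the study).
gold = {("d1", "nausea"), ("d1", "headache"), ("d2", "rash"), ("d2", "fever")}
trials = [
    {("d1", "nausea"), ("d1", "headache"), ("d2", "rash"), ("d2", "cough")},
    {("d1", "nausea"), ("d2", "rash")},
]

for name, (low, high) in metric_range(gold, trials).items():
    print(f"{name}: {low:.2f} to {high:.2f}")
```

Reporting a minimum-to-maximum range, rather than a single score, mirrors how the abstract presents GPT results (for example, "0.67 to 0.73"); tools with deterministic output need only a single value.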