Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Yasha Wang, Chengwei Pan, Ewen M. Harrison, Liantao Ma
2024-07-26
Abstract: The use of Large Language Models (LLMs) in medicine is growing, but their ability to handle both structured Electronic Health Record (EHR) data and unstructured clinical notes is not well-studied. This study benchmarks various models, including GPT-based LLMs, BERT-based models, and traditional clinical predictive models, for non-generative medical tasks utilizing renowned datasets. We assessed 14 language models (9 GPT-based and 5 BERT-based) and 7 traditional predictive models using the MIMIC dataset (ICU patient records) and the TJH dataset (early COVID-19 EHR data), focusing on tasks such as mortality and readmission prediction, disease hierarchy reconstruction, and biomedical sentence matching, comparing both zero-shot and finetuned performance. Results indicated that LLMs exhibited robust zero-shot predictive capabilities on structured EHR data when using well-designed prompting strategies, frequently surpassing traditional models. However, for unstructured medical texts, LLMs did not outperform finetuned BERT models, which excelled in both supervised and unsupervised tasks. Consequently, while LLMs are effective for zero-shot learning on structured data, finetuned BERT models are more suitable for unstructured texts, underscoring the importance of selecting models based on specific task requirements and data characteristics to optimize the application of NLP technology in healthcare.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper evaluates large language models (LLMs) on non-generative medical tasks, focusing on how well they handle structured electronic health record (EHR) data and unstructured clinical notes. The researchers break the problem into two groups of questions:

1. **Structured EHR data in non-generative clinical prediction tasks**:
   - How do LLMs compare with traditional small expert models?
   - Can enhanced prompting strategies improve LLMs' understanding of structured medical data and their prediction accuracy?
   - In zero-shot or few-shot settings, can LLMs make predictions directly on new datasets, and what does this show about their generalization ability in clinical applications?
2. **Unstructured clinical free-text data in non-generative clinical NLP tasks**:
   - Do LLMs outperform traditional BERT-based models in supervised tasks that extract clinical semantics from clinical notes?
   - Do LLMs show a deeper understanding of clinical concepts and better embedding capabilities in unsupervised tasks?

To answer these questions, the researchers designed a benchmarking framework covering several representative tasks: in-hospital mortality prediction, 30-day readmission prediction, medical sentence matching, and ICD code clustering. The tasks were evaluated on two widely used datasets (MIMIC and TJH) to support the generalizability and reliability of the results. From these evaluations, the researchers aim to give medical researchers practical recommendations for model selection and to map out where LLMs are and are not applicable across clinical tasks.
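To make the structured-EHR prompting question more concrete, here is a minimal sketch of how a sequence of visits might be serialized into a zero-shot prompt. It is an illustration only and not the paper's actual template: the function names `format_visit` and `build_prompt`, the feature names, the units, the sample values, and the prompt wording are all assumptions chosen for this example.

```python
from typing import Mapping, Optional, Sequence


def format_visit(visit: Mapping[str, Optional[float]], units: Mapping[str, str]) -> str:
    """Render one visit as 'name: value unit' pairs, marking missing values explicitly."""
    parts = []
    for name, value in visit.items():
        if value is None:
            # Structured EHR data is often sparse; say so rather than silently imputing.
            parts.append(f"{name}: not recorded")
        else:
            unit = units.get(name, "")
            parts.append(f"{name}: {value:g} {unit}".rstrip())
    return "; ".join(parts)


def build_prompt(
    visits: Sequence[Mapping[str, Optional[float]]],
    units: Mapping[str, str],
    task: str,
) -> str:
    """Assemble a zero-shot prompt from a patient's visits (hypothetical wording)."""
    lines = [
        "You are assisting with a clinical prediction task.",
        f"Task: {task}",
        "Patient record (one line per visit, oldest first):",
    ]
    for i, visit in enumerate(visits, start=1):
        lines.append(f"Visit {i}: {format_visit(visit, units)}")
    lines.append("Answer with a single probability between 0 and 1.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values for illustration; not drawn from MIMIC or TJH.
    example_units = {"heart_rate": "bpm", "lactate": "mmol/L"}
    example_visits = [
        {"heart_rate": 88.0, "lactate": None},
        {"heart_rate": 104.0, "lactate": 3.2},
    ]
    print(build_prompt(example_visits, example_units, "in-hospital mortality"))
```

The resulting string would be sent to whichever LLM is under evaluation. Attaching units and flagging missing values are the kind of context an enhanced prompting strategy might add, though the specific choices here are assumptions.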