A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations
Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina Kuttichi Keloth, Kalpana Raja, Jiming Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, Hua Xu
2024-09-30
Abstract: The biomedical literature is rapidly expanding, posing a significant challenge for manual curation and knowledge discovery. Biomedical Natural Language Processing (BioNLP) has emerged as a powerful solution, enabling the automated extraction of information and knowledge from this extensive literature. Large Language Models (LLMs) have recently attracted attention for their impressive performance. However, there remains a critical gap in understanding how effective LLMs are in BioNLP tasks and what they imply for method development and downstream users. Baseline performance data, benchmarks, and practical recommendations for using LLMs in the biomedical domain are currently lacking. To address this gap, we present a systematic evaluation of four representative LLMs: GPT-3.5 and GPT-4 (closed-source), LLaMA 2 (open-source), and PMC LLaMA (domain-specific), across 12 BioNLP datasets covering six applications (named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification). The evaluation is conducted under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning. We compare these models against state-of-the-art (SOTA) approaches that fine-tune (domain-specific) BERT or BART models, which are well-established methods in BioNLP tasks. The evaluation is both quantitative and qualitative; the latter involves manually reviewing a total of hundreds of thousands of LLM outputs for inconsistencies, missing information, and hallucinations in extractive and classification tasks. The qualitative review also examines accuracy, completeness, and readability in text summarization tasks. Additionally, a cost analysis of the closed-source GPT models is conducted.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning