Optimal strategies for adapting open-source large language models for clinical information extraction: a benchmarking study in the context of ulcerative colitis research

Richard Paul Yim, Anna L Silverman, Shan Wang, Vivek Ashok Rudrapatna
DOI: https://doi.org/10.1101/2024.11.06.24316817
2024-11-07
Abstract: Background: Closed-source large language models (LLMs) like GPT-4o have shown promise for clinical information extraction but are potentially limited by cost, data security concerns, and inflexibility. Open-source models have emerged as an attractive alternative, and many strategies for adapting them have been described in the literature. However, it is currently unclear which adaptation strategies are optimal and how the adapted models ultimately compare to closed-source models. Methods: We studied the effects of three common LLM adaptation strategies: chain-of-thought prompting, few-shot prompting, and fine-tuning (via QLoRA). Our target for information extraction was the Mayo Endoscopic Subscore (MES). We applied these strategies in all combinations to six open-source models (8-70 billion parameters) using an annotated set of colonoscopy procedure reports from two centers: the University of California, San Francisco (N=608) and San Francisco General Hospital (N=217). We analyzed the relationship of these strategies to several performance metrics with a mixed-effects model, accounting for the variability between centers and LLMs. GPT-4o was not subject to QLoRA due to its closed-source nature but was used as a comparator in our benchmarks. We also provide in-depth commentary on the cost-effectiveness of these open-source LLMs and GPT-4o for MES extraction. Results: Across adaptation strategies, QLoRA significantly (p<0.001) improved the performance of open-source LLMs, by 8.3-15.6 percentage points across accuracy, precision, and recall. However, GPT-4o with prompt engineering was superior to the best open-source model by a margin of 2.5-5.4%. Yet a simple cost-effectiveness analysis suggests that GPT-4o is expensive compared to open-source models. Conclusion: GPT-4o is currently the most performant LLM for MES extraction. If it is unavailable, open-source models optimized with QLoRA are a competitive alternative. However, our results also suggest that current instruction-following LLMs, including GPT-4o, do not fully follow user-provided instructions, leaving room for improvement. More work is needed to achieve consistent, near-perfect performance in clinical information extraction by LLMs.