Abstract:
Background: Few studies have explored the degree to which fine-tuning a large language model (LLM) can improve its ability to answer a specific set of questions about a research study.
Methods: We created an instruction set comprising 250 marked-down studies of HIV drug resistance, 16 questions per study, answers to each question, and explanations for each answer. The questions were broadly relevant to studies of pathogenic human viruses, including whether a study reported viral genetic sequences and the demographics and antiviral treatments of the persons from whom sequences were obtained. We fine-tuned GPT-4o-mini (GPT-4o), Llama3.1-8B-Instruct (Llama3.1-8B), and Llama3.1-70B-Instruct (Llama3.1-70B) using a quantized low-rank adapter (QLoRA). We assessed the accuracy, precision, and recall of each base and fine-tuned model in answering the same questions on a test set comprising 120 different studies. Paired t-tests and Wilcoxon signed-rank tests were used to compare base models to one another, fine-tuned models to their respective base models, and the fine-tuned models to one another.
Results: Prior to fine-tuning, GPT-4o performed significantly better than both Llama3.1-70B and Llama3.1-8B, owing to its greater precision compared with Llama3.1-70B and its greater precision and recall compared with Llama3.1-8B; there was no difference in performance between Llama3.1-70B and Llama3.1-8B. After fine-tuning, both GPT-4o and Llama3.1-70B, but not Llama3.1-8B, displayed significantly improved performance compared with their base models. The improved performance of GPT-4o resulted from a mean 6% increase in precision and a mean 9% increase in recall; the improved performance of Llama3.1-70B resulted from a 15% increase in precision. After fine-tuning, Llama3.1-70B significantly outperformed Llama3.1-8B but did not perform as well as the fine-tuned GPT-4o model, which displayed superior recall.
Conclusion: Fine-tuning GPT-4o and Llama3.1-70B, but not the smaller Llama3.1-8B, led to marked improvement in answering specific questions about research papers. The process we describe will be useful to researchers studying other medical domains.
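
The evaluation described in the Methods can be illustrated with a minimal sketch. It assumes, purely for illustration, that every answer reduces to a yes/no value (the study's questions are not all of that form) and that metrics are computed per study and then compared across models as paired samples. The function names `score_study` and `paired_t_statistic` are hypothetical and do not come from the paper. Only the paired t statistic is shown; the p-value and the Wilcoxon signed-rank test are omitted and would normally come from a statistics library such as SciPy (`scipy.stats.ttest_rel`, `scipy.stats.wilcoxon`).

```python
from math import sqrt
from statistics import mean, stdev


def score_study(predicted, expected):
    """Return (accuracy, precision, recall) for one study's yes/no answers.

    Assumption: answers are booleans, one per question, in the same order.
    """
    pairs = list(zip(predicted, expected))
    tp = sum(p and e for p, e in pairs)          # model said yes, truth is yes
    fp = sum(p and not e for p, e in pairs)      # model said yes, truth is no
    fn = sum(not p and e for p, e in pairs)      # model said no, truth is yes
    tn = sum(not p and not e for p, e in pairs)  # model said no, truth is no
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error.

    Assumption: both lists hold one score per study, in the same study order,
    and the differences are not all identical (otherwise stdev is zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


if __name__ == "__main__":
    # One true positive, one false positive, one false negative, one true negative.
    print(score_study([True, False, True, False], [True, True, False, False]))
    # Hypothetical per-study precision for a fine-tuned model and its base model.
    print(paired_t_statistic([0.9, 0.8, 0.7], [0.8, 0.6, 0.4]))
```

Pairing the scores by study matters here because each model answers the same questions about the same test studies, so the comparison is made on per-study differences rather than on two independent samples.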

Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise

Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge

PMC-LLaMA: toward building open-source language models for medicine

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese

PMC-LLaMA: Towards Building Open-source Language Models for Medicine

Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering

Towards Expert-Level Medical Question Answering with Large Language Models

PMC-LLaMA: Further Finetuning LLaMA on Medical Papers

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Towards Building Multilingual Language Model for Medicine

HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs

Fine-tuned large language models for answering questions about full-text biomedical research studies

Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine