Abstract:
Background: Few studies have explored the degree to which fine-tuning a large language model (LLM) can improve its ability to answer a specific set of questions about a research study.
Methods: We created an instruction set comprising 250 marked-down studies of HIV drug resistance, 16 questions per study, answers to each question, and explanations for each answer. The questions were broadly relevant to studies of pathogenic human viruses, including whether a study reported viral genetic sequences and the demographics and antiviral treatments of the persons from whom the sequences were obtained. We fine-tuned GPT-4o-mini (GPT-4o), Llama3.1-8B-Instruct (Llama3.1-8B), and Llama3.1-70B-Instruct (Llama3.1-70B) using quantized low-rank adaptation (QLoRA). We assessed the accuracy, precision, and recall of each base and fine-tuned model in answering the same questions on a test set comprising 120 different studies. Paired t-tests and Wilcoxon signed-rank tests were used to compare base models to one another, fine-tuned models to their respective base models, and the fine-tuned models to one another.
Results: Prior to fine-tuning, GPT-4o displayed significantly greater performance than both Llama3.1-70B and Llama3.1-8B due to its greater precision compared with Llama3.1-70B and its greater precision and recall compared with Llama3.1-8B; there was no difference in performance between Llama3.1-70B and Llama3.1-8B. After fine-tuning, both GPT-4o and Llama3.1-70B, but not Llama3.1-8B, displayed significantly improved performance compared with their base models. The improved performance of GPT-4o resulted from a mean 6% increase in precision and a 9% increase in recall; the improved performance of Llama3.1-70B resulted from a 15% increase in precision. After fine-tuning, Llama3.1-70B significantly outperformed Llama3.1-8B but did not perform as well as the fine-tuned GPT-4o model, which displayed superior recall.
Conclusion: Fine-tuning GPT-4o and Llama3.1-70B, but not the smaller Llama3.1-8B, led to marked improvement in answering specific questions about research papers. The process we describe will be useful to researchers studying other medical domains.
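
The evaluation described in the Methods (per-study accuracy, precision, and recall, followed by a paired comparison between models) can be illustrated with a minimal sketch. It is not the authors' code: it assumes every question has a yes/no answer, which the abstract does not state, and the function names `binary_metrics` and `paired_t_statistic` are hypothetical. It uses only the Python standard library and computes the paired t statistic without a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def binary_metrics(predicted, actual):
    """Return (accuracy, precision, recall) for parallel lists of booleans.

    Assumption: each answer is reduced to yes/no (True/False).
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def paired_t_statistic(base_scores, tuned_scores):
    """Paired t statistic for per-study scores from two models.

    Both lists must cover the same studies in the same order.
    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [t - b for b, t in zip(base_scores, tuned_scores)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

In practice, p-values for both tests named in the Methods could be obtained from SciPy's `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`, applied to the same paired per-study scores.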
