Domain-specific LLM Development and Evaluation – A Case-study for Prostate Cancer

Amara Tariq, Man Luo, Aisha Urooj, Avisha Das, Jiwoong Jeong, Shubham Trivedi, Bhavik Patel, Imon Banerjee
DOI: https://doi.org/10.1101/2024.03.15.24304362
2024-03-19
Abstract: In this work, we present our strategy for developing domain-specific large language models (LLMs) that cover the vocabulary of the target domain and are trained on reliable sources of clinical information. Prostate cancer was chosen as the use case for this study. We collected more than 1.8 million clinical notes and radiology and pathology reports for 15,341 patients treated for prostate cancer at Mayo Clinic across three sites and outpatient clinics. In addition to domain-specific training data, we built domain-specific tokenizers and devised knowledge-guided training strategies for LLM development. During self-supervised training, the LLM was forced to predict domain-specific information by marking clinical terms using a UMLS parser. We evaluated the model on the downstream tasks of clinical information prediction and question answering, using quantitative evaluation and a user evaluation study to measure accuracy, reliability, and information completeness. We compared the domain-specific model against a similarly sized general-purpose model, GPT-2, and a three-times-larger domain-specialized model, BioGPT. Our model outperformed GPT-2 on both tasks by a wide margin. Our model also outperformed BioGPT on clinical information prediction tasks and showed some advantages over BioGPT on question-answering tasks.
Oncology
What problem does this paper attempt to address?
This paper addresses how to develop and evaluate large language models (LLMs) for a specific medical domain, using prostate cancer as the case study. The research team collected more than 1.8 million clinical notes, radiology reports, and pathology reports from 15,341 prostate cancer patients and used them to train a prostate-cancer-specific LLM. They built a domain-specific tokenizer so that the model's vocabulary covers the terminology of the field, and devised a knowledge-guided self-supervised training strategy in which clinical terms are marked with a UMLS parser and the model is forced to predict them. In comparative experiments, the model outperforms GPT-2, a general-purpose model of similar size, by a wide margin on both clinical information prediction and question answering. Against BioGPT, a domain-specialized model three times larger, it performs better on clinical information prediction and shows some advantages on question answering. The study argues that domain-specific models are better suited to clinical decision support and information dissemination because they can better understand and generate disease-related information, which reduces the risk of incorrect or misleading answers. A quantitative evaluation and a user evaluation study were used to measure the accuracy, reliability, and information completeness of the model's outputs. In short, the paper aims to show how to build and optimize LLMs for specific medical domains in order to improve accuracy, reliability, and information completeness in downstream tasks such as medical information processing and patient education.
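The knowledge-guided training idea described above (mark clinical terms, then make the model predict them) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `CLINICAL_TERMS` lexicon is a hypothetical stand-in for a real UMLS parser, the whitespace tokenization replaces the paper's domain-specific tokenizer, and replacing marked tokens with a `[MASK]` placeholder is only one possible reading of "forced to predict"; the function names are likewise invented for illustration.

```python
import re

# Hypothetical mini-lexicon standing in for a UMLS concept matcher (assumption,
# not the parser used in the paper).
CLINICAL_TERMS = ["prostate cancer", "gleason score", "psa", "radical prostatectomy"]


def mark_clinical_terms(text, terms=CLINICAL_TERMS):
    """Split text on whitespace and flag tokens that belong to a clinical term.

    Returns (tokens, flags), where flags[i] is True if tokens[i] is part of
    a lexicon term. Matching ignores case and punctuation.
    """
    tokens = text.split()
    normalized = [re.sub(r"[^\w]", "", tok).lower() for tok in tokens]
    flags = [False] * len(tokens)
    for term in terms:
        words = term.split()
        n = len(words)
        # Slide a window of the term's length over the normalized tokens.
        for i in range(len(tokens) - n + 1):
            if normalized[i:i + n] == words:
                for j in range(i, i + n):
                    flags[j] = True
    return tokens, flags


def mask_marked_tokens(tokens, flags, mask_token="[MASK]"):
    """Replace every flagged token with a mask placeholder.

    The original tokens at the masked positions would serve as the
    prediction targets during self-supervised training.
    """
    return [mask_token if flagged else tok for tok, flagged in zip(tokens, flags)]


if __name__ == "__main__":
    note = "Patient with prostate cancer, PSA rising after radical prostatectomy."
    toks, marks = mark_clinical_terms(note)
    print(mask_marked_tokens(toks, marks))
```

In a real pipeline the flags would more likely be aligned to subword token IDs and used either to choose masked positions or to weight the loss on clinical-term tokens; which of these the paper uses is not stated in the abstract.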