CancerLLM: A Large Language Model in Cancer Domain

Mingchen Li, Jiatan Huang, Jeremy Yeung, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang
2024-09-01
Abstract: Medical Large Language Models (LLMs) such as ClinicalCamel 70B and Llama3-OpenBioLLM 70B have demonstrated impressive performance on a wide variety of medical NLP tasks. However, there is still no large language model (LLM) specifically designed for the cancer domain. Moreover, these LLMs typically have billions of parameters, making them computationally expensive for healthcare systems. Thus, in this study, we propose CancerLLM, a model with 7 billion parameters and a Mistral-style architecture, pre-trained on 2,676,642 clinical notes and 515,524 pathology reports covering 17 cancer types, followed by fine-tuning on three cancer-relevant tasks, including cancer phenotype extraction and cancer diagnosis generation. Our evaluation demonstrated that CancerLLM achieves state-of-the-art results compared to other existing LLMs, with an average F1 score improvement of 7.61%. Additionally, CancerLLM outperforms other models on two proposed robustness testbeds. This illustrates that CancerLLM can be effectively applied to clinical AI systems, enhancing clinical research and healthcare delivery in the field of cancer.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address two shortcomings of current large language models (LLMs) in the field of cancer:

1. **Lack of specialized large language models for the field of cancer**: Existing medical LLMs such as ClinicalCamel 70B and Llama3-OpenBioLLM 70B perform well on a wide range of medical natural language processing tasks, but they lack knowledge specifically targeted at the field of cancer. This limits their application in cancer diagnosis and treatment planning.
2. **Limitations of computational resources**: Existing large LLMs usually have a huge number of parameters, reaching tens of billions or even hundreds of billions, which makes them difficult to deploy and use in medical institutions with limited computational resources. A model with fewer parameters but strong performance would reduce computational demands and benefit more medical professionals and patients.

To address these issues, the paper proposes CancerLLM, a large language model with 7 billion parameters that is pre-trained and fine-tuned specifically for the field of cancer. Its main goals are to improve performance on cancer phenotype extraction and cancer diagnosis generation, and to operate effectively in environments with limited computational resources.
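
To make the computational-resource argument concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold model weights at 7B versus 70B parameters. It is not taken from the paper: the function name `weight_memory_gb`, the nominal parameter counts, and the choice of 16-bit weights (2 bytes per parameter) are all assumptions made for illustration, and the estimate ignores activations, the KV cache, and optimizer state.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored at the same precision;
    bytes_per_param=2 corresponds to 16-bit (fp16/bf16) weights.
    """
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes).
MODEL_SIZES = {
    "7B model (CancerLLM scale)": 7e9,
    "70B model (ClinicalCamel / OpenBioLLM scale)": 70e9,
}

for name, num_params in MODEL_SIZES.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.0f} GB of weights at 16-bit")
```

Under these assumptions, the weights of a 7B model take roughly 14 GB and those of a 70B model roughly 140 GB. That tenfold gap is the kind of deployment burden the second point refers to.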