A Study of Generative Large Language Model for Medical Research and Healthcare

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A Mitchell, Naykky S Ospina, Mustafa M Ahmed, William R Hogan, Elizabeth A Shenkman, Yi Guo, Jiang Bian, Yonghui Wu
DOI: https://doi.org/10.1038/s41746-023-00958-w
2023-05-23
Abstract: There is enormous enthusiasm and concern about using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, using 277 billion words of mixed clinical and English text with a GPT-3 architecture of 20 billion parameters. GatorTronGPT improves biomedical natural language processing for medical research. NLP models trained on synthetic text generated by GatorTronGPT outperform NLP models trained on real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant difference in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human), and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
Computation and Language
What problem does this paper attempt to address?
This paper explores the potential and challenges of large language models (LLMs) in medical research and clinical care. Specifically:

1. **Developing a Clinical Generative LLM**: The authors developed GatorTronGPT, a clinical generative LLM with 20 billion parameters, trained on 277 billion words of mixed clinical and English text.
2. **Evaluating GatorTronGPT's Performance**: GatorTronGPT was evaluated on multiple benchmark datasets for biomedical relation extraction and question answering, and achieved state-of-the-art performance on several tasks.
3. **Generating Synthetic Clinical Text**: GatorTronGPT was used to generate 20 billion words of synthetic clinical text. This text was then used to train an NLP model, GatorTronS, to test whether synthetic text can be used for clinical research.
4. **Turing Test**: Physicians compared synthetic text with real clinical notes to assess the linguistic readability and clinical relevance of text generated by GatorTronGPT. The results showed that physicians could not reliably distinguish synthetic text from real text.
5. **Exploring the Prospects of LLMs in the Medical Field**: The paper discusses potential applications of generative LLMs in medicine, including generating clinical documentation, assisting in diagnosis, and reducing physicians' paperwork burden. It also points out current limitations of the technology and directions for future research.

In summary, the paper aims to demonstrate the potential of generative LLMs in medical research and clinical care by developing and evaluating GatorTronGPT, and to explore the opportunities and challenges of applying such models in practice.