AI for Biomedicine in the Era of Large Language Models

Zhenyu Bi, Sajib Acharjee Dip, Daniel Hajialigol, Sindhura Kommu, Hanwen Liu, Meng Lu, Xuan Wang
2024-03-23
Abstract: The capabilities of AI for biomedicine span a wide spectrum, from the atomic level, where it solves partial differential equations for quantum systems, to the molecular level, predicting chemical or protein structures, and further extending to societal predictions like infectious disease outbreaks. Recent advancements in large language models, exemplified by models like ChatGPT, have showcased significant prowess in natural language tasks, such as translating languages, constructing chatbots, and answering questions. When we consider biomedical data, we observe a resemblance to natural language in terms of sequences: biomedical literature and health records presented as text, biological sequences or sequencing data arranged in sequences, or sensor data like brain signals as time series. The question arises: Can we harness the potential of recent large language models to drive biomedical knowledge discoveries? In this survey, we will explore the application of large language models to three crucial categories of biomedical data: 1) textual data, 2) biological sequences, and 3) brain signals. Furthermore, we will delve into large language model challenges in biomedical research, including ensuring trustworthiness, achieving personalization, and adapting to multi-modal data representation.
Computation and Language
What problem does this paper attempt to address?
This paper surveys how large language models (LLMs) such as ChatGPT can be applied to biomedicine to promote knowledge discovery. The authors point out that biomedical data resembles natural language in its sequential structure. This holds for textual data (such as medical literature and health records), biological sequences (such as DNA, RNA, and proteins), and brain signals (time series data). The paper explores how LLMs can be used to process these three types of biomedical data and the challenges that arise in applying them: trustworthiness, personalization, and adaptation to multi-modal data.

For biomedical textual data, the paper gives a detailed introduction to pre-trained models such as SciBERT, ClinicalBERT, BioBERT, BioMegatron, SciFive, PubMedBERT, BioLinkBERT, Galactica, BioGPT, DoT5, GatorTronGPT, and Med-PaLM 2. It discusses their potential in clinical applications (such as clinical treatment planning, report generation, and multi-agent collaboration) and in research applications (such as information extraction and question-answering systems).

The paper also covers the application of LLMs to the analysis of biological sequences, especially DNA, RNA, protein, and multi-omics sequencing data. It highlights models such as Enformer, Nucleotide Transformer, GenSLMs, DNABERT, GENA-LM, and HyenaDNA for tasks including gene expression prediction, virus evolution analysis, and transcription factor binding site identification.

In summary, the paper addresses how large language models can be used effectively to drive knowledge discovery in biomedicine. The stated goal is to handle the complexity and diversity of biomedical data in order to improve diagnostic accuracy, disease prediction, and personalized healthcare.