Tx-LLM: A Large Language Model for Therapeutics

Juan Manuel Zambrano Chaves, Eric Wang, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan, Shekoofeh Azizi

2024-06-10

Abstract: Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free text, allowing it to predict a broad range of associated properties, achieving performance competitive with the state of the art (SOTA) on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery development pipeline.

Computation and Language; Artificial Intelligence; Computational Engineering, Finance, and Science; Machine Learning

What problem does this paper attempt to address?

This paper introduces Tx-LLM, a generalist large language model for therapeutics that covers multiple therapeutic modalities in drug development. Most current AI methods address only narrowly defined tasks, often within a single domain. Tx-LLM, fine-tuned from PaLM-2, instead uses a single set of weights to process diverse chemical and biological entities (small molecules, proteins, nucleic acids, cell lines, and diseases) interleaved with free text, and to predict a broad range of associated properties. It is trained on a collection of 709 datasets targeting 66 tasks that span different stages of the drug discovery pipeline. The model is competitive with the state of the art on 43 of the 66 tasks and exceeds it on 22. It is particularly strong, exceeding best-in-class performance on average, on tasks that combine molecular SMILES representations with text such as cell line names or disease names, which the authors attribute to context likely learned during pretraining. The study also finds evidence of positive transfer between tasks involving different drug types, such as small molecules and proteins, and examines how model size, domain fine-tuning, and prompting strategies affect performance. The authors believe that Tx-LLM is an important step towards LLMs that encode biochemical knowledge and that it may become an end-to-end tool across the drug discovery pipeline in the future.
