DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs

Youwei Liang, Ruiyi Zhang, Li Zhang, Pengtao Xie
2023-05-19
Abstract: A ChatGPT-like system for drug compounds could be a game-changer in pharmaceutical research, accelerating drug discovery, enhancing our understanding of structure-activity relationships, guiding lead optimization, aiding drug repurposing, reducing the failure rate, and streamlining clinical trials. In this work, we make an initial attempt towards enabling ChatGPT-like capabilities on drug molecule graphs by developing a prototype system, DrugChat. DrugChat works similarly to ChatGPT: users upload a compound molecule graph and ask various questions about the compound, and DrugChat answers these questions in a multi-turn, interactive manner. The DrugChat system consists of a graph neural network (GNN), a large language model (LLM), and an adaptor. The GNN takes a compound molecule graph as input and learns a representation for this graph. The adaptor transforms the graph representation produced by the GNN into another representation that is acceptable to the LLM. The LLM takes the compound representation transformed by the adaptor and users' questions about this compound as inputs and generates answers. All these components are trained end-to-end. To train DrugChat, we collected instruction tuning datasets which contain 10,834 drug compounds and 143,517 question-answer pairs. The code and data are available at https://github.com/UCSD-AI4H/drugchat
Biomolecules, Machine Learning
What problem does this paper attempt to address?
This paper attempts to build a ChatGPT-like system for analyzing drug molecule graphs, named **DrugChat**. The system aims to accelerate drug discovery by providing instant, interactive analysis: enhancing the understanding of structure-activity relationships (SAR), guiding lead compound optimization, supporting drug repurposing, reducing failure rates, and simplifying clinical trials.

The paper points out that drug discovery and development is a time-consuming and costly process, often taking years and billions of dollars to bring a single drug to market. Traditional methods involve extensive iterative testing and have high late-stage failure rates. Recent advances in computational chemistry and cheminformatics have provided some relief, but there is still an urgent need for tools that can intuitively interpret complex molecule graphs and generate meaningful insights from them. The development of DrugChat therefore aims to:

1. **Accelerate drug discovery**: Significantly shorten the early stages of drug discovery by providing instant insights into the potential therapeutic uses, side effects, and contraindications of drugs.
2. **Predict drug interactions**: By comparing the molecular structures of thousands of known substances, predict potential conflicts or synergistic effects between new candidate drugs and existing drugs, helping researchers better anticipate how new drugs will perform in practice.
3. **Understand structure-activity relationships (SAR)**: Help researchers understand the relationship between the chemical structure of a drug and its biological activity, and predict which structural modifications could enhance its effects or reduce adverse side effects.
4. **Guide lead compound optimization**: Suggest structural modifications that improve efficacy, reduce toxicity, and enhance pharmacokinetic parameters, guiding researchers in the right direction and saving valuable time.
5. **Support drug repurposing**: By understanding the structural properties of existing drugs, identify candidates that may be effective against diseases they were not originally developed for, giving existing drugs new uses and providing faster pathways for treating challenging diseases.
6. **Reduce failure rates**: Help reduce late-stage failures caused by unforeseen toxicity and efficacy issues by providing more accurate predictions of drug properties and effects early in a project.
7. **Simplify clinical trials**: Support the design of more effective clinical trials by predicting how a drug interacts with other drugs or conditions, enabling researchers to target trials more precisely and recruit suitable patient populations.

To pursue these goals, the DrugChat system is composed of a graph neural network (GNN), a large language model (LLM), and an adaptor. The GNN learns a representation of the drug molecule graph, the adaptor converts the graph representation produced by the GNN into a form acceptable to the LLM, and the LLM generates answers from the transformed compound representation together with the user's questions. All these components are trained end-to-end, on instruction tuning datasets comprising 10,834 drug compounds and 143,517 question-answer pairs.
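
The GNN, adaptor, and LLM data flow described above can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the single round of mean-neighbour aggregation, the mean-pooled graph vector, the random linear adaptor, the toy dimensions, and all function names are assumptions chosen only to show how a graph representation might be projected into an LLM's embedding space and placed in front of the question tokens. Training is not shown.

```python
import random

def message_pass(node_feats, edges):
    """One round of mean aggregation over each atom and its bonded
    neighbours (a simple stand-in for a real GNN layer)."""
    n = len(node_feats)
    neighbours = [[i] for i in range(n)]      # each atom includes itself
    for u, v in edges:                        # bonds are undirected
        neighbours[u].append(v)
        neighbours[v].append(u)
    dim = len(node_feats[0])
    return [
        [sum(node_feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbours
    ]

def readout(node_feats):
    """Mean-pool atom vectors into one graph-level vector (assumed pooling)."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]

def make_adaptor(in_dim, out_dim, seed=0):
    """Hypothetical adaptor: a linear map from GNN space to LLM embedding
    space, with fixed random weights in place of learned ones."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]

    def adaptor(graph_vec):
        return [sum(w * x for w, x in zip(row, graph_vec)) for row in weight]

    return adaptor

def build_llm_input(graph_token, question_embeddings):
    """Prepend the projected graph vector to the question token embeddings."""
    return [graph_token] + question_embeddings

# Toy molecule: three atoms in a chain, one-hot-style 4-dim atom features.
GNN_DIM, LLM_DIM = 4, 8
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
bonds = [(0, 1), (1, 2)]

hidden = message_pass(atoms, bonds)
graph_vec = readout(hidden)                   # length GNN_DIM
graph_token = make_adaptor(GNN_DIM, LLM_DIM)(graph_vec)   # length LLM_DIM

# Placeholder embeddings for a 5-token question (zeros, illustration only).
question = [[0.0] * LLM_DIM for _ in range(5)]
llm_input = build_llm_input(graph_token, question)
```

In this sketch the LLM would receive a sequence of six vectors: one carrying the compound and five carrying the question. Whether the real system passes a single pooled vector or several graph-derived vectors to the LLM is not stated in the text above, so the single-token choice here is purely illustrative.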