Abstract: Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequence or SMILES representations, we can further enrich the representations and achieve state-of-the-art results for drug-target binding affinity prediction in the established Therapeutic Data Commons (TDC) benchmarks. We release a set of multimodal knowledge graphs, integrating data from seven public data sources and containing over 30 million triples. Our intention is to foster additional research to explore how multimodal knowledge-enhanced protein/molecule embeddings can improve prediction tasks, including prediction of binding affinity. We also release some pretrained models learned from our multimodal knowledge graphs, along with source code for running standard benchmark tasks for prediction of binding affinity.

What problem does this paper attempt to address?

The problem this paper attempts to address is improving the accuracy of predicting the binding affinity between drug molecules and proteins. Specifically, existing studies typically learn representations from large databases (such as molecular SMILES representations and protein sequences) using unsupervised learning techniques, but these representations are often based on limited modalities and do not fully utilize the known relationships between molecules and proteins. This paper proposes a method to further enrich representations by integrating knowledge graphs from different sources and modalities into sequence or SMILES representations, achieving state-of-the-art results in the Therapeutic Data Commons (TDC) benchmark.

### Main Contributions:

1. **Release of Multimodal Knowledge Graphs**: The authors released several multimodal knowledge graphs that integrate data from seven public data sources, containing over 30 million triples.
2. **Pre-trained Models**: Provided pre-trained models and source code on these multimodal knowledge graphs for running standard binding affinity prediction benchmark tasks.
3. **Experimental Results**: Experimental results show that representations enhanced with multimodal knowledge outperform existing methods in predicting drug-protein interactions, even when most entities (such as molecules) in the test data are unseen in the training data and may only have one available modality (such as their SMILES representation).

### Method Overview:

- **Multimodal Knowledge Graph Construction**: Constructed a multimodal knowledge graph framework that extracts and integrates data from various data sources, ensuring the uniqueness of each triple and automatically merging entities with the same unique identifiers.
- **Initial Embedding Calculation**: Assigned pre-trained models to each modality (such as text, numbers, protein sequences, SMILES, etc.) to calculate initial embeddings.
- **Inductive R-GNN Pre-training**: Used Graph Neural Networks (GNN) to propagate initial embeddings and improve representations through multiple layers of transformation. The GNN is inductive, allowing it to compute embeddings for unseen nodes using only available neighbor nodes and corresponding initial embeddings.
- **Information Flow Control**: Explored the impact of controlling the information passed to drug/protein entities during pre-training.
- **Noise Link Handling**: Investigated the impact of noise links in upstream data on downstream tasks.

### Experimental Results:

- **Benchmark Testing**: Conducted experiments on three standard drug-target binding affinity prediction benchmark datasets (DTI DG, DAVIS, KIBA), showing that representations enhanced with multimodal knowledge significantly outperform baseline methods.
- **Ensemble Learning**: Further improved prediction performance by ensembling multiple models pre-trained on different settings and knowledge graphs, achieving state-of-the-art levels.

### Conclusion:

This paper significantly improves the accuracy of drug-protein binding affinity prediction by integrating multimodal knowledge graphs, particularly excelling in handling unseen entities. These results provide new tools and methods for future drug discovery research.
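To make the Method Overview more concrete, the following is a minimal sketch of two of its steps: removing duplicate triples, and one inductive relational message-passing step that also works for a node unseen during pre-training. It is an illustration under stated assumptions, not the authors' implementation. The function names (`dedupe_triples`, `relational_layer`), the entity and relation names, and the toy embeddings are all hypothetical, and a scalar weight per relation stands in for the learned weight matrices and multiple layers that a real R-GNN would use.

```python
from collections import defaultdict


def dedupe_triples(triples):
    """Return the triples with exact duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(triples))


def relational_layer(triples, embeddings, rel_weight, self_weight=1.0):
    """Apply one simplified relational message-passing step.

    For every node, the output is its own embedding scaled by self_weight plus,
    for each relation type, the mean embedding of its incoming neighbours
    scaled by that relation's scalar weight (an assumption made for brevity).
    """
    # incoming[tail][relation] -> list of head embeddings sent along that relation
    incoming = defaultdict(lambda: defaultdict(list))
    for head, relation, tail in triples:
        if head in embeddings and tail in embeddings:
            incoming[tail][relation].append(embeddings[head])

    updated = {}
    for node, own in embeddings.items():
        vec = [self_weight * x for x in own]
        for relation, msgs in incoming[node].items():
            w = rel_weight[relation]
            for i in range(len(vec)):
                vec[i] += w * sum(m[i] for m in msgs) / len(msgs)
        updated[node] = vec
    return updated


# Hypothetical toy graph; the repeated triple mimics overlap between data sources.
raw_triples = [
    ("drugA", "binds", "protP"),
    ("drugA", "binds", "protP"),
    ("drugB", "binds", "protP"),
    ("protP", "member_of", "famF"),
]
triples = dedupe_triples(raw_triples)

# Stand-ins for initial embeddings produced by per-modality pre-trained models.
embeddings = {
    "drugA": [1.0, 0.0],
    "drugB": [0.0, 1.0],
    "protP": [0.5, 0.5],
    "famF": [0.0, 0.0],
}
rel_weight = {"binds": 0.5, "member_of": 1.0, "target_of": 1.0}

updated = relational_layer(triples, embeddings, rel_weight)
print(updated["protP"])  # [0.75, 0.75]

# Inductive use: drugX was never seen before, but it has an initial embedding
# and one known neighbour, so the same layer can compute its representation.
new_embeddings = dict(embeddings, drugX=[2.0, 2.0])
new_triples = triples + [("protP", "target_of", "drugX")]
print(relational_layer(new_triples, new_embeddings, rel_weight)["drugX"])  # [2.5, 2.5]
```

In this sketch, skipping deduplication would let `drugA` count twice in the mean for `protP`, which is one reason triple uniqueness matters before message passing. The layer's parameters are tied to relation types and not to node identities, which is what lets it handle the unseen `drugX`.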

Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery
