Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery

Hoang Thanh Lam, Marco Luca Sbodio, Marcos Martínez Galindo, Mykhaylo Zayats, Raúl Fernández-Díaz, Víctor Valls, Gabriele Picco, Cesar Berrospi Ramis, Vanessa López
2023-10-20
Abstract: Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequence or SMILES representations, we can further enrich the representation and achieve state-of-the-art results for drug-target binding affinity prediction in the established Therapeutic Data Commons (TDC) benchmarks. We release a set of multimodal knowledge graphs, integrating data from seven public data sources and containing over 30 million triples. Our intention is to foster additional research to explore how multimodal knowledge-enhanced protein/molecule embeddings can improve prediction tasks, including prediction of binding affinity. We also release some pretrained models learned from our multimodal knowledge graphs, along with source code for running standard benchmark tasks for prediction of binding affinity.
Machine Learning, Artificial Intelligence, Biomolecules
What problem does this paper attempt to address?
This paper aims to improve the accuracy of predicting the binding affinity between drug molecules and proteins. Existing studies typically learn representations from large databases (such as molecular SMILES strings and protein sequences) using unsupervised learning techniques, but these representations are often based on a limited set of modalities and do not exploit the known relationships between molecules and proteins. The paper proposes enriching the sequence or SMILES representations with knowledge graphs from different sources and modalities, and reports state-of-the-art results on the Therapeutic Data Commons (TDC) benchmarks.

### Main Contributions:

1. **Release of Multimodal Knowledge Graphs**: The authors release a set of multimodal knowledge graphs that integrate data from seven public data sources and contain over 30 million triples.
2. **Pre-trained Models**: The authors provide models pre-trained on these multimodal knowledge graphs, along with source code for running standard binding affinity prediction benchmark tasks.
3. **Experimental Results**: Representations enhanced with multimodal knowledge outperform existing methods in predicting drug-protein interactions, even when most entities (such as molecules) in the test data are unseen in the training data and may have only one available modality (such as their SMILES representation).

### Method Overview:

- **Multimodal Knowledge Graph Construction**: A multimodal knowledge graph framework extracts and integrates data from various data sources, ensuring the uniqueness of each triple and automatically merging entities that share the same unique identifier.
- **Initial Embedding Calculation**: A pre-trained model is assigned to each modality (such as text, numbers, protein sequences, SMILES, etc.) to calculate initial embeddings.
- **Inductive R-GNN Pre-training**: A Graph Neural Network (GNN) propagates the initial embeddings and improves the representations through multiple layers of transformation. The GNN is inductive, so it can compute embeddings for unseen nodes using only their available neighbor nodes and the corresponding initial embeddings.
- **Information Flow Control**: The authors explore the impact of controlling the information passed to drug/protein entities during pre-training.
- **Noisy Link Handling**: The authors investigate the impact of noisy links in upstream data on downstream tasks.

### Experimental Results:

- **Benchmark Testing**: Experiments on three standard drug-target binding affinity prediction benchmark datasets (DTI DG, DAVIS, KIBA) show that representations enhanced with multimodal knowledge significantly outperform baseline methods.
- **Ensemble Learning**: Ensembling multiple models pre-trained with different settings and knowledge graphs further improves prediction performance, reaching state-of-the-art levels.

### Conclusion:

By integrating multimodal knowledge graphs, this paper significantly improves the accuracy of drug-protein binding affinity prediction, and the approach performs particularly well on unseen entities. These results provide new tools and methods for future drug discovery research.
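
The graph construction step in the method overview (unique triples, with entities merged when they share a unique identifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the `KnowledgeGraph` class, its `add_entity` and `add_triple` methods, the identifiers `protein:P00001` and `drug:D1`, and the `binds_to` relation are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy triple store: unique triples, entities merged by identifier."""

    # entity id -> {modality name: raw value}
    entities: dict = field(default_factory=dict)
    # set of (subject id, relation, object id)
    triples: set = field(default_factory=set)

    def add_entity(self, uid, **modalities):
        # Entities sharing a unique identifier collapse into one node;
        # attributes from later sources are merged into the same record.
        self.entities.setdefault(uid, {}).update(modalities)

    def add_triple(self, subject, relation, obj):
        # Make sure both endpoints exist as nodes, then store the triple.
        # A set keeps each (subject, relation, object) triple only once.
        self.entities.setdefault(subject, {})
        self.entities.setdefault(obj, {})
        self.triples.add((subject, relation, obj))


kg = KnowledgeGraph()

# Two hypothetical sources describe the same protein under one identifier.
kg.add_entity("protein:P00001", sequence="MKT")
kg.add_entity("protein:P00001", description="example kinase")
kg.add_entity("drug:D1", smiles="CCO")

# The same link arrives from two sources; only one copy is kept.
kg.add_triple("drug:D1", "binds_to", "protein:P00001")
kg.add_triple("drug:D1", "binds_to", "protein:P00001")

print(len(kg.entities), len(kg.triples))  # prints: 2 1
```

In this sketch the merged protein node ends up carrying both a sequence and a text description, which is the kind of multi-modality node that a per-modality initial embedding step would then consume.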