Abstract: Background: Public resources on chemical compounds are growing rapidly in both the quantity of data and the types of data representation. Comprehensively understanding the relationship between the intrinsic features of chemical compounds and their protein targets is essential for evaluating potential protein-binding function in virtual drug screening. Previous studies proposed correlations between bioactivity profiles and target networks, especially when chemical structures were similar. Because effective quantitative methods to uncover such correlations are lacking, it is necessary to integrate information from multiple data sources to produce a comprehensive assessment of the similarity between small molecules, and to use this integrated schema to quantitatively uncover the relationship between compounds and their targets.

Results: In this study, a multi-view based clustering algorithm was introduced to quantitatively integrate compound similarity from both bioactivity profiles and structural fingerprints. First, hierarchical clustering was performed with the fused similarity on 37 compounds curated from PubChem. Compared with clustering in a single view, the integrated similarity increased the overall number of common targets within the fused classes, indicating that multi-view based clustering more effectively identifies clusters whose members share more common targets. Analysis of certain classes revealed that the two views complement each other in describing compounds, which helps to recover similar compounds that are missed when only a single view is applied. Then, a large-scale virtual drug screen based on the fused similarity was performed on 1267 compounds curated from the Connectivity Map (CMap) dataset, and it obtained a better ranking result than the single-view screens. These comprehensive tests indicated that combining different data representations yields an improved assessment of target-specific compound similarity.

Conclusions: Our study presents an efficient, extendable and quantitative computational model for integrating different compound representations, and it is expected to provide new clues for improving virtual drug screening based on various pharmacological properties. Scripts, supplementary materials and data used in this study are publicly available at http://lifecenter.sgst.cn/fusion/ .
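The idea of fusing two views of compound similarity before hierarchical clustering can be illustrated with a minimal sketch. This is not the study's implementation: the abstract does not state the fusion rule, so the equal-weight average used here, the choice of Tanimoto similarity for fingerprints and Pearson correlation for bioactivity profiles, and all names and toy data (`fused_similarity`, `average_linkage`, compounds `A` to `D`) are assumptions made purely for illustration.

```python
from itertools import combinations
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length activity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sqrt(sum((p - mx) ** 2 for p in x))
    sy = sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy) if sx and sy else 0.0


def fused_similarity(fingerprints, profiles, weight=0.5):
    """Weighted average of the two views (an assumed, simplistic fusion rule)."""
    names = sorted(fingerprints)
    sim = {}
    for a, b in combinations(names, 2):
        s_struct = tanimoto(fingerprints[a], fingerprints[b])
        # Rescale the correlation from [-1, 1] to [0, 1] so both views are comparable.
        s_bio = (pearson(profiles[a], profiles[b]) + 1) / 2
        sim[frozenset((a, b))] = weight * s_struct + (1 - weight) * s_bio
    return names, sim


def average_linkage(names, sim, n_clusters):
    """Agglomerative clustering: repeatedly merge the two most similar clusters."""
    clusters = [[n] for n in names]

    def link(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: four compounds, two structural families.
fingerprints = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9}, "D": {7, 8, 10}}
profiles = {
    "A": [1.0, 0.9, 0.1, 0.0],
    "B": [0.9, 1.0, 0.0, 0.1],
    "C": [0.0, 0.1, 1.0, 0.9],
    "D": [0.1, 0.0, 0.9, 1.0],
}

names, sim = fused_similarity(fingerprints, profiles)
clusters = average_linkage(names, sim, n_clusters=2)
print(clusters)
```

With the toy data, the two views agree, so the fused similarity groups `A` with `B` and `C` with `D`. The `weight` parameter is a hypothetical knob for shifting emphasis between the structural and bioactivity views; a real analysis would likely use a more principled fusion scheme and an established clustering library.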

A compound-target pairs dataset: differences between drugs, clinical candidates and other bioactive compounds

DCDB: Drug Combination Database

ChEMBL: a large-scale bioactivity database for drug discovery

The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods

Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents

The ChEMBL database in 2017

Assessment of the significance of patent-derived information for the early identification of compound–target interaction hypotheses

Orthologue chemical space and its influence on target prediction

Making Sense of Large-Scale Kinase Inhibitor Bioactivity Data Sets: A Comparative and Integrative Analysis

MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning

Polypharmacology Directed Compound Data Mining: Identification of Promiscuous Chemotypes with Different Activity Profiles and Comparison to Approved Drugs

The ChEMBL bioactivity database: an update

COMET: Combined Matrix for Elucidating Targets

Quantitatively integrating molecular structure and bioactivity profile evidence into drug-target relationship analysis

A large dataset curation and benchmark for drug target interaction

Finding the most potent compounds using active learning on molecular pairs

Identification of bioactive compounds with popular single-atom modifications: Comprehensive analysis and implications for compound design

MolData, a molecular benchmark for disease and target based machine learning

Attention-based approach to predict drug-target interactions across seven target superfamilies

Cross-Mapping of Protein-Ligand Binding Data Between ChEMBL and PDBbind