SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction

Yanwen Huang, Bowen Gao, Yinjun Jia, Hongbo Ma, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan

2024-06-13

Abstract: Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.

Biomolecules, Machine Learning

What problem does this paper attempt to address?

This paper focuses on predicting the biological activity of small molecule-protein interactions. Existing datasets are limited in scale and are not systematically organized: they lack high-quality three-dimensional structure data and comprehensive biological activity labels, which hinders both an in-depth understanding of these interactions and their accurate prediction. To address this, the authors constructed SIU, a large-scale structure-based small molecule-protein interaction dataset containing over 1 million binding structures, each with a real biological activity label.

The characteristics of the SIU dataset include:

1. Large-scale: Over 5.34 million conformations and 1.38 million rigorously annotated biological activity labels.
2. Diversity: Covers 214,686 distinct small molecules and 1,720 unique protein targets, including both active and inactive molecules and multiple types of proteins.
3. High quality: Docking with multiple software packages, followed by consensus filtering, was used to ensure the accuracy of the structural data.
4. Well-organized: Organized systematically by PDB ID and biological activity type, facilitating unbiased biological activity prediction.

By evaluating classical models on the SIU dataset, the study found that unbiased biological activity prediction is challenging but crucial. Compared with the commonly used PDBbind dataset, SIU can improve model performance, and it highlights the importance of distinguishing between the activities of different molecules in a protein pocket.

The paper also discusses the problem of mixing different types of biological activity in existing datasets, pointing out that these types should be treated separately based on their unique properties. Experimental results show significant differences between activity types such as IC50, EC50, Ki, and Kd, so they cannot simply be substituted for one another or merged.
In summary, the introduction of the SIU dataset aims to promote unbiased biological activity prediction and provide a more accurate and comprehensive foundation for drug discovery research.
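
The organization by PDB ID and biological activity type described above can be illustrated with a minimal sketch. This is not the SIU data format or the authors' code: the `Record` fields, the `group_by_pocket_and_assay` helper, and the sample PDB IDs, SMILES strings, and values are all hypothetical placeholders. It only shows the idea of keeping IC50, EC50, Ki, and Kd labels in separate groups per protein pocket rather than merging them.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """One labeled molecule-pocket pair (hypothetical schema, not SIU's)."""
    pdb_id: str       # identifier of the protein pocket structure
    smiles: str       # small molecule
    assay_type: str   # e.g. "IC50", "EC50", "Ki", "Kd"
    value: float      # activity label on some common scale (placeholder)


def group_by_pocket_and_assay(records):
    """Group records so that each group holds one pocket and one assay type."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.pdb_id, r.assay_type)].append(r)
    return dict(groups)


# Made-up example records, for illustration only.
records = [
    Record("1ABC", "CCO", "IC50", 6.2),
    Record("1ABC", "CCN", "IC50", 7.1),
    Record("1ABC", "CCO", "Ki", 6.8),
    Record("2XYZ", "c1ccccc1", "Kd", 5.4),
]

groups = group_by_pocket_and_assay(records)
for (pdb_id, assay_type), members in sorted(groups.items()):
    print(pdb_id, assay_type, len(members))
```

With this grouping, a model can be trained or evaluated on one activity type at a time, and molecules that share a pocket can be compared against each other without labels from a different assay type leaking in.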
