SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction

Yanwen Huang, Bowen Gao, Yinjun Jia, Hongbo Ma, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan
2024-06-13
Abstract: Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
This paper addresses bioactivity prediction for small molecule-protein interactions. Existing datasets are limited in scale and in systematic organization: they lack high-quality three-dimensional structural data and comprehensive bioactivity labels, which hinders both in-depth understanding and accurate prediction of these interactions. To address this, the authors constructed SIU, a large-scale structural dataset of small molecule-protein interactions that contains over 1 million binding structures, each with a real bioactivity label. The characteristics of the SIU dataset include:
1. Large scale: over 5.34 million conformations and 1.38 million rigorously annotated bioactivity labels.
2. Diversity: 214,686 distinct small molecules and 1,720 unique protein targets, covering both active and inactive molecules as well as multiple types of proteins.
3. High quality: docking with multiple software packages, followed by consensus filtering, was used to ensure the accuracy of the structural data.
4. Well organized: entries are organized systematically by PDB ID and bioactivity type, which facilitates unbiased bioactivity prediction.
Evaluating classical models on SIU shows that unbiased bioactivity prediction is challenging but crucial. Compared with the commonly used PDBbind dataset, SIU can improve model performance, and the results emphasize the importance of distinguishing between the activities of different molecules in a protein pocket. The paper also discusses the mixing of different bioactivity types in existing datasets and argues that these types should be treated separately because of their distinct properties. Experimental results show significant differences among bioactivity types such as IC50, EC50, Ki, and Kd, so they cannot simply be substituted for one another or merged.
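The organization by PDB ID and bioactivity type described above can be illustrated with a short sketch. It is not the authors' evaluation code: the record fields (`pdb_id`, `assay_type`, `label`, `pred`), the placeholder pocket names, the toy values, and the choice of Pearson correlation as the per-group score are all assumptions made for illustration. The idea shown is only that predictions are scored within one pocket and one bioactivity type at a time, so IC50, EC50, Ki, and Kd labels are never pooled.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)


def per_group_correlation(records, min_size=3):
    """Score predictions separately for each (pocket, bioactivity type) group.

    `records` is an iterable of dicts with the hypothetical keys
    'pdb_id', 'assay_type', 'label', and 'pred'.
    Groups with fewer than `min_size` records are skipped.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["pdb_id"], rec["assay_type"])].append((rec["label"], rec["pred"]))

    scores = {}
    for key, pairs in groups.items():
        if len(pairs) < min_size:
            continue
        labels, preds = zip(*pairs)
        r = pearson(labels, preds)
        if r is not None:
            scores[key] = r
    return scores


# Toy records with made-up values; "pocket_A"/"pocket_B" stand in for PDB IDs.
records = [
    {"pdb_id": "pocket_A", "assay_type": "IC50", "label": 5.0, "pred": 1.0},
    {"pdb_id": "pocket_A", "assay_type": "IC50", "label": 6.0, "pred": 2.0},
    {"pdb_id": "pocket_A", "assay_type": "IC50", "label": 7.0, "pred": 3.0},
    {"pdb_id": "pocket_A", "assay_type": "Ki", "label": 5.0, "pred": 3.0},
    {"pdb_id": "pocket_A", "assay_type": "Ki", "label": 6.0, "pred": 2.0},
    {"pdb_id": "pocket_A", "assay_type": "Ki", "label": 7.0, "pred": 1.0},
    {"pdb_id": "pocket_B", "assay_type": "IC50", "label": 4.0, "pred": 4.0},
    {"pdb_id": "pocket_B", "assay_type": "IC50", "label": 8.0, "pred": 8.0},
]

scores = per_group_correlation(records)
for (pdb_id, assay_type), r in sorted(scores.items()):
    print(f"{pdb_id} {assay_type}: r = {r:+.2f}")
```

In this toy data the same pocket gets opposite scores for its IC50 and Ki groups, which is the kind of difference that would be hidden if the two label types were merged into a single target.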
In summary, the introduction of the SIU dataset aims to promote unbiased biological activity prediction and provide a more accurate and comprehensive foundation for drug discovery research.