GcForest-based compound-protein interaction prediction model and its application in discovering small-molecule drugs targeting CD47

Wenying Shan,Lvqi Chen,Hao Xu,Qinghao Zhong,Yinqiu Xu,Hequan Yao,Kejiang Lin,Xuanyi Li
DOI: https://doi.org/10.3389/fchem.2023.1292869
IF: 5.5
2023-10-20
Frontiers in Chemistry
Abstract:Identifying compound–protein interaction plays a vital role in drug discovery. Artificial intelligence (AI), especially machine learning (ML) and deep learning (DL) algorithms, are playing increasingly important roles in compound-protein interaction (CPI) prediction. However, ML relies on learning from large sample data. And the CPI for specific target often has a small amount of data available. To overcome the dilemma, we propose a virtual screening model, in which word2vec is used as an embedding tool to generate low-dimensional vectors of SMILES of compounds and amino acid sequences of proteins, and the modified multi-grained cascade forest based gcForest is used as the classifier. This proposed method is capable of constructing a model from raw data, adjusting model complexity according to the scale of datasets, especially for small scale datasets, and is robust with few hyper-parameters and without over-fitting. We found that the proposed model is superior to other CPI prediction models and performs well on the constructed challenging dataset. We finally predicted 2 new inhibitors for clusters of differentiation 47(CD47) which has few known inhibitors. The IC 50 s of enzyme activities of these 2 new small molecular inhibitors targeting CD47-SIRPα interaction are 3.57 and 4.79 μM respectively. These results fully demonstrate the competence of this concise but efficient tool for CPI prediction.
chemistry, multidisciplinary
What problem does this paper attempt to address?