Abstract: Background: The identification of compound-protein interactions (CPIs) is crucial for drug discovery and for understanding mechanisms of action. Accurate CPI prediction can elucidate drug-target-disease interactions, aiding the discovery of candidate compounds and effective synergistic drugs, particularly from traditional Chinese medicine (TCM). Existing in silico methods are limited in prediction accuracy and generalization because of the diversity of compounds and targets and the lack of large-scale interaction datasets and negative datasets for model training.

Methods: To address these issues, we developed a computational model for CPI prediction by integrating a newly constructed large-scale bioactivity benchmark dataset with a deep learning (DL) algorithm. To verify the accuracy of our CPI model, we applied it to predict the targets of compounds in TCM. The herb pair Astragalus membranaceus and Hedyotis diffusa was used as a model, and the active compounds in this herb pair were collected from public databases and the literature. The complete targets of these active compounds were predicted by the CPI model, resulting in an expanded target dataset. This dataset was then used to predict synergistic antitumor compound combinations. The predicted multi-compound combinations were subsequently examined in in vitro cellular experiments.

Results: Our CPI model outperformed other machine learning models, achieving an area under the receiver operating characteristic curve (AUROC) of 0.98, an area under the precision-recall curve (AUPR) of 0.98, and an accuracy (ACC) of 93.31% on the test set. The model's generalization capability and applicability were further confirmed using external databases. Using this model, we predicted the targets of compounds in the herb pair Astragalus membranaceus and Hedyotis diffusa, yielding an expanded target dataset. We then used this expanded target dataset to predict effective drug combinations with our drug synergy prediction model, DeepMDS. Experimental assays on the breast cancer cell line MDA-MB-231 confirmed the efficacy of the top-ranked predicted multi-compound combinations: Combination I (Epicatechin, Ursolic acid, Quercetin, Aesculetin and Astragaloside IV) exhibited a half-maximal inhibitory concentration (IC50) of 19.41 μM and a combination index (CI) of 0.682, and Combination II (Epicatechin, Ursolic acid, Quercetin, Vanillic acid and Astragaloside IV) displayed an IC50 of 23.83 μM and a CI of 0.805. These results support the model's ability to make accurate and reliable predictions for novel CPI data outside the training dataset, showing good potential for application in drug discovery and in the elucidation of the bioactive compounds in TCM.

Conclusion: Our CPI prediction model can serve as a useful tool for accurately identifying potential CPIs for a wide range of proteins, and is expected to facilitate drug research and repurposing and to support the understanding of TCM.
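For readers unfamiliar with the combination index, the sketch below shows how a CI value is commonly computed and read. It assumes the widely used Chou-Talalay (Loewe-additivity) form, in which CI is the sum, over components, of the dose used in the combination divided by the dose of that component alone that produces the same effect; the abstract does not state which CI definition was applied. The function names and the two-compound doses are hypothetical and are not taken from the study.

```python
from typing import Sequence


def combination_index(combo_doses: Sequence[float],
                      single_doses: Sequence[float]) -> float:
    """Return sum(d_i / D_i) for one fixed effect level (e.g. 50% inhibition).

    combo_doses:  dose of each compound within the combination (d_i).
    single_doses: dose of each compound alone giving the same effect (D_i).
    Assumption: Chou-Talalay / Loewe-additivity form of the index.
    """
    if len(combo_doses) != len(single_doses):
        raise ValueError("dose lists must have the same length")
    if any(d <= 0 for d in single_doses):
        raise ValueError("single-agent doses must be positive")
    return sum(d / D for d, D in zip(combo_doses, single_doses))


def interpret(ci: float, tol: float = 1e-9) -> str:
    """Conventional reading: CI < 1 synergy, CI = 1 additive, CI > 1 antagonism."""
    if ci < 1 - tol:
        return "synergy"
    if ci > 1 + tol:
        return "antagonism"
    return "additive"


# Hypothetical example (illustrative numbers only, NOT data from the study):
# an equimolar two-compound mixture reaches 50% inhibition at 20 uM total
# (10 uM each); alone, the compounds need 40 uM and 25 uM respectively.
ci = combination_index([10.0, 10.0], [40.0, 25.0])
print(round(ci, 3), interpret(ci))
```

Under this convention, the reported CI values of 0.682 and 0.805 are both below 1 and would be read as synergistic, with the lower value indicating the stronger synergy.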
