ChemGLaM: Chemical Genomics Language Models for Compound-Protein Interaction Prediction

Takuto Koyama, Hayato Tsumura, Shigeyuki Matsumoto, Ryunosuke Okita, Ryosuke Kojima, Yasushi Okuno
DOI: https://doi.org/10.1101/2024.02.13.580100
2024-02-20
Abstract: Accurate prediction of compound-protein interaction (CPI) is of great importance for drug discovery. Creating generalizable deep learning (DL) models for CPI prediction requires expanding CPI data through experimental validation, but the cost of these experiments is a bottleneck. Recently developed large language models (LLMs), such as chemical language models (CLMs) and protein language models (PLMs), have emerged as foundation models, demonstrating high generalization performance in various tasks involving compounds and proteins. Inspired by this, we propose a chemical genomics language model, ChemGLaM, for predicting compound-protein interactions. ChemGLaM builds on two independent language models, MoLFormer for compounds and ESM-2 for proteins, and is fine-tuned on CPI datasets using an interaction block with a cross-attention mechanism. ChemGLaM predicts interactions between unknown compounds and proteins with higher accuracy than existing CPI prediction models, demonstrating that combining independently pre-trained foundation models is effective for obtaining sophisticated representations of compound-protein interactions. Furthermore, visualizing the learned cross-attention map can offer explainable insights into the mechanism of compound-protein interaction. This study emphasizes the potential of integrating independent foundation models for multimodal tasks such as CPI prediction.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the prediction of compound-protein interactions (CPI), an important step in drug discovery. Because experimental verification of CPI is time-consuming and expensive, computational methods are needed to predict these interactions. The paper proposes a chemical genomics language model called ChemGLaM, which combines two independently pre-trained models: MoLFormer for compounds and ESM-2 for proteins. The model is fine-tuned with a cross-attention mechanism to predict interactions between unknown compounds and proteins.

ChemGLaM consists of three components: a compound encoder, a protein encoder, and an interaction block. Because both encoders were pre-trained with self-supervised learning on large-scale unlabeled datasets, the model generalizes better and can be trained on limited CPI data. Visualizing the cross-attention maps can also provide interpretable insights into the mechanism of compound-protein interactions.

The paper evaluates ChemGLaM on four CPI datasets (BindingDB, Davis, PDBbind, and Metz) and compares it with various baseline models. The results show that ChemGLaM outperforms the other models in both classification and regression tasks, with the advantage most pronounced on compound-protein pairs not seen during training. This indicates that combining pre-trained chemical and protein language models is effective for multimodal tasks such as CPI prediction. In short, the paper aims to improve the accuracy of CPI prediction while offering an interpretable view of interaction mechanisms.
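To make the role of the interaction block concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention between per-token compound embeddings and per-residue protein embeddings, followed by mean pooling and a linear head. It is not the authors' implementation. The random embeddings stand in for MoLFormer and ESM-2 outputs, the shared embedding size `dim`, the choice of compound tokens as queries, the absence of learned projection matrices, and all function names are assumptions made purely for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention with no learned projections (assumption).

    queries:     (n_q x d) embeddings of one modality
    keys_values: (n_kv x d) embeddings of the other modality
    Returns the attended queries (n_q x d) and the attention map (n_q x n_kv).
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, keys_values)
    return attended, weights


def mean_pool(m):
    """Average over the token axis to get one fixed-size vector."""
    return [sum(col) / len(m) for col in zip(*m)]


def predict_interaction(compound_emb, protein_emb, head_weights, bias=0.0):
    """Hypothetical classification head: pooled attended features -> probability."""
    attended, attn_map = cross_attention(compound_emb, protein_emb)
    pooled = mean_pool(attended)
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, attn_map


# Random stand-ins for encoder outputs (assumption: both already share size `dim`).
rng = random.Random(0)
dim = 8
compound_emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]   # 5 compound tokens
protein_emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(12)]   # 12 residues
head_weights = [rng.gauss(0, 1) for _ in range(dim)]

prob, attn_map = predict_interaction(compound_emb, protein_emb, head_weights)
print(f"interaction probability: {prob:.3f}")
print(f"attention map shape: {len(attn_map)} x {len(attn_map[0])}")
```

Each row of `attn_map` is a probability distribution over protein residues for one compound token. A map of this kind is what would be visualized to look for interpretable interaction patterns. In a trained model the weights would be learned, and the two encoders would generally need projection layers to bring their embeddings to a common size.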