Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning

Yufan Liu, Boxue Tian
DOI: https://doi.org/10.1093/bib/bbad488
IF: 9.5
2024-01-05
Briefings in Bioinformatics
Abstract: Protein–DNA interaction is critical for life activities such as replication, transcription and splicing. Identifying protein–DNA binding residues is essential for modeling their interaction and downstream studies. However, developing accurate and efficient computational methods for this task remains challenging. Improvements in this area have the potential to drive novel applications in biotechnology and drug design. In this study, we propose a novel approach called Contrastive Learning And Pre-trained Encoder (CLAPE), which combines a pre-trained protein language model and the contrastive learning method to predict DNA binding residues. We trained the CLAPE-DB model on the protein–DNA binding sites dataset and evaluated the model performance and generalization ability through various experiments. The results showed that the area under ROC curve values of the CLAPE-DB model on the two benchmark datasets reached 0.871 and 0.881, respectively, indicating superior performance compared to other existing models. CLAPE-DB showed better generalization ability and was specific to DNA-binding sites. In addition, we trained CLAPE on different protein–ligand binding sites datasets, demonstrating that CLAPE is a general framework for binding sites prediction. To facilitate the scientific community, the benchmark datasets and codes are freely available at https://github.com/YAndrewL/clape.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses the prediction of protein–DNA binding sites. Identifying which residues of a protein bind DNA is crucial for modeling protein–DNA interactions and for downstream research, yet computational methods that are both accurate and efficient remain difficult to develop. Existing methods perform unsatisfactorily in practice, and their feature extraction often depends on manual design, which limits further gains in model performance. The paper therefore proposes Contrastive Learning And Pre-trained Encoder (CLAPE), which combines a pre-trained protein language model with contrastive learning to predict DNA-binding residues, with the aim of improving prediction accuracy and efficiency while reducing the dependence on manually designed features. In the reported experiments, the CLAPE-DB model outperforms other existing models on two benchmark datasets (area under the ROC curve of 0.871 and 0.881, respectively) and shows better generalization ability. CLAPE models trained on other protein–ligand binding site datasets further indicate that it is a general framework for binding site prediction from protein sequence information.
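To make the idea of contrastive learning on per-residue embeddings more concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the summary above does not specify the loss, so a generic batch-all triplet loss is swapped in for illustration, and the function name `triplet_contrastive_loss`, the `margin` parameter, and the toy 2-D embeddings are all hypothetical. In practice the embeddings would come from a pre-trained protein language model rather than being written by hand.

```python
import numpy as np


def triplet_contrastive_loss(embeddings, labels, margin=1.0):
    """Illustrative batch-all triplet loss over per-residue embeddings.

    embeddings: (n_residues, dim) array; assumed to be language-model outputs.
    labels:     (n_residues,) array of 0/1, where 1 marks a DNA-binding residue.
    margin:     hypothetical hyperparameter, not taken from the paper.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances between all residue embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self   # (anchor, positive): same class, different residue
    neg_mask = ~same             # (anchor, negative): different class

    # loss[a, p, n] = max(0, d(a, p) - d(a, n) + margin)
    loss = np.maximum(0.0, dist[:, :, None] - dist[:, None, :] + margin)
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]

    if not valid.any():
        return 0.0  # no valid triplet, e.g. only one class present
    return float(loss[valid].mean())


# Toy embeddings: two tight clusters far apart.
toy = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
print(triplet_contrastive_loss(toy, [1, 1, 0, 0]))  # labels match the clusters
print(triplet_contrastive_loss(toy, [1, 0, 1, 0]))  # labels cut across the clusters
```

The loss is zero when binding and non-binding residues already form well-separated clusters, and positive when residues of the same class lie farther apart than residues of different classes. That is the pressure a contrastive objective of this kind applies to the embedding space.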