Abstract:

BACKGROUND: Ligand-binding proteins play key roles in many biological processes. Identifying protein-ligand binding residues is important for understanding the biological functions of proteins. Existing computational methods can be roughly categorized as sequence-based or 3D-structure-based. All of these methods are based on traditional machine learning. Across a series of binding residue prediction tasks, 3D-structure-based methods are generally superior to sequence-based methods. However, because a great number of proteins have known amino acid sequences, sequence-based methods have considerable room for improvement as deep learning develops. Prediction of protein-ligand binding residues with deep learning therefore requires study.

RESULTS: In this study, we propose a new sequence-based approach called DeepCSeqSite for ab initio protein-ligand binding residue prediction. DeepCSeqSite includes a standard edition and an enhanced edition. The classifier of DeepCSeqSite is based on a deep convolutional neural network. Several convolutional layers are stacked on top of each other to extract hierarchical features. The effective context scope grows as the number of convolutional layers increases. This large effective context scope captures long-distance dependencies between residues, and stacking several layers allows the maximum length of those dependencies to be precisely controlled. The extracted features are ultimately combined through one-by-one convolution kernels and softmax to predict whether each residue is a binding residue. The state-of-the-art ligand-binding method COACH and some of its submethods are selected as baselines. The methods are tested on a set of 151 nonredundant proteins and three extended test sets. Experiments show that the Matthews correlation coefficient (MCC) improves by no less than 0.05. In addition, a training data augmentation method that slightly improves the performance is discussed in this study.

CONCLUSIONS: Without using any templates that include 3D-structure data, DeepCSeqSite significantly outperforms existing sequence-based and 3D-structure-based methods, including COACH. Augmentation of the training sets slightly improves the performance. The model, code and datasets are available at https://github.com/yfCuiFaith/DeepCSeqSite.
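
Two quantities mentioned in the abstract can be illustrated with a short sketch: how the effective context scope grows as convolutional layers are stacked, and how the Matthews correlation coefficient is computed from a confusion matrix. This is not the DeepCSeqSite implementation. The function names, the kernel width of 5, and the assumption of stride-1, undilated 1D convolutions are hypothetical choices made only for illustration; the abstract does not state the actual layer configuration.

```python
import math


def receptive_field(kernel_sizes, dilations=None):
    """Context width (in residues) seen by one output position after
    stacking stride-1 1D convolutions with the given kernel sizes.

    Assumption: stride 1 everywhere; dilation defaults to 1 per layer.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    width = 1  # a single residue before any convolution
    for k, d in zip(kernel_sizes, dilations):
        width += (k - 1) * d  # each layer adds (k - 1) * d positions
    return width


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical kernel width of 5; the real model may differ.
    for depth in range(1, 6):
        print(depth, "layers ->", receptive_field([5] * depth), "residues")
    # Hypothetical counts, for illustration only.
    print("MCC:", round(mcc(tp=40, tn=900, fp=30, fn=30), 3))
```

Under these assumptions each added layer of width 5 widens the context by 4 residues, so three stacked layers cover 13 residues. This is the sense in which the depth of the stack bounds the longest dependency the classifier can use.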

Exploring Protein-DNA Binding Residue Prediction and Consistent Interpretability Analysis Using Deep Learning

Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling

Protein-DNA Binding Residues Prediction Using a Deep Learning Model with Hierarchical Feature Extraction

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein–DNA binding site prediction

PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models

Predicting DNA structure using a deep learning method

ULDNA: Integrating Unsupervised Multi-Source Language Models with LSTM-Attention Network for Protein-DNA Binding Site Prediction

Advancing Protein-DNA Binding Site Prediction: Integrating Sequence Models and Machine Learning Classifiers

Structure-based Prediction of Nucleic Acid Binding Residues by Merging Deep Learning- and Template-Based Approaches.

Predicting Protein-Ligand Binding Residues with Deep Convolutional Neural Networks

High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method

Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning

A Novel Sequence-Based Method of Predicting Protein DNA-Binding Residues, Using a Machine Learning Approach

Predicting Protein-Peptide Binding Residues Via Interpretable Deep Learning.

LGC-DBP: the method of DNA-binding protein identification based on PSSM and deep learning

Computational methods for DNA-binding protein and binding residue prediction.

Accurate nucleic acid-binding residue identification based on domain-adaptive protein language model and explainable geometric deep learning

DNABind: A Hybrid Algorithm For Structure-Based Prediction Of DNA-Binding Residues By Combining Machine Learning- And Template-Based Approaches

SAResNet: self-attention residual network for predicting DNA-protein binding

An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences