Abstract: Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features using random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix (PSSM). The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with a polynomial kernel, and support vector machine with a radial basis function kernel. DBPPred achieved the highest average accuracy (0.791) and average MCC (0.583) in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. Subsequently, blind tests were performed on the independent dataset PDB186 using the proposed model trained on the entire PDB594 dataset and five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader). In these tests DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. Taken together, the experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale determination of DNA-binding proteins.
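
The wrapper-based selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the feature ranking `ranked` is taken as given (the paper obtains it from random forest, which is not reproduced here), a simple greedy forward pass stands in for the forward best-first search, cross-validation is unstratified and run once rather than ten times, and the data are synthetic. All function names (`fit_gnb`, `predict_gnb`, `cv_accuracy`, `forward_select`) are hypothetical.

```python
import math
import random

def fit_gnb(X, y):
    """Fit Gaussian naive Bayes: per-class log prior, feature means, variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # Small floor keeps variances positive (an assumption for numerical safety).
        varis = [sum((v - m) ** 2 for v in col) / n + 1e-9
                 for col, m in zip(cols, means)]
        model[label] = (math.log(n / len(y)), means, varis)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_post(params):
        log_prior, means, varis = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
            for v, m, s in zip(x, means, varis))
    return max(model, key=lambda label: log_post(model[label]))

def cv_accuracy(X, y, cols, k=5, seed=0):
    """Mean k-fold cross-validated accuracy using only the columns in cols."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_gnb([[X[i][c] for c in cols] for i in train_idx],
                        [y[i] for i in train_idx])
        hits = sum(predict_gnb(model, [X[i][c] for c in cols]) == y[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

def forward_select(X, y, ranked, k=5):
    """Greedy forward wrapper: keep a ranked feature only if CV accuracy rises."""
    selected, best = [], 0.0
    for col in ranked:
        score = cv_accuracy(X, y, selected + [col], k)
        if score > best:
            selected, best = selected + [col], score
    return selected, best

# Synthetic data: columns 0 and 2 carry class signal, column 1 is pure noise.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(2.0 * t, 1.0), rng.gauss(0.0, 1.0), rng.gauss(-1.5 * t, 1.0)]
     for t in y]

# The ranking is assumed to come from an upstream step such as random forest.
selected, acc = forward_select(X, y, ranked=[0, 2, 1])
print("selected columns:", selected, "CV accuracy:", round(acc, 3))
```

The wrapper evaluates each candidate subset with the same classifier that is used for the final prediction, so the selected features are tuned to Gaussian naïve Bayes rather than to a classifier-independent relevance score.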

DBPPred-PDSD: Machine Learning Approach for Prediction of DNA-binding Proteins Using Discrete Wavelet Transform and Optimized Integrated Features Space

SDBP-Pred: Prediction of Single-Stranded and Double-Stranded DNA-binding Proteins by Extending Consensus Sequence and K-segmentation Strategies into PSSM.

Improved Detection of DNA-binding Proteins Via Compression Technology on PSSM Information

Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

Deep-WET: a deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features

Efficient Prediction of DNA-Binding Proteins Using Machine Learning

DRBpred: A sequence-based machine learning method to effectively predict DNA- and RNA-binding residues

Newdna-Prot: Prediction of DNA-binding Proteins by Employing Support Vector Machine and a Comprehensive Sequence Representation.

Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks

Sequence-based Detection of DNA-binding Proteins Using Multiple-View Features Allied with Feature Selection

DeepDNAbP: A Deep Learning-Based Hybrid Approach to Improve the Identification of Deoxyribonucleic Acid-Binding Proteins.

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines.

DBPboost: A Method of Classification of DNA-binding Proteins Based on Improved Differential Evolution Algorithm and Feature Extraction

Protein-DNA Binding Residue Prediction Via Bagging Strategy and Sequence-Based Cube-Format Feature

DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins.

Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm

PredPSD: A Gradient Tree Boosting Approach for Single-Stranded and Double-Stranded DNA Binding Protein Prediction.

Advancing Protein-DNA Binding Site Prediction: Integrating Sequence Models and Machine Learning Classifiers

PreDBP-PLMs: Prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks

Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation

Local-DPP: an Improved DNA-binding Protein Prediction Method by Exploring Local Evolutionary Information.