Abstract: DNA-binding proteins (DBPs) play a significant role in all phases of genetic processes, including DNA recombination, repair, and modification. They are often utilized in drug discovery as fundamental elements of steroids, antibiotics, and anticancer drugs. Predicting them remains one of the most challenging tasks in proteomics research. Conventional experimental methods for DBP identification are costly and sometimes biased. Therefore, developing powerful computational methods that can accurately and rapidly identify DBPs from sequence information is an urgent need. In this study, we propose a novel deep learning-based method called Deep-WET to accurately identify DBPs from primary sequence information. In Deep-WET, we employed three powerful feature encoding schemes, namely Global Vectors (GloVe), Word2Vec, and fastText, to encode the protein sequence. Subsequently, these three feature sets were combined sequentially and weighted using weights learned through the differential evolution (DE) algorithm. To enhance the predictive performance of Deep-WET, we applied the SHapley Additive exPlanations (SHAP) approach to remove irrelevant features. Finally, the optimal feature subset was input into convolutional neural networks to construct the Deep-WET predictor. Both cross-validation and independent tests indicated that Deep-WET achieved superior predictive performance compared to conventional machine learning classifiers. In addition, in an extensive independent test, Deep-WET outperformed several state-of-the-art methods for DBP prediction, with an accuracy of 78.08%, an MCC of 0.559, and an AUC of 0.805. This performance shows that Deep-WET has a strong capacity to predict DBPs. The web server of Deep-WET and the curated datasets used in this study are available at https://deepwet-dna.monarcatechnical.com/. The proposed Deep-WET is anticipated to serve the community-wide effort for large-scale identification of potential DBPs.
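The weighting step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy embedding vectors, and especially `toy_fitness` are hypothetical, and the DE variant shown (DE/rand/1/bin with weights bounded in [0, 1]) is an assumption, since the abstract does not specify these details. In a real pipeline the fitness would presumably be a validation error of a classifier trained on the weighted features.

```python
import random


def weighted_concat(blocks, weights):
    """Scale each embedding block by its weight, then concatenate in order."""
    out = []
    for block, w in zip(blocks, weights):
        out.extend(w * x for x in block)
    return out


def differential_evolution(fitness, dim, pop_size=12, gens=60, f=0.5, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin minimizer over the box [0, 1]^dim."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(p) for p in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(1.0, max(0.0, v))  # clip to the bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            s = fitness(trial)
            if s <= scores[i]:  # greedy selection: keep the better vector
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_fitness(w):
    """Hypothetical placeholder objective: squared distance to a made-up
    target weight vector. A real objective would score a trained classifier."""
    target = (0.2, 0.5, 0.3)
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


# Toy per-protein vectors standing in for the three embeddings (made-up values).
glove_vec = [0.1, 0.4]
word2vec_vec = [0.3, 0.2, 0.5]
fasttext_vec = [0.7, 0.6]

weights, score = differential_evolution(toy_fitness, dim=3)
combined = weighted_concat([glove_vec, word2vec_vec, fasttext_vec], weights)
print(weights, score, len(combined))
```

One weight per embedding keeps the search space small (three dimensions here), which is one plausible reading of "weighted features"; per-dimension weights would also fit the description but would make the search much larger.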

What problem does this paper attempt to address?

This paper addresses the prediction of DNA-binding proteins (DBPs). DNA-binding proteins play important roles in all stages of genetic processes, such as DNA recombination, repair, and modification, and are often used in drug discovery as basic elements of steroids, antibiotics, and anticancer drugs. However, traditional experimental methods for identifying DNA-binding proteins are both expensive and potentially biased. There is therefore an urgent need for powerful computational methods that can quickly and accurately identify DNA-binding proteins from sequence information. To this end, the authors propose a deep learning-based method, Deep-WET, for accurately identifying DNA-binding proteins from primary sequence information. The method uses three feature encoding schemes, namely Global Vectors, Word2Vec, and fastText, to encode protein sequences. These features are then combined and weighted using weights obtained through the differential evolution (DE) algorithm. To further improve the predictive performance of Deep-WET, the authors applied the SHapley Additive exPlanations (SHAP) method to remove irrelevant features. Finally, the resulting optimal feature subset is fed into a convolutional neural network (CNN) to construct the Deep-WET predictor. In both cross-validation and independent tests, Deep-WET shows superior predictive performance compared to conventional machine learning classifiers. In an extensive independent test, Deep-WET also outperformed several state-of-the-art methods, with an accuracy of 78.08%, a Matthews correlation coefficient (MCC) of 0.559, and an area under the curve (AUC) of 0.805, indicating a strong ability to predict DNA-binding proteins.
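For readers less familiar with the three reported metrics, the sketch below shows how accuracy, MCC, and AUC are conventionally computed for a binary classifier. It uses the standard textbook definitions, not code from the paper; the function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 means DBP."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as half (pairwise Mann-Whitney form)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): true labels and predicted probabilities of being a DBP.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), mcc(tp, tn, fp, fn), auc_from_scores(y_true, scores))
```

Accuracy and MCC depend on the chosen decision threshold, whereas AUC is computed from the raw scores, which is why the three numbers are usually reported together.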

Deep-WET: a deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features

DNA-binding protein prediction based on deep transfer learning

DeepDBS: Identification of DNA-binding sites in protein sequences by using deep representations and random forest

Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks

DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins

DNA Binding Protein Prediction based on Multi-feature Deep Metatransfer Learning

Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm

Prediction of DNA binding proteins using local features and long-term dependencies with primary sequences based on deep learning

DeepPWM-BindingNet: Unleashing Binding Prediction with Combined Sequence and PWM Features

DRBpred: A sequence-based machine learning method to effectively predict DNA- and RNA-binding residues

Protein-DNA Binding Residues Prediction Using a Deep Learning Model with Hierarchical Feature Extraction

LGC-DBP: the method of DNA-binding protein identification based on PSSM and deep learning

TargetDBP: Accurate DNA-Binding Protein Prediction Via Sequence-Based Multi-View Feature Learning

TargetDBP+: Enhancing the Performance of Identifying DNA-Binding Proteins via Weighted Convolutional Features

PreDBP-PLMs: Prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks

ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins

Improved prediction of DNA and RNA binding proteins with deep learning models

pLM-DBPs: Enhanced DNA-Binding Protein Prediction in Plants Using Embeddings From Protein Language Models

Predicting the sequence specificities of DNA-binding proteins by DNA Fine-tuned Language Model with decaying learning rates

Predicting ATP binding sites in protein sequences using Deep Learning and Natural Language Processing

MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery