pLM-DBPs: Enhanced DNA-Binding Protein Prediction in Plants Using Embeddings From Protein Language Models

Suresh Pokharel, Kepha Barasa, Pawel Pratyush, Dukka B KC
DOI: https://doi.org/10.1101/2024.10.04.616755
2024-10-06
Abstract: DNA-binding proteins (DBPs) in plants play critical roles in gene regulation, development, and environmental response. While various machine learning and deep learning models have been developed to distinguish DBPs from non-DNA-binding proteins (NDBPs), most available tools have focused on human and mouse datasets, resulting in sub-optimal performance for plant-based DBP prediction. An efficient framework for DBP prediction in plants would enable precise gene expression control, accelerate crop improvement, enhance stress resilience, and optimize metabolic engineering for agricultural advancement. To address this gap, we developed a tool that leverages a protein language model (pLM) pretrained on millions of sequences. We comprehensively evaluated several prominent protein language models, including ProtT5, Ankh, and ESM-2. By utilizing high-dimensional, information-rich representations from these models, our approach significantly enhances DBP prediction accuracy. Our final model, pLM-DBPs, a feed-forward neural network classifier utilizing ProtT5-based representations, outperformed existing approaches with a Matthews Correlation Coefficient (MCC) of 83.8% on the independent test set. This represents a 10% improvement over the previous state-of-the-art model for plant-based DBP prediction.
Bioinformatics
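
The abstract reports performance as a Matthews Correlation Coefficient (MCC). As a reminder of what that metric measures, the following is a minimal sketch of the standard MCC formula for a binary DBP (1) versus NDBP (0) classifier. The function names and the toy labels are illustrative assumptions and do not come from the paper. MCC ranges from -1 to 1, so the reported 83.8% presumably corresponds to a value of 0.838.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = DBP, 0 = NDBP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Standard Matthews Correlation Coefficient from confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
score = mcc(*counts)
print(counts, score)
```

Unlike plain accuracy, MCC uses all four confusion-matrix cells, which makes it a common choice when the DBP and NDBP classes may be imbalanced.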