Benchmarking Recent Computational Tools for DNA-binding Protein Identification

Xizi Luo, Andre Lin, Song Chi, Limsoon Wong, Chowdhury Rafeed Rahman
DOI: https://doi.org/10.1101/2024.09.01.610735
2024-09-03
Abstract: Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control and various other cellular processes. In this paper, we conduct an unbiased benchmarking of nine state-of-the-art computational tools, as well as traditional tools such as ScanProsite and BLAST, for identifying DBPs. We highlight the data leakage issue in conventional datasets, which leads to inflated performance, and we introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques and training methods, and recommend solutions to these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are publicly available via GitHub.
Bioinformatics