TDFFM: Transformer and Deep Forest Fusion Model for Predicting Coronavirus 3C-Like Protease Cleavage Sites
Qingsong Wang, Ruiquan Ge, Changmiao Wang, Ahmed Elazab, Qiming Fang, Renfeng Zhang
DOI: https://doi.org/10.1109/tcbb.2024.3378470
2024-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: COVID-19, caused by the highly contagious SARS-CoV-2 virus, is distinguished by its positive-sense, single-stranded RNA genome. A thorough understanding of SARS-CoV-2 pathogenesis is crucial for halting its proliferation. Notably, the 3C-like protease of the coronavirus (denoted as 3CL<sup>pro</sup>) is instrumental in the viral replication process. Precise delineation of 3CL<sup>pro</sup> cleavage sites is imperative for elucidating the transmission dynamics of SARS-CoV-2. While machine learning tools have been deployed to identify potential 3CL<sup>pro</sup> cleavage sites, these existing methods often fall short in accuracy. To improve prediction performance, we propose a novel analytical framework, the Transformer and Deep Forest Fusion Model (TDFFM). Within TDFFM, we use the AAindex and the BLOSUM62 matrix to encode protein sequences. The encoded features are then fed into two distinct components: a Deep Forest, which is an effective decision tree ensemble method, and a Transformer equipped with a Multi-Level Attention Model (TMLAM). The attention mechanism allows our model to identify positive samples more accurately, thus enhancing overall predictive performance. Evaluation on a test set shows that TDFFM achieves an accuracy of 0.955, an AUC of 0.980, and an F1-score of 0.367, substantiating the model's superior prediction capabilities.
computer science, interdisciplinary applications, biochemical research methods, mathematics, statistics & probability
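
The abstract reports a high accuracy (0.955) alongside a much lower F1-score (0.367). One common way such a gap arises is class imbalance, where true cleavage sites are rare among candidate positions; whether that is the cause here is an assumption, since the abstract does not state the class distribution. The sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts using the standard definition F1 = 2PR / (P + R). The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts (NOT from the paper): 1000 candidate positions,
# of which only 20 are true cleavage sites.
example = classification_metrics(tp=10, fp=30, tn=950, fn=10)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 0.96 while F1 is about 0.33, because the many correctly rejected negatives dominate accuracy but do not enter precision or recall.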