Sequence-Order Frequency Matrix–Sampling and Machine Learning with Smith–Waterman (SOFM–SMSW) for Protein Remote Homology Detection

Sajithra, N., Manikandan, P.
DOI: https://doi.org/10.1007/s11277-024-11617-y
IF: 2.017
2024-10-11
Wireless Personal Communications
Abstract: Protein remote homology detection (PRHD) is crucial for identifying proteins that share function and structure despite low sequence identity. Traditional methods, such as the Sequence-Order Frequency Matrix (SOFM), suffer from high computational complexity. To address this issue, the Sequence-Order Frequency Matrix–Sampling and Machine Learning with Smith–Waterman (SOFM–SMSW) algorithm is proposed to improve the efficiency and accuracy of PRHD. Proportional Volume Sampling (PVS) is introduced to prioritize important protein sequences, which reduces computational complexity. After the protein sequences are sampled, a feature vector is constructed and labeling is performed based on the concatenation of two protein sequences. A substitution score that represents the structural alignment is then learned using k-nearest neighbor (k-NN). Based on the learned substitution score and the alignment score, protein homology is detected using the Smith–Waterman algorithm and a Support Vector Machine (SVM). The proposed SOFM–SMSW algorithm is tested on the SCOP database, and its performance is compared with existing methods including SVM Top-N-gram, SVM pairwise, GPkernel, Long Short-Term Memory (LSTM), SOFM Top-N-gram, and SOFM-SW. The experimental results show that SOFM–SMSW outperforms these methods on accuracy, precision, recall, ROC, and ROC 50. These findings indicate that the SOFM–SMSW algorithm could advance protein remote homology detection by offering a more efficient and accurate solution to an important challenge in bioinformatics.
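For readers unfamiliar with the alignment step named in the abstract, the following is a minimal sketch of the classic Smith–Waterman local alignment score, not the authors' implementation. The function name `smith_waterman_score`, the default match/mismatch values, and the linear gap penalty are all assumptions made for illustration. The `score` callable marks the place where a learned substitution score (such as the k-NN-based one the abstract describes) could be supplied.

```python
from typing import Callable, Optional


def smith_waterman_score(
    a: str,
    b: str,
    score: Optional[Callable[[str, str], float]] = None,
    gap: float = -1,
) -> float:
    """Return the best local alignment score between sequences a and b.

    `score` is a hypothetical substitution function; the default
    (+2 match, -1 mismatch) is an illustrative assumption only.
    `gap` is a linear gap penalty (also an assumption).
    """
    if score is None:
        score = lambda x, y: 2 if x == y else -1

    # Only two rows of the dynamic-programming matrix are kept in memory.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

In a pipeline like the one the abstract outlines, a score of this kind could serve as one input feature to a downstream classifier such as an SVM; how the paper combines the two is not specified here.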
telecommunications