A Computational Model to Identify Fertility-Related Proteins Using Sequence Information

Yan Lin, Jiashu Wang, Xiaowei Liu, Xueqin Xie, De Wu, Junjie Zhang, Hui Ding
DOI: https://doi.org/10.1007/s11704-022-2559-6
IF: 2.6688
2023-01-01
Frontiers of Computer Science
Abstract: Fertility is the most crucial step in the development process and is controlled by many fertility-related proteins, including spermatogenesis-, oogenesis- and embryogenesis-related proteins. Identifying fertility-related proteins can provide important clues for studying the role of these proteins in development. In this study, we therefore constructed a two-layer classifier to identify fertility-related proteins. We first used amino acid (AA) composition and the physical and chemical properties of the amino acids to encode the three classes of fertility-related proteins. The feature set was then optimized by analysis of variance (ANOVA) and incremental feature selection (IFS) to obtain the optimal feature subset. Models built with different machine learning (ML) methods were evaluated and compared through five-fold cross-validation (CV) and independent data tests. Finally, based on the support vector machine (SVM), we obtained a two-layer model to classify the three classes of fertility-related proteins. On the independent test data set, the accuracy (ACC) and the area under the receiver operating characteristic curve (AUC) of the first-layer classifier are 81.95% and 0.89, respectively, and those of the second-layer classifier are 84.74% and 0.90, respectively. These results show that the proposed model has stable performance and satisfactory prediction accuracy, and that it can serve as a powerful tool for identifying more fertility-related proteins.
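The feature pipeline described in the abstract (amino acid composition, ANOVA ranking, then IFS) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`aa_composition`, `anova_f`, `rank_features`, `ifs_subsets`) are hypothetical, only the 20-dimensional amino acid composition is encoded (the physicochemical descriptors are omitted), and the classifier that would score each subset is left out.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order (an assumed feature layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the fraction of each standard amino acid in a protein sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0
    grand = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-class and within-class sums of squares.
    between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items())
    within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def rank_features(X, y):
    """Return feature indices sorted by decreasing ANOVA F score."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def ifs_subsets(ranking):
    """Yield the nested subsets top-1, top-2, ... used by incremental feature selection."""
    for m in range(1, len(ranking) + 1):
        yield ranking[:m]
```

In an IFS loop, each subset produced by `ifs_subsets` would be scored with a classifier (for example an SVM under five-fold cross-validation), and the subset with the best score would be kept as the optimal feature subset.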