Improved Protein Secondary Structure Prediction Using Bidirectional Long Short-Term Memory Neural Network and Bootstrap Aggregating

Wen-Wu Zeng, Ning-Xin Jia, Jun Hu
DOI: https://doi.org/10.1109/icbcb55259.2022.9802482
2022-01-01
Abstract: Accurately predicting protein secondary structure is essential for identifying the structural classes, folds, and tertiary structures of proteins. In this study, we propose an accurate predictor, BiBagPSS, for predicting protein secondary structure by integrating a Bidirectional Long Short-Term Memory (BiLSTM) neural network, a fully connected (FC) neural network, and the bootstrap aggregating (Bagging) strategy. In BiBagPSS, three feature views, i.e., the position-specific scoring matrix (PSSM), the hidden Markov model profile (HMM), and the predicted solvent accessibility probability matrix (PSAPM), are first employed to extract different protein-level features. Secondly, these three features are combined and fed into a stacked neural network composed of BiLSTM and FC units. Thirdly, the predicted secondary structure probability matrix (PSSPM) generated by the trained model is added to the input features, and the model is re-trained. To fully exploit the information available in the training data set, we use bootstrap aggregating to train multiple stacked neural network models. Finally, the secondary structure state of each protein residue is determined by voting over the outputs of these models. Experimental results show that BiBagPSS achieves Q3 scores of 82.39 and 77.30 and Q8 scores of 69.95 and 65.61 on the TEST524 and CASP14set data sets, respectively, which are higher than or comparable to those of most state-of-the-art predictors. Detailed data analyses show that the major advantage of BiBagPSS lies in its use of the PSSPM, which helps extract more discriminative information than the previously used machine learning algorithms. Meanwhile, the Bagging strategy improves the ability of BiBagPSS to mine the available information.
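The bagging and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bootstrap_sample`, `majority_vote`, `q3_score`) are hypothetical, the abstract does not specify the voting rule or how ties are broken (plurality voting with ties going to the earliest model is assumed here), and the BiLSTM/FC models themselves are replaced by hard-coded label strings. Q3 is computed as the percentage of residues whose three-state label (H, E, C) is predicted correctly.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement (one bootstrap replicate)."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(per_model_labels):
    """Combine per-residue label strings from several models by plurality vote.

    Assumption: ties go to the label seen first, i.e. the earliest model.
    """
    length = len(per_model_labels[0])
    if any(len(labels) != length for labels in per_model_labels):
        raise ValueError("all predictions must cover the same residues")
    voted = []
    for column in zip(*per_model_labels):
        # column holds every model's label for a single residue
        voted.append(Counter(column).most_common(1)[0][0])
    return "".join(voted)


def q3_score(predicted, observed):
    """Percentage of residues whose three-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Each bagged model would be trained on its own bootstrap replicate.
rng = random.Random(0)
training_ids = ["protein_a", "protein_b", "protein_c", "protein_d"]  # placeholder IDs
replicate = bootstrap_sample(training_ids, rng)

# Hard-coded stand-ins for the outputs of three trained models on one protein.
model_outputs = ["HHEC", "HEEC", "CHEC"]
consensus = majority_vote(model_outputs)
print(consensus, q3_score(consensus, "HHCC"))
```

In this toy case the three models disagree on the first two residues, and the vote settles both on `H`; three of the four consensus labels then match the observed string `HHCC`.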