Prediction of E. coli Promoters Based on CNN
Peng Bao-Cheng, Zhang Xiao-Wei, Liu Yang, Fan Guo-Liang
DOI: https://doi.org/10.16476/j.pibb.2021.0139
2022-01-01
PROGRESS IN BIOCHEMISTRY AND BIOPHYSICS
Abstract: Objective Prediction models based on the position-specific scoring matrix (PSSM) have achieved good results, and various PSSM-based optimization methods continue to be developed. However, the accuracy rate remains relatively low. To further improve prediction accuracy, this paper carries out further research based on the CNN algorithm. Methods PSSM is used to convert each letter sequence into a numeric matrix, which is then classified by a convolutional neural network (CNN). The Sigma38, Sigma54 and Sigma70 promoter sequences of E. coli K-12 (Escherichia coli K-12, hereinafter referred to as Escherichia coli) are used as the positive sets, and sequences from the coding and non-coding regions of Escherichia coli are used as the negative set. Results In the two-class prediction of Escherichia coli promoters, the accuracy rate reaches 99%, and the success rate of promoter prediction is close to 100%. In the three-class prediction of Sigma38, Sigma54 and Sigma70 promoters, the accuracy rate is 98%, and the prediction accuracy for each of these promoter classes reaches 0.98 or more. Finally, we tried four-class prediction of the Sigma38, Sigma54 and Sigma70 promoters together with either coding-region or non-coding-region sequences, and the prediction accuracy was 0.98. In ten-fold cross-validation on balanced samples of the Sigma promoters, the prediction accuracy exceeds 0.95, the Hamming distance is 0.016, and the Kappa coefficient is 0.97. Conclusion Compared with other classification algorithms such as the support vector machine (SVM), the CNN classification algorithm is more advantageous, and owing to the classification advantages of CNN, the sequence coding method can also be simplified.
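The PSSM encoding step described under Methods can be illustrated with a minimal sketch. The abstract does not give the exact PSSM construction or the CNN architecture, so everything below is an assumption made for illustration: a log-odds PSSM with a pseudocount of 1 and a uniform background of 0.25, an encoding that keeps the score of the observed base at each position, and the hypothetical helper names `build_pssm` and `encode`. The toy sequences are invented and are not data from the paper.

```python
import numpy as np

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def build_pssm(sequences, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (length x 4) from equal-length DNA sequences.

    Assumption: uniform background frequency and an additive pseudocount;
    the paper may use a different construction.
    """
    length = len(sequences[0])
    counts = np.full((length, len(ALPHABET)), pseudocount, dtype=float)
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for pos, base in enumerate(seq):
            counts[pos, INDEX[base]] += 1.0
    # Normalise each position to frequencies, then take log-odds.
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / background)


def encode(seq, pssm):
    """Turn a letter sequence into a numeric matrix of the PSSM's shape.

    Each row keeps only the PSSM score of the base observed at that
    position; the other three entries stay zero.
    """
    encoded = np.zeros_like(pssm)
    for pos, base in enumerate(seq):
        encoded[pos, INDEX[base]] = pssm[pos, INDEX[base]]
    return encoded


# Invented toy "positive set" for illustration only.
toy_promoters = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(toy_promoters)
matrix = encode("TATAAT", pssm)
```

A matrix of this kind (one row per position, one column per nucleotide) is the sort of numeric input a CNN classifier could consume; the network itself is omitted here because the abstract does not describe its layers.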