Next-Step Conditioned Deep Convolutional Neural Networks Improve Protein Secondary Structure Prediction

Akosua Busia,Navdeep Jaitly
DOI: https://doi.org/10.48550/arXiv.1702.03865
2017-02-14
Abstract:Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we show how to adapt some of these techniques to create a novel chained convolutional architecture with next-step conditioning for improving performance on protein sequence prediction problems. We explore its value by demonstrating its ability to improve performance on eight-class secondary structure prediction. We first establish a state-of-the-art baseline by adapting recent advances in convolutional neural networks which were developed for vision tasks. This model achieves 70.0% per amino acid accuracy on the CB513 benchmark dataset without use of standard performance-boosting techniques such as ensembling or multitask learning. We then improve upon this state-of-the-art result using a novel chained prediction approach which frames the secondary structure prediction as a next-step prediction problem. This sequential model achieves 70.3% Q8 accuracy on CB513 with a single model; an ensemble of these models produces 71.4% Q8 accuracy on the same test set, improving upon the previous overall state of the art for the eight-class secondary structure problem. Our models are implemented using TensorFlow, an open-source machine learning software library available at <a class="link-external link-http" href="http://TensorFlow.org" rel="external noopener nofollow">this http URL</a>; we aim to release the code for these experiments as part of the TensorFlow repository.
Machine Learning,Biomolecules
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to improve the accuracy of protein secondary structure prediction. Specifically, by introducing the latest techniques in deep learning, the authors developed a novel chained convolutional neural network architecture and combined it with the next - step conditional prediction method to improve the performance of the protein sequence prediction problem. ### Problem Background The structure of a protein is crucial to its function. The secondary structure of a protein refers to the spatial arrangement of local amino acid residues in the protein, such as α - helices and β - sheets, etc. Accurately predicting the secondary structure of a protein can help scientists better understand the function of the protein, thereby accelerating drug development and other biomedical research. With the growth in the number of known protein sequences, the experimentally determined secondary structure data are far from keeping up. Therefore, computational methods are becoming increasingly important in protein structure prediction. Traditional machine - learning methods have made some progress in this field, but there is still much room for improvement. ### Core Problems of the Paper 1. **Improving the Prediction Accuracy of a Single Model**: First, by introducing techniques such as Batch Normalization, Dropout, weight - norm constraint, residual connections, and multi - scale convolutional filters, the authors constructed a new convolutional neural network architecture, which significantly increased the Q8 accuracy of a single model on the CB513 benchmark dataset to 70.0%. 2. **Introducing Next - Step Conditional Prediction**: To further improve the prediction performance, the authors redefined the protein secondary structure prediction problem as a chained prediction task, that is, predicting the current label conditionally depending on the previous labels. This method draws on the idea of language models in natural language processing. By introducing the true or predicted labels of the previous step as context during the training process, the model can better capture the dependencies in the sequence. Eventually, this method increased the Q8 accuracy on the CB513 dataset to 70.3%. 3. **Ensemble Model**: To verify the generalization ability of the model, the authors also trained and evaluated an ensemble of multiple independent models. The results show that the performance of the ensemble model is better than that of a single model, achieving a Q8 accuracy of 71.4%, exceeding the best previously reported results. ### Summary The main contributions of this paper are as follows: - Proposing a new convolutional neural network architecture, combined with multiple deep - learning techniques, which significantly improves the accuracy of protein secondary structure prediction. - Introducing the next - step conditional prediction method, which further enhances the prediction performance by modeling the dependencies between labels. - Proving that it is possible to surpass existing methods even without using enhancement techniques such as multi - task learning, demonstrating the powerful performance of the model. These improvements not only help to improve the accuracy of protein secondary structure prediction but also provide new ideas and technical references for future research.