Leveraging Sequence Embedding and Convolutional Neural Network for Protein Function Prediction

Wei-Cheng Tseng, Po-Han Chi, Jia-Hua Wu, Min Sun
DOI: https://doi.org/10.48550/arXiv.2112.00344
2021-12-01
Abstract: The capability of accurate prediction of protein functions and properties is essential in the biotechnology industry, e.g. drug development and artificial protein synthesis, etc. The main challenges of protein function prediction are the large label space and the lack of labeled training data. Our method leverages unsupervised sequence embedding and the success of deep convolutional neural network to overcome these challenges. In contrast, most of the existing methods delete the rare protein functions to reduce the label space. Furthermore, some existing methods require additional bio-information (e.g., the 3-dimensional structure of the proteins) which is difficult to be determined in biochemical experiments. Our proposed method significantly outperforms the other methods on the publicly available benchmark using only protein sequences as input. This allows the process of identifying protein functions to be sped up.
Quantitative Methods, Artificial Intelligence, Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses three key challenges in protein function prediction:

1. **Large label space**: Proteins perform a very diverse set of functions, so the label space is large. Many existing methods shrink it by deleting rare protein functions, which may lose information.
2. **Lack of labeled data**: Labeled protein function data is relatively scarce, which limits supervised learning. The proposed method uses unsupervised sequence embedding, which lets it make use of unlabeled data.
3. **Dependence on additional biological information**: Some existing methods require additional biological information, such as the three-dimensional structure of proteins, which is difficult to determine in biochemical experiments. The proposed method takes only protein sequences as input.

To address these challenges, the paper proposes a method that combines sequence embedding with deep convolutional neural networks. Its main contributions are:

- **A protein function prediction model that combines sequence embedding and deep convolutional neural networks**, achieving state-of-the-art performance on publicly available datasets.
- **Shorter inference time than existing models**, which can accelerate protein function verification and improve the efficiency of related applications.

With these improvements, the paper offers an efficient and accurate approach to protein function prediction that can help accelerate biotechnology applications such as drug development and artificial protein synthesis.
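To make the general idea concrete, the sketch below shows how per-residue sequence embeddings can feed a 1D convolution, a global max pool, and one independent sigmoid output per function label (a multi-label setup). It is an illustration under stated assumptions, not the authors' architecture: the abstract does not specify layer sizes, kernel widths, or the embedding model, so `toy_embed`, `random_params`, and every dimension here are hypothetical stand-ins, and the weights are random rather than trained.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_embed(sequence, dim, seed=0):
    """Hypothetical stand-in for a pretrained unsupervised sequence embedding:
    maps each residue to a fixed random vector of length `dim`."""
    rng = random.Random(seed)
    table = {aa: [rng.uniform(-1, 1) for _ in range(dim)] for aa in AMINO_ACIDS}
    return [table[aa] for aa in sequence]


def conv1d(embeddings, kernels, biases):
    """Valid 1D convolution along the sequence axis, followed by ReLU.
    embeddings: L x D, kernels: F x K x D, result: (L - K + 1) x F."""
    k = len(kernels[0])
    out = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias + sum(
                w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row)
            )
            row.append(max(0.0, total))
        out.append(row)
    return out


def global_max_pool(feature_map):
    """Collapse the sequence axis so proteins of any length (>= kernel size)
    yield a fixed-size feature vector with one entry per filter."""
    return [max(column) for column in zip(*feature_map)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_functions(embeddings, params):
    """Return one independent probability per function label (multi-label)."""
    features = global_max_pool(
        conv1d(embeddings, params["kernels"], params["conv_bias"])
    )
    return [
        sigmoid(bias + sum(w * f for w, f in zip(w_row, features)))
        for w_row, bias in zip(params["out_weights"], params["out_bias"])
    ]


def random_params(embed_dim, kernel_size, n_filters, n_labels, seed=0):
    """Untrained placeholder weights; a real model would learn these."""
    rng = random.Random(seed)

    def small():
        return rng.uniform(-0.1, 0.1)

    return {
        "kernels": [
            [[small() for _ in range(embed_dim)] for _ in range(kernel_size)]
            for _ in range(n_filters)
        ],
        "conv_bias": [0.0] * n_filters,
        "out_weights": [[small() for _ in range(n_filters)] for _ in range(n_labels)],
        "out_bias": [0.0] * n_labels,
    }


# Example: score a short, made-up sequence against 5 hypothetical labels.
params = random_params(embed_dim=8, kernel_size=3, n_filters=4, n_labels=5)
print(predict_functions(toy_embed("MKTAYIAKQR", dim=8), params))
```

Using a sigmoid per label rather than a softmax over all labels reflects that a protein can have several functions at once. The global max pool is one common way to handle variable-length sequences; the paper may use a different pooling or head.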