AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding

Lingyan Zheng, Shuiyang Shi, Mingkun Lu, Pan Fang, Ziqi Pan, Hongning Zhang, Zhimeng Zhou, Hanyu Zhang, Minjie Mou, Shijie Huang, Lin Tao, Weiqi Xia, Honglin Li, Zhenyu Zeng, Shun Zhang, Yuzong Chen, Zhaorong Li, Feng Zhu
DOI: https://doi.org/10.1186/s13059-024-03166-1
IF: 17.906
2024-02-03
Genome Biology
Abstract: Protein function annotation has been one of the longstanding issues in biological sciences, and various computational methods have been developed. However, the existing methods suffer from a serious long-tail problem, with a large number of GO families containing few annotated proteins. Herein, an innovative strategy named AnnoPRO was therefore constructed by enabling sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. A variety of case studies based on different benchmarks were conducted, which confirmed the superior performance of AnnoPRO among available methods. Source code and models have been made freely available at: https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272
genetics & heredity, biotechnology & applied microbiology
What problem does this paper attempt to address?
This paper attempts to solve the "long-tail problem" in protein function annotation. Existing annotation methods perform poorly on functional families that contain only a few annotated proteins, so annotation performance declines sharply for these "tail-label" families. The paper proposes a new strategy, AnnoPRO, which aims to improve annotation of "tail-label" families without sacrificing performance on "head-label" families, through multi-scale protein representation, dual-path encoding, and LSTM-based decoding.

### Main problems:
1. **Long-tail problem**: In the Gene Ontology (GO) database, many functional families contain very few annotated proteins, forming a "long-tail distribution". This data imbalance leads to poor performance of existing methods when annotating "tail-label" families.
2. **Limitations of existing methods**: Traditional sequence homology (SH) methods and machine learning (ML) methods both struggle with "tail-label" families, especially when homology is low, where annotation accuracy drops significantly.

### Solutions:
1. **Multi-scale protein representation**: Protein sequences are converted into feature similarity images (ProMAP) and protein similarity vectors (ProSIM), which capture the intrinsic associations between protein features and take global correlations into account.
2. **Dual-path encoding**: A pre-trained seven-channel convolutional neural network (7C-CNN) and a five-layer fully connected deep neural network (5FC-DNN) form the two encoding paths, improving the robustness and generalization ability of the model.
3. **LSTM-based decoding**: A long short-term memory (LSTM) recurrent neural network performs the multi-label annotation, improving annotation performance for "tail-label" families.
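To make the long-tail distribution described under "Main problems" concrete, the following minimal Python sketch counts how many proteins are annotated to each GO term and splits the terms into head and tail groups. It is an illustration only and is not taken from the AnnoPRO code: the protein IDs, the placeholder term names (`GO:A` to `GO:D`), the helper `split_head_tail`, and the `min_proteins` threshold are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy data: protein ID -> set of GO terms.
# The IDs and term names are placeholders, not real GO identifiers.
annotations = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A", "GO:B"},
    "P3": {"GO:A", "GO:C"},
    "P4": {"GO:A"},
    "P5": {"GO:A", "GO:B", "GO:D"},
    "P6": {"GO:A"},
}


def split_head_tail(annotations, min_proteins):
    """Count proteins per GO term and split terms at an assumed threshold.

    Terms with at least `min_proteins` annotated proteins are treated as
    head labels; all others are treated as tail labels.
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    head = sorted(t for t, n in counts.items() if n >= min_proteins)
    tail = sorted(t for t, n in counts.items() if n < min_proteins)
    return counts, head, tail


counts, head, tail = split_head_tail(annotations, min_proteins=3)

# Print terms from most to least frequently annotated.
for term, n in counts.most_common():
    print(term, n)
print("head:", head)
print("tail:", tail)
```

In this toy data set one term covers every protein while two terms have a single annotated protein each, which is the kind of imbalance that leaves a classifier with very few training examples for tail labels. Note that the summary's experimental section groups head and tail labels by GO level rather than by a count threshold, so the threshold here is only a stand-in.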
### Experimental results:
- **Overall performance**: AnnoPRO outperforms eight existing popular methods on multiple benchmark datasets, with an especially large improvement in the annotation of "tail-label" families.
- **Hierarchical performance comparison**: At the "head-label" levels (LEVEL 2 and LEVEL 3), AnnoPRO performs comparably to existing methods; at the "tail-label" levels (LEVEL 4 to LEVEL 10), it performs significantly better than the other methods.
- **Cross-species performance**: AnnoPRO performs well when annotating proteins from different species, especially in the "different species" group (DiffSP), where it is significantly better than DeepGOPlus and PFmulDL.

### Conclusion:
AnnoPRO addresses the "long-tail problem" in protein function annotation through a multi-scale protein representation and a dual-path encoding framework. It improves annotation performance for "tail-label" families while maintaining high accuracy for "head-label" families, and it is expected to become an important tool for tackling the long-standing "long-tail problem" in protein function annotation.