Sequence vs. structure: delving deep into data-driven protein function prediction
Xiaochen Tian, Ziyin Wang, Kevin K. Yang, Jin Su, Hanwen Du, Qiuguo Zheng, Guibing Guo, Min Yang, Fei Yang, Fajie Yuan
DOI: https://doi.org/10.1101/2023.04.02.534383
2023-01-01
Abstract: Predicting protein function is a longstanding challenge that has significant scientific implications. The success of amino acid sequence-based learning methods depends on the relationship between sequence, structure, and function. However, recent advances in AlphaFold have made highly accurate protein structure data more readily available, prompting a fundamental question: given sufficient experimental and predicted structures, should we use structure-based rather than sequence-based learning methods to predict protein function, on the intuition that a protein’s structure is more closely related to its function than its amino acid sequence is? To answer this question, we explore several key factors that affect function prediction accuracy. First, we learn protein representations using state-of-the-art graph neural networks (GNNs) and compare graph construction (GC) methods at the residue and atomic levels. Second, we investigate whether protein structures generated by AlphaFold are as effective as experimental structures for function prediction when protein graphs are used as input. Finally, we compare the accuracy of sequence-only, structure-only, and sequence-structure fusion-based learning methods for predicting protein function. We also report several observations, provide practical tips, and share code and datasets to encourage further research and improve reproducibility.
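To make the idea of residue-level graph construction concrete, the following is a minimal sketch, not the authors' method. It assumes one common choice: each residue is a node positioned at its C-alpha atom, and two residues are connected when those atoms lie within a distance cutoff. The function name `residue_contact_graph`, the 8 Å cutoff, and the toy coordinates are all illustrative assumptions; the abstract does not specify which constructions were compared.

```python
import math


def residue_contact_graph(ca_coords, cutoff=8.0):
    """Build an undirected residue-level contact graph.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    cutoff: distance threshold in angstroms (8.0 is an assumed value).
    Returns a list of edges (i, j) with i < j.
    """
    edges = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            # Connect residues whose C-alpha atoms are within the cutoff.
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


# Toy coordinates (hypothetical, not taken from any real structure).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
toy_edges = residue_contact_graph(toy_coords)
print(toy_edges)
```

In this toy case the first three residues are mutually connected and the fourth is isolated. An atomic-level graph could follow the same pattern with one node per atom, and the resulting edge list could serve as input to a GNN.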
### Competing Interest Statement
The authors have declared no competing interest.