SCOP: A Sequence-Structure Contrast-Aware Framework for Protein Function Prediction

Runze Ma, Chengxin He, Huiru Zheng, Xinye Wang, Haiying Wang, Yidan Zhang, Lei Duan
DOI: https://doi.org/10.48550/arXiv.2411.11366
2024-11-18
Abstract: Improving the ability to predict protein function can potentially facilitate research in the fields of drug discovery and precision medicine. Technically, the properties of proteins are directly or indirectly reflected in their sequence and structure information, especially as the protein function is largely determined by its spatial properties. Existing approaches mostly focus on protein sequences or topological structures, while rarely exploiting the spatial properties and ignoring the relevance between sequence and structure information. Moreover, obtaining annotated data to improve protein function prediction is often time-consuming and costly. To this end, this work proposes a novel contrast-aware pre-training framework, called SCOP, for protein function prediction. We first design a simple yet effective encoder to integrate the protein topological and spatial features under the structure view. Then a convolutional neural network is utilized to learn the protein features under the sequence view. Finally, we pretrain SCOP by leveraging two types of auxiliary supervision to explore the relevance between these two views and thus extract informative representations to better predict protein function. Experimental results on four benchmark datasets and one self-built dataset demonstrate that SCOP provides more specific results, while using less pre-training data.
Biomolecules
What problem does this paper attempt to address?
This paper addresses three key problems in protein function prediction:

1. **Scarcity of protein labels**: One of the main challenges for existing protein function prediction methods is the lack of labeled data. Data on the physicochemical properties and biological functions of proteins are usually obtained through time-consuming and costly wet-laboratory experiments, so such data are scarce.
2. **Insufficient learning of structural features**: The function of a protein is largely determined by its spatial structure. However, existing sequence-based methods often ignore the spatial structure of proteins, and most structure-based methods consider only the two-dimensional topological structure and ignore the spatial features of specific conformations in three-dimensional space, so the learned representations are incomplete.
3. **Under-utilization of the correlation between sequence and structure**: Sequence descriptors and structure descriptors describe proteins at different levels. However, existing methods either learn protein representations from only one view or simply extract features from sequences and structures separately. They fail to exploit the correlation between the two, so the learned representations may not be comprehensive.

To address these problems, the paper proposes a contrast-aware pre-training framework named SCOP (Sequence-Structure Contrast-Aware Pre-training) for protein function prediction. Its main features are:

- **A protein structure encoder** that integrates the topological and spatial features of proteins.
- **Use of the supervision information in protein sequence-structure pairings** to explore the correlation between the two views.
- **A contrast-aware pre-training framework** that can learn protein representations without label information.

Experimental results on four benchmark datasets and one self-built dataset show that SCOP provides more specific results while using less pre-training data.
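
To make the idea of contrasting the sequence view with the structure view more concrete, here is a minimal sketch in plain Python. It assumes a symmetric InfoNCE-style objective, a common choice for aligning two views of the same object. The summary above does not specify SCOP's actual auxiliary objectives, so the function names (`info_nce`, `symmetric_contrastive_loss`), the temperature value, and the toy embeddings are all illustrative assumptions and not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting candidates[i] as the match for anchors[i].

    Assumption: row i of both lists comes from the same protein, so the
    matching pair is the positive and every other row is a negative.
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(seq_emb, struct_emb, temperature=0.1):
    """Average the sequence-to-structure and structure-to-sequence losses."""
    return 0.5 * (
        info_nce(seq_emb, struct_emb, temperature)
        + info_nce(struct_emb, seq_emb, temperature)
    )


# Toy embeddings (hypothetical): row i of each list belongs to protein i.
seq_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
struct_emb = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

aligned = symmetric_contrastive_loss(seq_emb, struct_emb)
# Rotating the structure rows breaks the pairing, so the loss should rise.
shuffled = symmetric_contrastive_loss(seq_emb, struct_emb[1:] + struct_emb[:1])
print(aligned < shuffled)
```

In this sketch no labels are needed: the pairing of a sequence with its own structure is the only supervision signal, which is the sense in which such pre-training can proceed without function annotations.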