PEST: A General-Purpose Protein Embedding Model for Homology Search

Yongchang Liu, Peiying Li, Shikui Tu, Lei Xu
DOI: https://doi.org/10.1109/bibm58861.2023.10385721
2023-01-01
Abstract: Finding known homologs of newly predicted proteins is essential for understanding their functions and mechanisms. It is a highly complex task because proteins undergo various changes during evolution. Traditional methods based on sequence or structure alignment are either inaccurate or slow. Recent deep learning-based methods focus primarily on structural information, yet they cannot fully exploit the information contained in proteins. To address this problem, we propose a novel general-purpose protein embedding model that can be used for homology search. It first employs a pre-trained protein language model to extract protein sequence embeddings, capturing intricate biological patterns. A Transformer that integrates protein structural information then generates high-level representations. By combining sequence and structural features, the model can effectively exploit the rich contextual and spatial information inherent in proteins. We applied the model to the SCOP dataset for protein superfamily classification, achieving a classification accuracy of 86.97% and outperforming the state-of-the-art method by 7.91%. The source code has been published on GitHub (https://github.com/CMACH508/PEST).
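To illustrate how fixed-length protein embeddings can support homology search, the following is a minimal sketch that ranks database proteins by their similarity to a query embedding. It is not the authors' implementation: the abstract does not say which similarity measure PEST uses, so cosine similarity is an assumption here, and the function names, protein names, and three-dimensional toy vectors are all hypothetical stand-ins for real model outputs.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_homologs(
    query: List[float],
    database: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k database entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # highest similarity first
    return scored[:top_k]


# Toy embeddings (assumed values; a real model would produce much longer vectors).
database = {
    "protein_A": [1.0, 0.0, 0.0],
    "protein_B": [0.9, 0.1, 0.0],
    "protein_C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

hits = search_homologs(query, database, top_k=2)
for name, score in hits:
    print(f"{name}\t{score:.4f}")
```

Because each protein is reduced to one vector, a search of this kind compares the query against precomputed database embeddings rather than running a pairwise alignment for every candidate.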