Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

Eunji Ko, Seul Lee, Minseon Kim, Dongki Kim
2024-04-29
Abstract: The goal of protein representation learning is to extract knowledge from protein databases that can be applied to various protein-related downstream tasks. Although protein sequence, structure, and function are the three key modalities for a comprehensive understanding of proteins, existing methods for protein representation learning have utilized only one or two of these modalities due to the difficulty of capturing the asymmetric interrelationships between them. To account for this asymmetry, we introduce our novel asymmetric multi-modal masked autoencoder (AMMA). AMMA adopts (1) a unified multi-modal encoder to integrate all three modalities into a unified representation space and (2) asymmetric decoders to ensure that sequence latent features reflect structural and functional information. The experiments demonstrate that the proposed AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships, which in turn makes it effective for various downstream protein-related tasks.
Biomolecules, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses a core challenge in multi-modal protein representation learning: how to effectively integrate the three key protein modalities of sequence, structure, and function. Although all three are crucial for a comprehensive understanding of proteins, existing methods use only one or two of them, because the asymmetric relationships among the modalities are difficult to capture. The paper therefore proposes the Asymmetric Multi-Modal Masked Autoencoder (AMMA), which aims to capture these asymmetric relationships through a unified multi-modal encoder and asymmetric decoders, producing comprehensive multi-modal protein representations. Specifically, AMMA addresses the problem in two ways:

1. **Unified multi-modal encoder**: integrates the sequence, structure, and function information of a protein and maps it into a unified representation space.
2. **Asymmetric decoders**: ensure that the sequence latent features reflect structural and functional information.

In this way, AMMA can capture and represent the multi-modal characteristics of proteins more accurately, which supports good performance on various downstream tasks. The experimental results show that AMMA outperforms existing state-of-the-art methods on multiple tasks, particularly by making good use of unpaired data, demonstrating its potential in protein-related research.
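The two components can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the class name `ToyAsymmetricMAE`, the single attention layer, the linear decoders, the feature sizes, the 50% masking ratio, and the assumption that every modality has one token per residue are all hypothetical choices made for illustration. The sketch encodes the masked tokens of all three modalities jointly, then reconstructs each modality from the sequence latents only, which is one possible reading of "asymmetric decoders".

```python
import numpy as np

def random_mask(n_tokens, ratio, rng):
    """Boolean mask with True at the positions hidden from the encoder."""
    n_masked = int(round(n_tokens * ratio))
    mask = np.zeros(n_tokens, dtype=bool)
    mask[rng.choice(n_tokens, size=n_masked, replace=False)] = True
    return mask

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

class ToyAsymmetricMAE:
    """Hypothetical toy model; layer choices are assumptions, not AMMA's."""
    MODALITIES = ("seq", "struct", "func")

    def __init__(self, dims, d_latent, rng):
        def init(shape):
            return rng.normal(scale=0.1, size=shape)
        # Per-modality embeddings into one shared latent space.
        self.embed = {m: init((dims[m], d_latent)) for m in self.MODALITIES}
        # One self-attention layer shared by all modalities (unified encoder).
        self.wq = init((d_latent, d_latent))
        self.wk = init((d_latent, d_latent))
        self.wv = init((d_latent, d_latent))
        # One linear decoder head per modality.
        self.decode_w = {m: init((d_latent, dims[m])) for m in self.MODALITIES}

    def encode(self, inputs, masks):
        tokens = []
        for m in self.MODALITIES:
            # Zero out masked tokens before embedding them.
            visible = np.where(masks[m][:, None], 0.0, inputs[m])
            tokens.append(visible @ self.embed[m])
        h = np.concatenate(tokens, axis=0)  # all modalities in one token set
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        attn = softmax(q @ k.T / np.sqrt(h.shape[1]))
        z = h + attn @ v  # residual attention mixes information across modalities
        n_seq = inputs["seq"].shape[0]
        return z[:n_seq]  # keep only the sequence-token latents

    def decode(self, z_seq):
        # Asymmetric step: every head reads the sequence latents only.
        return {m: z_seq @ self.decode_w[m] for m in self.MODALITIES}

rng = np.random.default_rng(0)
L = 8  # toy protein length; one token per residue in every modality (assumption)
dims = {"seq": 20, "struct": 3, "func": 5}  # hypothetical feature sizes
inputs = {m: rng.normal(size=(L, d)) for m, d in dims.items()}
masks = {m: random_mask(L, 0.5, rng) for m in dims}

model = ToyAsymmetricMAE(dims, d_latent=16, rng=rng)
z_seq = model.encode(inputs, masks)
recon = model.decode(z_seq)
losses = {m: masked_mse(recon[m], inputs[m], masks[m]) for m in dims}
total_loss = sum(losses.values())
```

Because the structure and function heads can only read `z_seq`, minimizing `total_loss` in a trained version of this toy would push structural and functional information into the sequence latents, which is the property the asymmetric decoders are described as enforcing.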