Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

Eunji Ko, Seul Lee, Minseon Kim, Dongki Kim

2024-04-29

Abstract: The goal of protein representation learning is to extract knowledge from protein databases that can be applied to various protein-related downstream tasks. Although protein sequence, structure, and function are the three key modalities for a comprehensive understanding of proteins, existing methods for protein representation learning have utilized only one or two of these modalities due to the difficulty of capturing the asymmetric interrelationships between them. To account for this asymmetry, we introduce our novel asymmetric multi-modal masked autoencoder (AMMA). AMMA adopts (1) a unified multi-modal encoder to integrate all three modalities into a unified representation space and (2) asymmetric decoders to ensure that sequence latent features reflect structural and functional information. The experiments demonstrate that the proposed AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships, which in turn makes it effective for various downstream protein-related tasks.

Biomolecules, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses a core challenge in multi-modal protein representation learning: how to effectively integrate the three key protein modalities of sequence, structure, and function. Although all three are crucial for a comprehensive understanding of proteins, existing methods typically use only one or two of them, because the asymmetric relationships among the modalities are difficult to capture. The paper therefore proposes the Asymmetric Multi-Modal Masked Autoencoder (AMMA), which aims to capture these asymmetric relationships and produce comprehensive multi-modal protein representations.

AMMA has two main components:

1. **Unified multi-modal encoder**: integrates the sequence, structure, and function information of a protein and maps it into a unified representation space.
2. **Asymmetric decoders**: ensure that the sequence latent features reflect structural and functional information.

With this design, AMMA is intended to represent the multi-modal characteristics of proteins more accurately and so to perform well on a range of downstream tasks. According to the reported experiments, AMMA outperforms existing state-of-the-art methods on multiple tasks and makes particularly good use of unpaired data.
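The data flow described above can be pictured with a toy sketch. It is not the authors' implementation and contains no neural network: it only shows how tokens from the available modalities might be masked, tagged, and concatenated into one encoder input, and how reconstruction targets could be routed. The routing table `DECODER_INPUT` (every decoder reads the sequence latents), the token vocabularies, the masking ratio, and all function names are assumptions made for illustration.

```python
import random

MODALITIES = ("sequence", "structure", "function")
MASK = "<mask>"

# Assumption for illustration: every decoder reads the sequence latents,
# so structure and function must be recoverable from sequence features.
DECODER_INPUT = {"sequence": "sequence", "structure": "sequence", "function": "sequence"}


def mask_tokens(tokens, ratio, rng):
    """Mask a random subset of tokens; return the masked copy and hidden positions."""
    n_mask = max(1, round(len(tokens) * ratio))
    hidden = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    return masked, hidden


def build_training_example(protein, ratio=0.5, seed=0):
    """Build one unified encoder input plus per-modality reconstruction targets."""
    rng = random.Random(seed)
    encoder_input, targets = [], {}
    for modality in MODALITIES:
        tokens = protein.get(modality)
        if tokens is None:  # unpaired data: this modality is simply absent
            continue
        masked, hidden = mask_tokens(tokens, ratio, rng)
        # Tag each token with its modality so a single encoder can consume all of them.
        encoder_input.extend((modality, tok) for tok in masked)
        targets[modality] = {i: tokens[i] for i in hidden}
    return encoder_input, targets


def decoder_plan(targets):
    """List (latent source, reconstructed modality) pairs under the assumed routing."""
    return [(DECODER_INPUT[m], m) for m in targets if DECODER_INPUT[m] in targets]


# Hypothetical toy protein: residues, per-residue structure tokens, function labels.
protein = {
    "sequence": list("MKTAYIAK"),
    "structure": ["H", "H", "H", "E", "E", "C", "C", "C"],
    "function": ["GO:0003677", "GO:0005524"],
}
encoder_input, targets = build_training_example(protein)
plan = decoder_plan(targets)
```

Under these assumptions, `plan` pairs the sequence latents with all three reconstruction targets, which is one way to read the asymmetry: structure and function are predicted from sequence features, not the other way around. A protein with only a sequence entry passes through the same functions unchanged, which hints at how unpaired data could be used.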

Learning Complete Protein Representation by Deep Coupling of Sequence and Structure

Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling

Multimodal pretraining for unsupervised protein representation learning

Learning protein sequence embeddings using information from structure

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

A Survey on Protein Representation Learning: Retrospect and Prospect

Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

ProteinMAE: Masked Autoencoder for Protein Surface Self-supervised Learning

Learning the Language of Protein Structure

Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?

Contrastive Representation Learning for 3D Protein Structures

Clustering for Protein Representation Learning

Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Unified rational protein engineering with sequence-only deep representation learning

Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting

Retrieved Sequence Augmentation for Protein Representation Learning

Unified rational protein engineering with sequence-based deep representation learning

Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Protein Representation Learning by Geometric Structure Pretraining