Contrastive Representation Learning for 3D Protein Structures

Pedro Hermosilla, Timo Ropinski

2022-05-31

Abstract: Learning from 3D protein structures has gained wide interest in protein modeling and structural bioinformatics. Unfortunately, the number of available structures is orders of magnitude lower than the training data sizes commonly used in computer vision and machine learning. Moreover, this number is reduced even further when only annotated protein structures can be considered, making the training of existing models difficult and prone to over-fitting. To address this challenge, we introduce a new representation learning framework for 3D protein structures. Our framework uses unsupervised contrastive learning to learn meaningful representations of protein structures, making use of proteins from the Protein Data Bank. We show how these representations can be used to solve a large variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction. Moreover, we show how fine-tuned networks, pre-trained with our algorithm, lead to significantly improved task performance, achieving new state-of-the-art results in many tasks.

Biomolecules, Machine Learning

What problem does this paper attempt to address?

The paper addresses a data-scarcity problem in protein modeling and structural bioinformatics: the number of available 3D protein structures is orders of magnitude smaller than the datasets commonly used in computer vision and machine learning, which makes existing models hard to train and prone to overfitting. Far fewer 3D structures are available than protein sequences, and the number shrinks further when only structures annotated with specific labels are considered. To tackle this challenge, the authors introduce a representation learning framework for 3D protein structures based on unsupervised contrastive learning. The framework learns meaningful representations from proteins in the Protein Data Bank (PDB), and the authors show how these representations can be used for a variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction. Networks pre-trained with this method and then fine-tuned on specific tasks achieve new state-of-the-art results on many of them.
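To make the idea of contrastive pre-training more concrete, the following is a minimal sketch of an InfoNCE-style loss, a common choice for contrastive learning. This summary does not state which loss the authors use, so the loss, the function name `info_nce_loss`, and the `temperature` parameter are all assumptions for illustration. The sketch takes two lists of embedding vectors, where entry i in each list is assumed to come from a different view of the same protein, and treats all other pairings in the batch as negatives. How the views are produced and how the encoder works are left out.

```python
import math
import random


def info_nce_loss(views_a, views_b, temperature=0.1):
    """InfoNCE-style loss over paired embeddings (illustrative, not the paper's code).

    views_a[i] and views_b[i] are assumed to embed two views of the same
    protein; every other pairing in the batch serves as a negative.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in views_a]
    b = [normalize(v) for v in views_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching view.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy comparison with random vectors standing in for encoder outputs.
rng = random.Random(0)
anchors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[x + 0.01 * rng.gauss(0, 1) for x in v] for v in anchors]
unrelated = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce_loss(anchors, aligned)
loss_unrelated = info_nce_loss(anchors, unrelated)
print(loss_aligned < loss_unrelated)
```

In this toy setup the loss is lower when the paired embeddings agree than when they are unrelated, which is the signal a contrastive objective uses to pull views of the same structure together without any labels.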

Learning Complete Protein Representation by Deep Coupling of Sequence and Structure

CCPL: Cross-modal Contrastive Protein Learning

Protein Representation Learning by Geometric Structure Pretraining

Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures

Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs

Learning the Language of Protein Structure

Data-Efficient Protein 3D Geometric Pretraining via Refinement of Diffused Protein Structure Decoy

Learning protein sequence embeddings using information from structure

Multimodal Protein-Ligand Contrastive Pretraining for Effective and Efficient Drug Discovery

Evaluating representation learning on the protein structure universe

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Multimodal pretraining for unsupervised protein representation learning

Directed Weight Neural Networks for Protein Structure Representation Learning

Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation