Abstract: Protein fold recognition is a critical step toward protein structure and function prediction; it aims to identify the most likely fold type of a query protein. In recent years, the development of deep learning (DL) techniques has led to major advances in this field, and the sensitivity of protein fold recognition has improved dramatically. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect and inefficient, and it rests on the assumption that the bottleneck layer yields a good representation of proteins with new fold types. To address this problem, we develop a new computational framework that combines a triplet network with ensemble DL. We first train a DL-based model, termed FoldNet, which uses triplet loss to train a deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, so that proteins with the same fold type lie closer to each other in the new embedding space than proteins with different fold types. Subsequently, using the trained FoldNet, we implement a new residue–residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines the protein fold embedding with two other discriminative fold-specific features extracted by the DL-based methods SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and that their combination can further improve recognition at the fold level.
The FoldTR web server and the benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.
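
To make the triplet objective described above concrete, the following is a minimal sketch of a common hinge-style triplet loss on embedding vectors. It is not the FoldNet implementation: the abstract does not state the distance function, the margin, or how triplets are sampled, so the Euclidean distance, the `margin=1.0` default, and the names `euclidean` and `triplet_loss` are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    anchor and positive are embeddings of proteins with the same fold type;
    negative is an embedding of a protein with a different fold type.
    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# Same-fold protein is already much closer: 1 - 5 + 1 < 0, so no penalty.
loss_satisfied = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the constraint: 5 - 1 + 1 = 5.
loss_violated = triplet_loss(anchor, negative, positive)

print(loss_satisfied, loss_violated)
```

Minimizing a loss of this kind over many triplets pushes same-fold embeddings together and different-fold embeddings apart, which is the property the abstract attributes to the FoldNet embedding space.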

FoldToken2: Learning compact, invariant and generative protein structure language

FoldToken4: Consistent & Hierarchical Fold Language

FoldToken3: Fold Structures Worth 256 Words or Less

FoldToken: Learning Protein Language via Vector Quantization and Beyond

Tokenizing Foldable Protein Structures with Machine-Learned Artificial Amino-Acid Vocabulary

PiFold: Toward effective and efficient protein inverse folding

ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures

OPUS-Fold3: a Gradient-Based Protein All-Atom Folding and Docking Framework on TensorFlow

Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

EigenFold: Generative Protein Structure Prediction with Diffusion Models

Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling

Improving protein fold recognition using triplet network and ensemble deep learning

Balancing Locality and Reconstruction in Protein Structure Tokenizer

Learning inverse folding from millions of predicted structures

Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure

FoldExplorer: Fast and Accurate Protein Structure Search with Sequence-Enhanced Graph Embedding

Easy and accurate protein structure prediction using ColabFold

Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold

Learning the Language of Protein Structure

Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates
