Abstract: The advent of affordable high-throughput genome sequencing has drastically expanded protein sequence databases, necessitating computational tools that predict protein function from sequence data. Current methods, such as BLASTp and profile HMMs, are effective but limited by difficulty in detecting remote homologs and by uncertainties in multiple sequence alignments. To address this, we explore clustering algorithms for unsupervised protein function annotation, using pseudo-amino acid composition (PAAC) as features. We evaluated nine clustering algorithms for their ability to segregate protein sequences by functional differences using the PAAC feature. Using intrinsic metrics, particularly the silhouette coefficient (SC), we determined the optimal number of clusters for each algorithm. We observed that agglomerative clustering produced results resembling phylogenetic relationships; k-means clustering, Gaussian mixture models (GMM), and spectral clustering did so as well, but occasionally merged data points from distinct original clusters at higher cluster counts. Our findings reveal that k-means clustering, GMM, and agglomerative clustering effectively segregate distinct protein functional families, but their effectiveness decreases when distinguishing fine-grained functional differences. Notably, spectral clustering underperformed relative to the other methods. Affinity propagation clustering, while effective in some cases, generated more clusters than expected and is prone to false positives. Overall, we find that some of the clustering algorithms are suitable for functional annotation of protein sequences using PAAC as a feature set, even when the number of ground-truth sequences is limited. The implementation of the clustering method for protein sequences is available in the GitHub repository linked below. It provides comprehensive steps for preprocessing, feature extraction, clustering, and evaluation. All results are presented in a Jupyter Notebook.
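
The selection of a cluster count by silhouette coefficient, as described in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in their repository): the toy 2-D points stand in for PAAC feature vectors, the candidate labelings stand in for the output of a clustering algorithm, and the names `euclidean`, `silhouette_coefficient`, `points`, and `candidates` are hypothetical. The silhouette definition itself is the standard one, s = (b - a) / max(a, b), averaged over all points.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficient(points, labels):
    """Mean silhouette coefficient of a labeling (needs at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: a singleton scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumption): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical labelings for two candidate cluster counts.
candidates = {
    2: [0, 0, 0, 1, 1, 1],  # matches the two groups
    3: [0, 0, 1, 2, 2, 2],  # over-splits the first group
}

# Pick the cluster count whose labeling has the highest mean silhouette.
best_k = max(candidates, key=lambda k: silhouette_coefficient(points, candidates[k]))
print(best_k)
```

In practice the labelings would come from running each clustering algorithm at each candidate cluster count on the real feature matrix; the selection rule shown here stays the same.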

Clustering for Protein Representation Learning

How suitable are clustering methods for functional annotation of proteins?

Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering

Towards Automatic Clustering of Protein Sequences

A self-learning graph clustering approach for protein complexes detection

Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

Deep Multi-attribute Graph Representation Learning on Protein Structures

GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

Contrastive Self-Supervised Representation Learning for Protein Complexes Identification

A Novel Alignment-Free Vector Method to Cluster Protein Sequences

T-distributed Stochastic Neighbor Embedding for Co-Representation Learning

Contrastive Representation Learning for 3D Protein Structures

Multi-Scale Representation Learning on Proteins

Ensemble Clustering via Learning Representations from Auto-Encoder

Protein Complex Detection Via Weighted Ensemble Clustering Based on Bayesian Nonnegative Matrix Factorization

Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling

Protein2Vec: Aligning Multiple PPI Networks with Representation Learning

Directed Weight Neural Networks for Protein Structure Representation Learning

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

Intra-Inter Graph Representation Learning for Protein-Protein Binding Sites Prediction

Protein Representation Learning by Geometric Structure Pretraining