Abstract: The advent of affordable high-throughput genome sequencing has drastically expanded protein sequence databases, necessitating computational tools that predict protein function from sequence data. Current methods such as BLASTp and profile HMMs, while effective, are limited by difficulty in detecting remote homologs and by uncertainties in multiple sequence alignments. To address this, we explore clustering algorithms for unsupervised protein function annotation, using pseudo-amino acid composition (PAAC) as the feature set. We evaluated nine clustering algorithms for their ability to segregate protein sequences by functional differences. Using intrinsic metrics, particularly the silhouette coefficient (SC), we determined the optimal number of clusters (k) for each algorithm. Agglomerative clustering produced results resembling phylogenetic relationships; k-means clustering, the Gaussian mixture model (GMM), and spectral clustering did so as well, but occasionally merged data points from distinct original clusters at higher values of k. Our findings reveal that k-means clustering, GMM, and agglomerative clustering effectively segregate distinct protein functional families, but their effectiveness decreases when distinguishing fine-grained functional differences. Notably, spectral clustering underperformed relative to the other methods. Affinity propagation clustering, while effective in some cases, generated more clusters than expected and is prone to false positives. Overall, we find that some clustering algorithms are suitable for functional annotation of protein sequences using PAAC as a feature set, even when the number of ground-truth sequences is limited. The implementation of the clustering method for protein sequences is available in the GitHub repository linked below; it provides comprehensive steps for preprocessing, feature extraction, clustering, and evaluation. All results are presented in a Jupyter Notebook.
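
The model-selection step described in the abstract, choosing the number of clusters by maximising the silhouette coefficient, can be sketched as follows. This is a minimal standard-library illustration and not the authors' implementation: the toy 2-D vectors merely stand in for PAAC feature vectors, and the helper names (`euclidean`, `kmeans`, `silhouette_coefficient`) and the farthest-point initialisation are assumptions made for this example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(euclidean(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points; s(i) = 0 for singletons."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy stand-ins for PAAC feature vectors: three well-separated groups.
features = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

# Score each candidate k and keep the one with the highest silhouette coefficient.
scores = {k: silhouette_coefficient(features, kmeans(features, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On real data the feature vectors would have one dimension per PAAC component, and a library implementation of each clustering algorithm and of the silhouette coefficient would normally replace these hand-written helpers.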

Parameterized Algorithms for Clustering PPI Networks

Fast Algorithms for Detecting Overlapping Functional Modules in Protein-Protein Interaction Networks

A Fast Hierarchical Clustering Algorithm for Functional Modules Discovery in Protein Interaction Networks

Algorithms Based on Density and Shared Neighbors for Functional Modules Identification in PPI Networks

PFC: An Efficient Soft Graph Clustering Method for PPI Networks Based on Purifying and Filtering the Coupling Matrix

Progress on Graph-Based Clustering Methods for the Analysis of Protein-Protein Interaction Networks

Parameterized Algorithms for Partitioning Graphs into Highly Connected Clusters

Hierarchical Organization of Functional Modules in Weighted Protein Interaction Networks Using Clustering Coefficient

Entropy-Based Graph Clustering of PPI Networks for Predicting Overlapping Functional Modules of Proteins

PPI-GA: A Novel Clustering Algorithm to Identify Protein Complexes within Protein-Protein Interaction Networks Using Genetic Algorithm

Protein Function Prediction by Spectral Clustering of Protein Interaction Network

A Degree-Distribution Based Hierarchical Agglomerative Clustering Algorithm for Protein Complexes Identification

Clustering PPI networks based on improved spectral clustering method

A Fast Iterative-Clique Percolation Method for Identifying Functional Modules in Protein Interaction Networks

An Effective Link-Based Clustering Algorithm for Detecting Overlapping Protein Complexes in Protein-Protein Interaction Networks

Recent Advances in Clustering Methods for Protein Interaction Networks

Identifying Protein Complexes in Protein-Protein Interaction Networks by Using Clique Seeds and Graph Entropy

Spectral Clustering for Detecting Protein Complexes in PPI Networks

A MapReduce-Based Parallel Clustering Algorithm for Large Protein-Protein Interaction Networks

A Hybrid Clustering Algorithm for Identifying Modules in Protein-Protein Interaction Networks

How suitable are clustering methods for functional annotation of proteins?