How suitable are clustering methods for functional annotation of proteins?

Rakesh Busi, Pranav Machingal, Nandyala Hemachandra, Petety V. Balaji
DOI: https://doi.org/10.1101/2024.12.26.630370
2024-12-26
Abstract: The advent of affordable high-throughput genome sequencing has drastically expanded protein sequence databases, necessitating the development of computational tools to predict protein function from sequence data. Current methods, such as BLASTp and profile HMMs, while effective, are limited by difficulties in detecting remote homologs and by uncertainties in multiple sequence alignments. To address these limitations, we explore the use of clustering algorithms for unsupervised protein function annotation, using pseudo-amino acid composition (PAAC) as features. In this study, we evaluated nine clustering algorithms for their ability to segregate protein sequences based on functional differences using the PAAC feature. Using intrinsic metrics, particularly the silhouette coefficient (SC), we determined the optimal number of clusters (k) for each algorithm. We observed that agglomerative clustering produced results resembling phylogenetic relationships; k-means clustering, the Gaussian mixture model (GMM), and spectral clustering do so as well, but occasionally merge data points from distinct original clusters at higher k values. Our findings reveal that k-means clustering, GMM, and agglomerative clustering effectively segregate distinct protein functional families, but their effectiveness decreases when distinguishing fine-grained functional differences. Notably, spectral clustering underperformed relative to the other methods. Affinity propagation clustering, while effective in some cases, generated more clusters than expected and is prone to false positives. Overall, we find that some of the clustering algorithms are suitable for functional annotation of protein sequences using PAAC as a feature set, even when the number of ground-truth sequences is limited. The implementation of the clustering method for protein sequences is available in the GitHub repository linked below. It provides comprehensive steps for preprocessing, feature extraction, clustering, and evaluation.
All results are presented in a Jupyter Notebook.
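
The abstract's selection of the number of clusters by silhouette coefficient can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes feature vectors (such as PAAC) have already been computed, uses a plain-Python k-means with deterministic farthest-point seeding as a stand-in for the nine algorithms, and replaces real protein features with toy 2D points. All function names (`farthest_point_init`, `kmeans`, `silhouette_coefficient`, `select_k`) are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return [list(c) for c in centers]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def select_k(points, candidate_ks):
    """Score each candidate k and return the one with the highest silhouette."""
    scores = {k: silhouette_coefficient(points, kmeans(points, k)) for k in candidate_ks}
    return max(scores, key=scores.get), scores


# Toy stand-in for PAAC vectors: three well-separated groups of four 2D points.
TOY_POINTS = [
    (x + dx, y + dy)
    for x, y in [(0, 0), (10, 0), (0, 10)]
    for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]
]
best_k, scores = select_k(TOY_POINTS, range(2, 6))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On this toy data the three groups are far apart relative to their size, so the silhouette coefficient should peak at k = 3; with real PAAC vectors the peak is typically much less sharp, which is consistent with the abstract's remark about fine-grained functional differences.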
Bioinformatics