Abstract: Clustering is the task of organizing a set of objects into meaningful groups. These groups can be disjoint, overlapping, or organized in some hierarchical fashion. The key element of clustering is the notion that the discovered groups are meaningful. This definition is intentionally vague, as what constitutes meaningful is, to a large extent, application dependent. In some applications this may translate to groups in which the pairwise similarity between their objects is maximized and the pairwise similarity between objects of different groups is minimized. In other applications it may translate to groups whose objects share some key characteristics, even though their overall similarity is not the highest. Clustering is an exploratory tool for analyzing large datasets and has been used extensively in numerous application areas. In the life sciences it has been applied over the years to areas ranging from the analysis of clinical information to phylogeny, genomics, and proteomics. For example, clustering algorithms applied to gene expression data can be used to identify co-regulated genes and provide a genetic fingerprint for various diseases. Clustering algorithms applied to the entire database of known proteins can be used to automatically organize the proteins into closely and distantly related families, and to identify subsequences that are mostly preserved across proteins (52, 22, 55, 68, 49). Similarly, clustering algorithms applied to tertiary structure datasets can be used to perform a similar organization and provide insights into the rate of change between sequence and structure (20, 65).
The primary goal of this chapter is to provide an overview of the various issues involved in clustering large datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within the life sciences. Toward this end, the chapter is organized broadly into three parts. The first part (Sections 2-4) describes the various types of clustering algorithms developed over the years, the various methods for computing the similarity between objects arising in the life sciences, and methods for assessing the quality of the clusters. The second part (Section 5) focuses on the problem of clustering data arising from microarray experiments and describes some of the commonly used approaches. Finally, the third part (Section 6) provides a brief introduction to CLUTO, a general-purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within the life sciences.
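One of the notions of a meaningful grouping mentioned above, high pairwise similarity within groups and low pairwise similarity between groups, can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names, the choice of cosine similarity, and the toy expression-like profiles are hypothetical and are not taken from the chapter or from CLUTO.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    """Arithmetic mean; NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def within_between_similarity(objects, labels, sim=cosine):
    """Return (average within-cluster, average between-cluster) similarity.

    Every unordered pair of objects is scored once and assigned to the
    'within' pool if both objects carry the same cluster label, and to
    the 'between' pool otherwise.
    """
    within, between = [], []
    for i, j in combinations(range(len(objects)), 2):
        score = sim(objects[i], objects[j])
        if labels[i] == labels[j]:
            within.append(score)
        else:
            between.append(score)
    return mean(within), mean(between)


# Hypothetical toy profiles: two rising and two falling patterns.
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.2],
    [3.0, 2.0, 1.0],
    [6.1, 4.0, 2.2],
]

# A grouping that follows the patterns, and one that mixes them.
matching_labels = [0, 0, 1, 1]
mixed_labels = [0, 1, 0, 1]

w_match, b_match = within_between_similarity(profiles, matching_labels)
w_mixed, b_mixed = within_between_similarity(profiles, mixed_labels)
print("matching grouping:", w_match, b_match)
print("mixed grouping:   ", w_mixed, b_mixed)
```

Under this criterion the grouping that follows the profile shapes scores a higher within-cluster than between-cluster similarity, while the mixed grouping does not. As the abstract notes, this is only one possible reading of "meaningful"; other applications may favor groups defined by shared key characteristics instead.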
