Abstract:Molecular dynamics simulation methods produce trajectories of atomic positions (and optionally velocities and energies) as a function of time and provide a representation of the sampling of a given molecule's energetically accessible conformational ensemble. As simulations on the 10-100 ns time scale become routine, with sampled configurations stored on the picosecond time scale, such trajectories contain large amounts of data. Data-mining techniques, like clustering, provide one means to group and make sense of the information in the trajectory. In this work, several clustering algorithms were implemented, compared, and utilized to understand MD trajectory data. The development of the algorithms into a freely available C code library, and their application to a simple test example of random (or systematically placed) points in a 2D plane (where the pairwise metric is the distance between points) provide a means to understand the relative performance. Eleven different clustering algorithms were developed, ranging from top-down splitting (hierarchical) and bottom-up aggregating (including single-linkage edge joining, centroid-linkage, average-linkage, complete-linkage, centripetal, and centripetal-complete) to various refinement (means, Bayesian, and self-organizing maps) and tree (COBWEB) algorithms. Systematic testing in the context of MD simulation of various DNA systems (including DNA single strands and the interaction of a minor groove binding drug DB226 with a DNA hairpin) allows a more direct assessment of the relative merits of the distinct clustering algorithms. Additionally, means to assess the relative performance and differences between the algorithms, to dynamically select the initial cluster count, and to achieve faster data mining by "sieved clustering" were evaluated. Overall, it was found that there is no one perfect "one size fits all" algorithm for clustering MD trajectories and that the results strongly depend on the choice of atoms for the pairwise comparison. Some algorithms tend to produce homogeneously sized clusters, whereas others have a tendency to produce singleton clusters. Issues related to the choice of a pairwise metric, clustering metrics, which atom selection is used for the comparison, and about the relative performance are discussed. Overall, the best performance was observed with the average-linkage, means, and SOM algorithms. If the cluster count is not known in advance, the hierarchical or average-linkage clustering algorithms are recommended. Although these algorithms perform well, it is important to be aware of the limitations or weaknesses of each algorithm, specifically the high sensitivity to outliers with hierarchical, the tendency to generate homogenously sized clusters with means, and the tendency to produce small or singleton clusters with average-linkage.

Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets

A Novel Kernel Possibitistic Fuzzy C-Means Clustering Algorithm For Large Scale Data Sets

A Fast Algorithm for Density-Based Clustering in Large Database

Efficient clustering of large molecular libraries

Clustering a database of optically absorbing organic molecules via a hierarchical fingerprint scheme that categorizes similar functional molecular fragments

Fast conformational clustering of extensive molecular dynamics simulation data

Data Clustering Algorithm for DNA Microarray Based on Graph Theory

Multiphase and Multimode Monitoring of Batch Processes Based on Density Peak Clustering and Just-in-time Learning

Efficient Clustering with Limited Distance Information

Clustering Molecules at a Large Scale: Integrating Spectral Geometry with Deep Learning

Efficient clustering in data mining applications based on harmony search and k-medoids

Fast Clustering Using Adaptive Density Peak Detection

A Fast Parallel Clustering Algorithm For Molecular Simulation Trajectories

A Comparative Study of Clustering Methods for Molecular Data

A Novel Method for Clustering Cellular Data to Improve Classification

A robust unsupervised clustering approach for high-dimensional biological imaging data reveals shared drug-induced morphological signatures

Parallel Massive Clustering of Discrete Distributions

On the Best Way to Cluster NCI-60 Molecules

Clustering Molecular Dynamics Trajectories: 1. Characterizing the Performance of Different Clustering Algorithms

Clustering Large Datasets by Merging K-Means Solutions

Density Peaks Clustering Based on Jaccard Similarity and Label Propagation