Selecting Appropriate Clustering Methods for Materials Science Applications of Machine Learning

Amanda J. Parker, Amanda S. Barnard
DOI: https://doi.org/10.1002/adts.201900145
2019-10-09
Advanced Theory and Simulations
Abstract: Based on a general definition of a cluster and of the quality of a clustering result, a new method is presented for evaluating existing clustering algorithms, or for undertaking clustering directly. The method is capable of predicting the number and type of clusters and outliers present in a data set, regardless of the complexity of the distribution of points. This algorithm, referred to as iterative label spreading, can recognize the characteristics expected of a successful clustering result before any clustering algorithm is applied, providing a type of hyper‐parameter optimization for clustering. The efficacy of the algorithm and the assessment of clustering results are both confirmed using large benchmark two‐dimensional synthetic data sets and small multidimensional data describing a set of silver nanoparticles. It is shown that the method is ideal for studying noisy data with high dimensionality and high variance, typical of data captured in materials and nanoscience. A new clustering method is developed that is ideally suited to small data sets with high dimensionality, as commonly found in materials informatics. The method, iterative label spreading, outperforms popular methods such as k‐Means, Ward agglomerative clustering, and density‐based spatial clustering of applications with noise (DBSCAN), and is used to identify clusters in a diverse set of 425 silver nanoparticles.
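For orientation, the three baseline methods named in the abstract can be run side by side with scikit-learn. The sketch below is not the paper's iterative label spreading algorithm and does not use the silver nanoparticle data. It assumes a synthetic stand-in of 425 points with 10 features drawn from 3 Gaussian blobs. The feature count, number of groups, and all hyper-parameters (`n_clusters=3`, `eps=1.5`, `min_samples=5`) are illustrative assumptions. It shows that each baseline needs the cluster count or a density threshold supplied in advance, which is the hyper-parameter choice the abstract says the new method helps inform.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 425 samples, 10 features, 3 groups.
# This is NOT the nanoparticle data set described in the abstract.
X, y_true = make_blobs(n_samples=425, n_features=10, centers=3,
                       cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)  # put all features on a common scale

# Baselines named in the abstract; every hyper-parameter here is a guess
# that the user must supply before clustering.
models = {
    "k-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Ward": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with the label -1; do not count it as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    ari = adjusted_rand_score(y_true, labels)  # agreement with the known groups
    results[name] = (n_clusters, ari)
    print(f"{name}: {n_clusters} clusters, adjusted Rand index = {ari:.3f}")
```

On real materials data there is no `y_true` to score against, so the adjusted Rand index used here is only available because the stand-in data is synthetic.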
multidisciplinary sciences