Simulation-derived best practices for clustering clinical data
Caitlin E. Coombes,Xin Liu,Zachary B. Abrams,Kevin R. Coombes,Guy Brock
DOI: https://doi.org/10.1016/j.jbi.2021.103788
IF: 8
2021-06-01
Journal of Biomedical Informatics
Abstract:
Introduction: Clustering analyses in clinical contexts hold promise for improving the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type clinical data.
Methods: We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms: hierarchical clustering (HC), k-medoids, and self-organizing maps (SOM). We validate results quantitatively and visually using the Adjusted Rand Index (ARI) and silhouette width (SW).
Results: HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for all mixed-data mixtures except unbalanced mixtures dominated by continuous data. In a real-data application, DAISY with HC uncovered superior, more separable clusters than the Hamming distance.
Discussion: Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and make maximum use of their data. Superior metrics for mixed-type data handle multiple data types by using multiple type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
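To make the pipeline described in the Methods concrete, the following is a minimal sketch, not the authors' code, of one path through it: a simplified Gower-style dissimilarity on mixed-type records, naive average-linkage hierarchical clustering, and the Adjusted Rand Index against known labels. The toy patient table, the function names (`gower_matrix`, `average_linkage`, `adjusted_rand_index`), and the choice of average linkage are all assumptions made for illustration. Binary and categorical variables are compared by simple matching, which is a simplification of what DAISY offers.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gower_matrix(rows, kinds):
    """Pairwise Gower-style dissimilarity for mixed-type rows.

    kinds[j] is "num" (continuous) or "cat" (binary/categorical).
    Numeric columns use range-scaled absolute difference; categorical
    columns use simple matching (0 if equal, 1 otherwise).
    """
    n, p = len(rows), len(kinds)
    ranges = [
        max(r[j] for r in rows) - min(r[j] for r in rows) if kinds[j] == "num" else None
        for j in range(p)
    ]
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j in range(p):
            if kinds[j] == "num":
                # A constant column contributes nothing to the distance.
                total += abs(rows[a][j] - rows[b][j]) / ranges[j] if ranges[j] else 0.0
            else:
                total += 0.0 if rows[a][j] == rows[b][j] else 1.0
        d[a][b] = d[b][a] = total / p
    return d


def average_linkage(d, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise dissimilarity until k clusters remain."""
    clusters = [[i] for i in range(len(d))]

    def mean_dist(pair):
        x, y = clusters[pair[0]], clusters[pair[1]]
        return sum(d[a][b] for a in x for b in y) / (len(x) * len(y))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2), key=mean_dist)
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from the contingency table."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: (age, lab value, sex, smoker), two known groups.
rows = [
    (34, 1.1, "F", "no"), (36, 1.3, "F", "no"), (31, 0.9, "F", "yes"),
    (68, 3.9, "M", "yes"), (72, 4.2, "M", "yes"), (65, 4.0, "M", "no"),
]
kinds = ["num", "num", "cat", "cat"]
truth = [0, 0, 0, 1, 1, 1]

dist = gower_matrix(rows, kinds)
labels = average_linkage(dist, k=2)
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

The two toy groups are deliberately well separated, so the recovered partition should agree with the known labels; on realistic clinical data, agreement would depend on the data mixture and the metric chosen, which is the question the simulations address. Silhouette width is omitted here to keep the sketch short.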
medical informatics,computer science, interdisciplinary applications