Abstract:

BACKGROUND: Many clinical concepts are standardized under categorical, hierarchical taxonomies such as ICD-10 and ATC. These taxonomic clinical concepts provide insight into the semantic meaning of, and similarity among, clinical concepts and have been applied to patient similarity measures. However, the effect of differing set sizes of taxonomic clinical concepts on patient-level similarity has not been well studied.

METHODS: In this paper, ICD-10, the most widely used taxonomic clinical concept system, was studied as a representative taxonomy. The distance between ICD-10-coded diagnosis sets is an integrated estimate of the information content of each concept, the similarity between each pair of concepts, and the similarity between the sets of concepts. We proposed a novel set-level similarity method to calculate the distance between sets of hierarchical taxonomic clinical concepts to measure patient similarity. A real-world clinical dataset with ICD-10-coded diagnoses and hospital length of stay (HLOS) information was used to evaluate the performance of various algorithms and their combinations in predicting whether a patient needs long-term hospitalization. Four subpopulation prototypes, defined by age and HLOS and having different diagnosis set sizes, were used as the targets for similarity analysis. The F-score was used to evaluate the performance of the different algorithms while controlling for other factors. We also evaluated the effect of prototype set size on prediction precision.

RESULTS: The results identified the strengths and weaknesses of different algorithms for computing information content, code-level similarity, and set-level similarity under different contexts, such as set size and concept set background. The minimum weighted bipartite matching approach, which has not been fully recognized previously, showed unique advantages in measuring concept-based patient similarity.

CONCLUSIONS: This study provides a systematic benchmark evaluation of previous and novel algorithms used in taxonomic concept-based patient similarity, and it provides a basis for selecting appropriate methods under different clinical scenarios.
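
To make the set-level idea concrete, the following is a minimal sketch of a minimum weighted bipartite matching distance between two sets of ICD-10 codes. It is an illustration under stated assumptions, not the paper's implementation: the code-level distance `code_distance` is a hypothetical shared-prefix measure standing in for the information-content-based measures the abstract refers to, and the `unmatched_penalty` parameter and the normalization by the larger set size are likewise assumptions. The matching is found by brute force, which is only practical for small sets.

```python
from itertools import permutations


def code_distance(a: str, b: str) -> float:
    """Toy code-level distance (hypothetical): 1 minus the shared-prefix fraction.

    Codes deeper in the same ICD-10 branch share a longer prefix,
    so they end up closer together.
    """
    a, b = a.replace(".", ""), b.replace(".", "")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return 1.0 - shared / max(len(a), len(b))


def matching_distance(set_a, set_b, unmatched_penalty: float = 1.0) -> float:
    """Set-level distance via minimum weighted bipartite matching.

    Each code in the smaller set is paired with a distinct code in the
    larger set so that the summed code-level distance is minimal.
    Codes left unmatched in the larger set each add `unmatched_penalty`
    (an assumption of this sketch). The total is divided by the larger
    set size so the result lies in [0, 1] when the penalty is 1.0.
    """
    small, large = sorted([list(set_a), list(set_b)], key=len)
    if not large:
        return 0.0  # two empty sets are treated as identical
    # Brute-force search over all injective assignments small -> large.
    best = min(
        sum(code_distance(s, l) for s, l in zip(small, assignment))
        for assignment in permutations(large, len(small))
    )
    best += unmatched_penalty * (len(large) - len(small))
    return best / len(large)


if __name__ == "__main__":
    patient = {"I10", "E11.9"}
    prototype = {"I10", "E11.6", "J45.0"}
    print(matching_distance(patient, prototype))
```

The brute-force search grows factorially with set size. For realistic diagnosis sets, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the pairwise distance matrix would be the more usual choice. How unmatched codes are penalized directly controls how strongly set size affects the distance, which is the effect the study examines.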

Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering

Comparison of internal evaluation criteria in hierarchical clustering of categorical data

Using the Distance Between Sets of Hierarchical Taxonomic Clinical Concepts to Measure Patient Similarity

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data

Exploring Hierarchical Classification Performance for Time Series Data: Dissimilarity Measures and Classifier Comparisons

A heuristic hierarchical clustering based on multiple similarity measurements

A New Metrics For Hierarchical Clustering

A Revenue Function for Comparison-Based Hierarchical Clustering

A method for k-means-like clustering of categorical data

Measuring Similarity Of Chinese Web Databases Based On Category Hierarchy

Hierarchical Clustering: Objective Functions and Algorithms

Comparison of Distance metrics for hierarchical data in medical databases

Effective hierarchical clustering based on structural similarities in nearest neighbor graphs

Agglomerative Clustering in Uniform and Proportional Feature Spaces

A Kind of Similarity Degree for Non-Precise Data with Application to Clustering Analysis

Measuring Similarity Based on Link Information: A Comparative Study

A principled methodology for comparing relatedness measures for clustering publications

Cluster Merging and Splitting in Hierarchical Clustering Algorithms

A Comparative Study on Two Large-Scale Hierarchical Text Classification Tasks' Solutions

EDMD: An Entropy based Dissimilarity measure to cluster Mixed-categorical Data