SIMpat: A Synthetic Benchmark for Similarity Metrics on Patient Representations

Jean-Virgile Voegeli, Mina Bjelogrlic, Christophe Gaudet-Blavignac, Richard Dubos, Myriam Zimmermann, Adel Bensahla Talet, Yuanyuan Zheng, Julien Ehrsam, Christian Lovis
DOI: https://doi.org/10.3233/SHTI240739
2024-08-22
Abstract: Similarity and clustering tasks based on patient-level data extracted from electronic health records suffer from the curse of dimensionality and a lack of inter-patient data comparability. Indeed, in many health institutions there are far more variables, and ways of expressing those variables, to represent patients than there are patients sharing the same set of data. To lower redundancy and increase interoperability, one strategy is to map data to semantic-driven representations through medical knowledge graphs such as SNOMED-CT. However, patient similarity metrics based on this knowledge-graph information lack quantitative evaluation and comparison with purely data-driven methods. The reasons are twofold. First, it is hard to conceptually assess and formalize a gold-standard similarity between patients, which results in poor inter-annotator agreement in qualitative evaluations. Second, the community has lacked a clear benchmark for comparing existing metrics developed by scientific communities from various fields such as ontology, data science, and medical informatics. This study addresses these known challenges of evaluating patient similarities by proposing SIMpat, a synthetic benchmark to quantitatively evaluate available metrics, based on controlled cohorts, which could later be used to assess their sensitivity to aspects such as the sparsity of variables or the specificities of patient disease patterns.