Abstract WP236: Harmonization of Stroke Risk Prediction Variables Using Natural Language Processing

Pratheek Mallya, Juan Zhao, Chuan Hong, Ricardo Henao, Daniel Wojdyla, Tony Schibler, Vihaan Manchanda, Michael Pencina, Jennifer L Hall
DOI: https://doi.org/10.1161/str.55.suppl_1.wp236
IF: 10.17
2024-02-03
Stroke
Abstract: Stroke, Volume 55, Issue Suppl_1, Page AWP236-AWP236, February 1, 2024.

Introduction: Early identification of stroke risk can profoundly influence an individual's chance of survival. While machine learning-based stroke risk prediction models perform better than traditional counterparts, they require integrating diverse open data sources. Variable discrepancies - where the same concept is described differently - pose challenges for data integration, requiring time-intensive manual harmonization. To address this, we developed an automated harmonization technique using natural language models.

Methods: We utilized data from the Atherosclerosis Risk in Communities (ARIC), Multi-Ethnic Study of Atherosclerosis (MESA), Framingham Offspring Study, and Reasons for Geographic and Racial Differences in Stroke (REGARDS) cohorts. Our approach involved training a two-layer fully connected neural network (FCN) to predict a harmonized 'concept variable' based on variable descriptions. These predictions employed embeddings from Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT). Further refinement was achieved through paired sentence classification tasks to determine if description pairs shared identical concepts. We experimented with both separate cohort training and combined cohort training to evaluate context dependency. We used cosine similarity between the embeddings of the pre-trained BioBERT encoder as a prediction score for our baseline method.

Results: The harmonization methods using natural language models outperformed the baseline method. The FCN model improved the area under the receiver operating characteristic curve (AUROC) from 0.787 (baseline) to 0.985 for the paired classification task. For harmonization concept prediction, the top-2 accuracy - correct prediction in the top two answers - improved from 11.9% (baseline) to 54.6%, while the top-5 accuracy improved from 16.1% (baseline) to 67.6%.

Conclusions: Utilizing natural language processing for data harmonization provides a scalable approach that improves the accuracy and efficiency of prediction. This facilitates the inclusion of diverse cohorts, broadening the sample size and range of indicators available, which in turn advances the versatility and the generalizability of stroke risk prediction models.
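The cosine-similarity baseline and the top-k accuracy metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three-dimensional vectors stand in for BioBERT embeddings, and the concept names (`systolic_bp`, `diabetes`, `smoking_status`) and function names are hypothetical, chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_concepts(variable_vec, concept_vecs):
    """Return concept names sorted by descending cosine similarity."""
    scored = [(cosine_similarity(variable_vec, vec), name)
              for name, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

def top_k_accuracy(ranked_lists, true_labels, k):
    """Fraction of variables whose true concept is among the k best-ranked."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)

# Toy stand-ins for embeddings of harmonized concepts (hypothetical values).
concept_vecs = {
    "systolic_bp": [1.0, 0.0, 0.0],
    "diabetes": [0.0, 1.0, 0.0],
    "smoking_status": [0.0, 0.0, 1.0],
}

# Toy stand-ins for embeddings of cohort variable descriptions.
variable_vecs = [
    [0.9, 0.1, 0.0],  # nearest concept: systolic_bp
    [0.1, 0.2, 0.9],  # nearest concept: smoking_status
    [0.6, 0.7, 0.1],  # nearest concept: diabetes, then systolic_bp
]
true_labels = ["systolic_bp", "smoking_status", "systolic_bp"]

ranked = [rank_concepts(vec, concept_vecs) for vec in variable_vecs]
top1 = top_k_accuracy(ranked, true_labels, k=1)
top2 = top_k_accuracy(ranked, true_labels, k=2)
print(f"top-1 accuracy: {top1:.3f}, top-2 accuracy: {top2:.3f}")
```

In this toy data the third variable is ranked closest to the wrong concept, so it counts as a miss at k=1 and as a hit at k=2. This is why top-k accuracy rises with k.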
peripheral vascular disease, clinical neurology