Abstract
Background: The focus of most existing phenotyping algorithms based on electronic health record (EHR) data has been to accurately identify cases and non-cases of specific diseases. However, a more challenging task is to accurately identify disease incidence, as identifying the first occurrence of disease is more important for efficient and valid clinical and epidemiological research. Moreover, stroke is a challenging phenotype because it is difficult to diagnose and is commonly miscoded. This task generally requires multiple types of EHR data (e.g., diagnosis and procedure codes, unstructured clinical notes) and a more robust algorithm that integrates both natural language processing and machine learning. In this study, we developed and validated an EHR-based classifier to accurately identify stroke incidence among a cohort of atrial fibrillation (AF) patients.
Methods: We developed a stroke phenotyping algorithm using International Classification of Diseases, Ninth Revision (ICD-9) codes, Current Procedural Terminology (CPT) codes, and expert-provided keywords as model features. Structured data were extracted from the Rochester Epidemiology Project (REP) database. Natural language processing (NLP) was used to extract and validate keyword occurrences in clinical notes. A ±30-day window was used to decide which keywords and codes to include in the input vector. Frequencies of keywords and codes were used as input feature sets for model training. Multiple competing models were trained using various combinations of feature sets and two machine learning algorithms: logistic regression and random forest. Training data were provided by two nurse abstractors and included validated stroke incidences from a previously established atrial fibrillation cohort. Precision, recall, and F-score were calculated to assess and compare model performance.
Results: Among 4,914 patients with atrial fibrillation, 1,773 patients were screened. The remaining 3,141 patients had no stroke-related codes or keywords and were presumed to be free of stroke during follow-up. Among the screened patients, 740 had validated strokes and 1,033 did not have a stroke based on review of the EHR by trained nurse abstractors. The best-performing stroke incidence phenotyping classifier used Keywords+ICD-9+CPT features with a random forest, achieving a precision of 0.942, recall of 0.943, and F-score of 0.943.
Conclusion: We developed and validated a stroke phenotyping algorithm that performed well for identifying stroke incidence in an enriched population (AF cohort), which extends beyond the typical binary case/non-case stroke identification problem. Future work will involve testing the generalizability of this algorithm in a general population.
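
The feature construction and evaluation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state what the ±30-day window is centered on, so the sketch assumes a candidate event date (`index_date`). The function names, the `ICD9:`/`CPT:`/`KW:` term prefixes, and the three example terms are hypothetical and are not the study's feature list. The resulting count vectors are the kind of input a logistic regression or random forest would be trained on; model fitting itself is omitted here.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: the ±30-day window is centered on a candidate event date.
WINDOW = timedelta(days=30)


def window_features(events, index_date, vocabulary, window=WINDOW):
    """Count how often each code/keyword occurs within ±window of index_date.

    events     -- iterable of (event_date, term) pairs for one patient
    vocabulary -- ordered list of terms; fixes the layout of the output vector
    """
    counts = Counter(
        term
        for event_date, term in events
        if abs(event_date - index_date) <= window
    )
    return [counts[term] for term in vocabulary]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-score for binary labels (1 = stroke)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical patient record: (date, term) pairs from codes and note keywords.
events = [
    (date(2010, 3, 1), "ICD9:434.91"),
    (date(2010, 3, 3), "KW:stroke"),
    (date(2010, 3, 20), "KW:stroke"),
    (date(2010, 6, 15), "CPT:70450"),  # falls outside the window
]
vocabulary = ["ICD9:434.91", "CPT:70450", "KW:stroke"]

features = window_features(events, date(2010, 3, 5), vocabulary)
print(features)  # [1, 0, 2]
```

In this sketch each feature set (keywords, ICD-9, CPT) would simply be a different subset of `vocabulary`, so comparing feature-set combinations amounts to rebuilding the vectors with a different term list and scoring each trained model with `precision_recall_f1`.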

Abstract WP236: Harmonization of Stroke Risk Prediction Variables Using Natural Language Processing

Stroke Risk Prediction Using Machine Learning Algorithms

Abstract MP15: Validation of Phenotyping Algorithms for Stroke from Electronic Health Records Using Natural Language Processing

Natural Language Processing Enhances Prediction of Functional Outcome After Acute Ischemic Stroke

Automated Extraction of Stroke Severity From Unstructured Electronic Health Records Using Natural Language Processing

Exploring Machine Learning for Predicting Cerebral Stroke: A Study in Discovery

Abstract P259: Using Natural Language Processing and Machine Learning to Identify Incident Stroke from Electronic Health Records

Application of machine learning and natural language processing for predicting stroke-associated pneumonia

Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center

Novel Insights on Establishing Machine Learning-Based Stroke Prediction Models Among Hypertensive Adults

A Natural Language Processing Approach to Support Biomedical Data Harmonization: Leveraging Large Language Models

Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing

Automating Stroke Data Extraction From Free-Text Radiology Reports Using Natural Language Processing: Instrument Validation Study

A global effort to benchmark predictive models and reveal mechanistic diversity in long-term stroke outcomes

Abstract WP266: Prediction of Stroke Incidence Using Machine Learning: The Suita Study

Enriching the Study Population for Ischemic Stroke Therapeutic Trials Using a Machine Learning Algorithm

A Hybrid Machine Learning Approach to Cerebral Stroke Prediction Based on Imbalanced Medical Dataset

An Improved Concatenation of Deep Learning Models for Predicting and Interpreting Ischemic Stroke

Abstract 106: Novel Imaging-based Morphological Markers For Improved Prediction Of Stroke Outcomes: A Machine Learning Approach

NeuroHealth guardian: A novel hybrid approach for precision brain stroke prediction and healthcare analytics

Predicting recovery following stroke: deep learning, multimodal data and feature selection using explainable AI