Abstract P259: Using Natural Language Processing and Machine Learning to Identify Incident Stroke from Electronic Health Records

Yiqing Zhao,Sunyang Fu,Suzette J. Bielinski,Paul A. Decker,Alanna M. Chamberlain,Véronique L. Roger,Hongfang Liu,Nicolas B Larson
DOI: https://doi.org/10.1161/circ.141.suppl_1.p259
IF: 37.8
2020-01-01
Circulation
Abstract:Background: The focus of most existing phenotyping algorithms based on electronic health record (EHR) data has been to accurately identify cases and non-cases of specific diseases. However, a more challenging task is to accurately identify disease incidence, as identifying the first occurrence of disease is more important for efficient and valid clinical and epidemiological research. Moreover, stroke is a challenging phenotype due to diagnosis difficulty and common miscoding. This task generally requires utilization of multiple types of EHR data (e.g., diagnoses and procedure codes, unstructured clinical notes) and a more robust algorithm integrating both natural language processing and machine learning. In this study, we developed and validated an EHR-based classifier to accurately identify stroke incidence among a cohort of atrial fibrillation (AF) patients Methods: We developed a stroke phenotyping algorithm using International Classification of Diseases, Ninth Revision (ICD-9) codes, Current Procedural Terminology (CPT) codes, and expert-provided keywords as model features. Structured data was extracted from Rochester Epidemiology Project (REP) database. Natural Language Processing (NLP) was used to extract and validate keyword occurrence in clinical notes. A window of ±30 days was considered when including/excluding keywords/codes into the input vector. Frequencies of keywords/codes were used as input feature sets for model training. Multiple competing models were trained using various combinations of feature sets and two machine learning algorithms: logistic regression and random forest. Training data were provided by two nurse abstractors and included validated stroke incidences from a previously established atrial fibrillation cohort. Precision, recall, and F-score of the algorithm were calculated to assess and compare model performances. Results: Among 4,914 patients with atrial fibrillation, 1,773 patients were screened. 3,141 patients had no stroke-related codes or keywords and were presumed to be free of stroke during follow-up. Among the screened patients, 740 had validated strokes and 1,033 did not have a stroke based on review of the EHR by trained nurse abstractors. The best performing stroke incidence phenotyping classifier utilized Keywords+ICD-9+CPT features using a random forest classifier, achieving a precision of 0.942, recall of 0.943, and F-score of 0.943. Conclusion: In conclusion, we developed and validated a stroke algorithm that performed well for identifying stroke incidence in an enriched population (AF cohort), which extends beyond the typical binary case/non-case stroke identification problem. Future work will involve testing the generalizability of this algorithm in a general population.
What problem does this paper attempt to address?