Comparing Machine Learning Models for Identifying Chronic Cough Using Diagnosis and Medication in the Electronic Health Records
Vishal Bali, Xiao Luo, Priyanka Gandhi, Zuoyi Zhang, Wei Shao, Zhi Han, Vasu Chandrasekaran, Vladimir Turzhitsky, Anna Roberts, Megan Metzger, Jarod Baker, Carmen La Rosa, Jessica Weaver, Paul Dexter, Kun Huang
DOI: https://doi.org/10.1016/j.jaci.2020.12.241
IF: 14.29
2021-01-01
Journal of Allergy and Clinical Immunology
Abstract: Chronic cough (CC), a cough lasting eight or more weeks, is often multifactorial and can impair patients' quality of life. We investigated the potential of machine learning models for CC prediction using the diagnoses and medications in electronic health records (EHRs). We constructed two cohorts of patients aged 18-85 years using the EHRs of a large statewide academic system and a public county hospital. A rule-based algorithm identified 23,573 CC patients who had an outpatient visit and 120 days of available medical history from the index date; an equal number of patients whose cough did not meet the CC criteria were also included. We used diagnosis and medication data to simulate claims data for CC prediction. We used the original diagnosis names to capture the contextual information between diagnosis and medication. The medication information was standardized with the National Drug Code Directory and then mapped to medication categories. An NLP method was used to construct the data representation for the learning models. Four machine learning models were compared: Logistic Regression (LR), Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Random Forest (RF). LR, SVM, kNN, and RF achieved sensitivities of 0.83, 0.85, 0.71, and 0.84, respectively, and specificities of 0.78, 0.81, 0.67, and 0.79, respectively. SVM achieved the highest sensitivity and specificity of the four models. Using only the medication and diagnosis data available in claims, SVM can identify the majority of CC patients. Additionally, symptoms extracted from clinical notes may further improve the performance of the models.
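The abstract compares the models by sensitivity and specificity. As a minimal sketch of how these two metrics are computed for a binary classifier, the Python snippet below counts true and false positives and negatives from label lists. The function name `sensitivity_specificity`, the label encoding (1 for CC, 0 for a cough not meeting the CC criteria), and the toy labels are assumptions made for illustration; they are not taken from the study's code or data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding (hypothetical): 1 = chronic cough, 0 = other cough.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # CC correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # CC missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-CC correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-CC wrongly flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Toy labels invented for illustration; not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With balanced cohorts, as described in the abstract, the two metrics weight the CC and non-CC groups equally, so a model can be compared on both without correcting for class imbalance.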