Abstract: Using Electronic Health Records (EHR) for translational research is challenging because accurate disease phenotype data are difficult to extract. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold-standard labels curated through labor-intensive medical chart review. These simple algorithms tend to have unpredictable portability across institutions and, because billing codes are imprecise, low accuracy for many disease phenotypes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping. These algorithms are typically trained via supervised learning, relating gold-standard labels to a wide range of candidate features, including billing codes, procedure codes, medication prescriptions, and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, because gold-standard labeling is so time-intensive, the training set is often too small to build a generalizable algorithm from the large number of candidate features extracted from the EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype through an unsupervised clustering of disease status, based on several highly predictive features, such as diagnosis codes and mentions of the disease in text fields, that are available in the entire set of EHR data. A sparse regression model is then fit with the estimated outcomes and the remaining covariates to identify the features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model.
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping, thereby harnessing the automated power of EHR data and improving efficiency.
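The two-step procedure described in the abstract can be illustrated with a minimal sketch on simulated data. This is not the authors' implementation. It assumes a simple two-means split of a single surrogate score as a stand-in for the unsupervised clustering, and a plain linear lasso fit by coordinate descent as a stand-in for the sparse regression model. All names (`two_means_1d`, `lasso_cd`, `latent`, `surrogate`, `features`) and all simulation settings are hypothetical.

```python
import math
import random


def two_means_1d(scores, iters=50):
    """Split scalar surrogate scores into two clusters; 1.0 marks the higher-mean cluster."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        cut = (lo + hi) / 2
        low = [s for s in scores if s <= cut]
        high = [s for s in scores if s > cut]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    cut = (lo + hi) / 2
    return [1.0 if s > cut else 0.0 for s in scores]


def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the lasso proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(X, y, lam, iters=100):
    """Coordinate-descent lasso on standardized columns; returns standardized coefficients."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
           for j in range(p)]
    Z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]  # residuals when every coefficient is zero
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Covariance of column j with the partial residual (column variance is 1).
            rho = sum(Z[i][j] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Z[i][j]
                beta[j] = new
    return beta


# Simulated cohort (hypothetical): the latent disease status is never shown to the method.
random.seed(0)
n, p = 600, 8
latent = [1.0 if random.random() < 0.3 else 0.0 for _ in range(n)]
# One highly predictive surrogate score, e.g. a transformed diagnosis-code count.
surrogate = [3.0 * d + random.gauss(0.0, 0.5) for d in latent]
# Remaining candidate features: columns 0 and 1 are informative, the rest are noise.
features = [[(1.5 * d if j < 2 else 0.0) + random.gauss(0.0, 1.0) for j in range(p)]
            for d in latent]

y_hat = two_means_1d(surrogate)             # step 1: estimated outcome from clustering
beta = lasso_cd(features, y_hat, lam=0.1)   # step 2: sparse regression on remaining features
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

The sketch fits a linear model to a binary estimated outcome, so the link function is deliberately misspecified. The Li and Duan (1989) result cited in the abstract is what motivates expecting the selected set to remain meaningful in that case, under conditions on the covariate distribution that this toy simulation does not check.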

Automated feature selection of predictors in electronic medical records data
