Abstract:
Introduction: Peripheral arterial disease (PAD) is the leading cause of amputation in the United States. Although PAD affects 8.5 million Americans and more than 200 million people globally, significant gaps in awareness persist among both patients and providers. Ongoing efforts to raise PAD awareness among the public and health-care professionals have not met widespread success. Thus, there is a need for alternative methods for identifying PAD patients. One potentially promising strategy leverages natural language processing (NLP) to digitally screen patients for PAD. Prior approaches have applied keyword search (KWS) to billing codes or unstructured clinical narratives to identify patients with PAD. However, KWS is limited by its lack of flexibility, the need for manual algorithm development, inconsistent validation, and an inherent failure to capture patients with undiagnosed PAD. Recent advances in deep learning (DL) allow modern NLP models to learn a conceptual representation of the verbiage associated with PAD. This capability may overcome the characteristic constraints of applying strict rule-based algorithms (i.e., searching for a disease-defining set of keywords or billing codes) to real-world clinical data. Herein, we investigate the use of DL to identify patients with PAD from unstructured notes in the electronic health record (EHR).
Methods: Using EHR data from a statewide health information exchange, we first created a dataset of all patients with diagnostic or procedural codes (International Classification of Diseases, version 9 or 10, or Current Procedural Terminology) for PAD. This study population was then subdivided into training (70%) and testing (30%) cohorts. We based ground truth labels (PAD versus no PAD) on the presence of a primary diagnostic or procedural billing code for PAD at the encounter level. We implemented our KWS-based identification strategy using the currently published state-of-the-art algorithm for identifying PAD cases from unstructured EHR data. We developed a DL model using a BioMed-RoBERTa base that was fine-tuned on the training cohort. We compared the performance of the KWS algorithm to that of our DL model on a binary classification task (PAD versus no PAD).
Results: Our study included 484,363 encounters across 71,355 patients, represented in 2,268,062 notes. For the task of correctly identifying PAD-related notes in our testing set, the DL model outperformed KWS on all model performance measures (sensitivity 0.70 versus 0.62; specificity 0.99 versus 0.94; positive predictive value [PPV] 0.82 versus 0.69; negative predictive value [NPV] 0.97 versus 0.96; accuracy 0.96 versus 0.91; P < 0.001 for all comparisons).
Conclusions: Our findings suggest that DL outperforms KWS for identifying PAD cases from clinical narratives. Planned future work derived from this project will develop models to stage patients based on clinical scoring systems.
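
For readers less familiar with the performance measures reported above, the following is a minimal sketch of how sensitivity, specificity, PPV, NPV, and accuracy can be derived from binary labels and predictions (1 = PAD, 0 = no PAD). It is illustrative only and is not the study's evaluation code; the names `BinaryMetrics` and `binary_metrics` are hypothetical, and the toy labels in the usage note are invented for demonstration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    sensitivity: float  # TP / (TP + FN)
    specificity: float  # TN / (TN + FP)
    ppv: float          # TP / (TP + FP)
    npv: float          # TN / (TN + FN)
    accuracy: float     # (TP + TN) / total


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a denominator is empty.
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute confusion-matrix metrics for 0/1 labels (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    return BinaryMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        accuracy=_ratio(tp + tn, tp + tn + fp + fn),
    )


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the study.
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, 0, 0, 1]
    print(binary_metrics(truth, preds))
```

Calling `binary_metrics` once on the KWS predictions and once on the DL predictions for the same test notes would give the kind of side-by-side comparison reported in the Results; the significance testing behind the reported P values is not covered by this sketch.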

Disease phenotyping using deep learning: A diabetes case study

A Deep Learning Approach to Diabetes Diagnosis

Comparing Rule-Based and Deep Learning Models for Patient Phenotyping

Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis

Stratification of diabetes in the context of comorbidities, using representation learning and topological data analysis

DiabetesNet: A Deep Learning Approach to Diabetes Diagnosis

Deep Learning Skin Disease Classifiers: Current Status and Future Prospects

Towards Automated ICD Coding Using Deep Learning

DiabDeep: Pervasive Diabetes Diagnosis based on Wearable Medical Sensors and Efficient Neural Networks

SynthA1c: Towards Clinically Interpretable Patient Representations for Diabetes Risk Stratification

Skin Disease Classification Using Deep Learning

A Deep Learning Model Incorporating Knowledge Representation Vectors and Its Application in Diabetes Prediction

A platform for phenotyping disease progression and associated longitudinal risk factors in large-scale EHRs, with application to incident diabetes complications in the UK Biobank

Deep Learning Skin Disease Classifiers: Current Status and Future Prospects (Preprint)

Deep learning imaging phenotype can classify metabolic syndrome and is predictive of cardiometabolic disorders

Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning

Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare

ECG for high-throughput screening of multiple diseases: Proof-of-concept using multi-diagnosis deep learning from population-based datasets

A Machine Learning Approach for Prediction of Diabetes Mellitus

Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases

Use of Deep Learning to Identify Peripheral Arterial Disease Cases From Narrative Clinical Notes