ABSTRACT

Background: Electronic health records (EHRs) promise to enable broad-ranging discovery with power exceeding that of conventional research cohort studies. However, research using EHR datasets may be subject to selection bias, which can be compounded by missing data, limiting the generalizability of derived insights.

Methods: Mass General Brigham (MGB) is a large New England-based healthcare network comprising seven tertiary care and community hospitals with associated outpatient practices. Within an MGB-based EHR warehouse of >3.5 million individuals with at least one ambulatory care visit, we approximated a community-based cohort study by selectively sampling individuals longitudinally attending primary care practices between 2001 and 2018 (n=520,868), which we named the Community Care Cohort Project (C3PO). We also utilized pre-trained deep natural language processing (NLP) models to recover vital signs (i.e., height, weight, and blood pressure) from unstructured notes in the EHR. We assessed the validity of C3PO by deploying established risk models, including the Pooled Cohort Equations (PCE) and the Cohorts for Aging and Genomic Epidemiology Atrial Fibrillation (CHARGE-AF) score, and compared model performance in C3PO to that observed within typical EHR Convenience Samples, which included all individuals from the same parent EHR with sufficient data to calculate each score but without a requirement for longitudinal primary care. All analyses were facilitated by the JEDI Extractive Data Infrastructure pipeline, which we designed to efficiently aggregate EHR data within a unified framework conducive to regular updates.

Results: C3PO includes 520,868 individuals (mean age 48 years, 61% women, median follow-up 7.2 years, median primary care visits per individual 13). Estimated from reports, C3PO contains over 2.9 million electrocardiograms, 450,000 echocardiograms, 12,000 cardiac magnetic resonance images, and 75 million narrative notes. Using tabular data alone, 286,009 individuals (54.9%) had all vital signs available at baseline, which increased to 358,411 (68.8%) after NLP recovery (31% reduction in missingness). Among individuals with both NLP and tabular data available, NLP-extracted and tabular vital signs obtained on the same day were highly correlated (e.g., Pearson r range 0.95-0.99, p<0.01 for all). Both the PCE models (c-index range 0.724-0.770) and CHARGE-AF (c-index 0.782, 95% CI 0.777-0.787) demonstrated good discrimination. As compared to the Convenience Samples, AF and MI/stroke incidence rates in C3PO were lower and calibration error was smaller for both PCE (integrated calibration index range 0.012-0.030 vs. 0.028-0.046) and CHARGE-AF (0.028 vs. 0.036).

Conclusions: Intentional sampling of individuals receiving regular ambulatory care and use of NLP to recover missing data have the potential to reduce bias in EHR research and maximize generalizability of insights.
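The "31% reduction in missingness" is a relative reduction in the share of individuals lacking complete baseline vital signs, not the difference between the two completeness percentages. The short sketch below reproduces that arithmetic from the counts reported in the abstract. The function and variable names are illustrative assumptions and are not part of the C3PO or JEDI code.

```python
# Counts reported in the abstract.
TOTAL = 520_868               # individuals in C3PO
COMPLETE_TABULAR = 286_009    # all vital signs available from tabular data alone
COMPLETE_AFTER_NLP = 358_411  # all vital signs available after NLP recovery


def missing_fraction(complete: int, total: int) -> float:
    """Fraction of individuals missing at least one baseline vital sign."""
    return 1 - complete / total


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


missing_before = missing_fraction(COMPLETE_TABULAR, TOTAL)
missing_after = missing_fraction(COMPLETE_AFTER_NLP, TOTAL)
reduction = relative_reduction(missing_before, missing_after)

print(f"Missing before NLP: {missing_before:.1%}")
print(f"Missing after NLP:  {missing_after:.1%}")
print(f"Relative reduction: {reduction:.1%}")
```

Missingness falls from roughly 45.1% to 31.2% of the cohort, a relative reduction of about 31%, which matches the reported figure.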

PRISM: Mitigating EHR Data Sparsity Via Learning from Missing Feature Calibrated Prototype Patient Representations

PRISM: Leveraging Prototype Patient Representations with Feature-Missing-Aware Calibration for EHR Data Sparsity Mitigation

Leveraging Prototype Patient Representations with Feature-Missing-Aware Calibration to Mitigate EHR Data Sparsity.

Deep Dynamic Patient Similarity Analysis: Model Development and Validation in ICU.

PRISM: Privacy Preserving Healthcare Internet of Things Security Management

SMART: Towards Pre-trained Missing-Aware Model for Patient Health Status Prediction

Learnable Prompt as Pseudo-Imputation: Reassessing the Necessity of Traditional EHR Data Imputation in Downstream Clinical Prediction

HealthPrism: A Visual Analytics System for Exploring Children's Physical and Mental Health Profiles with Multimodal Data

PRISM: Patient Response Identifiers for Stratified Medicine

PRISM: Privacy-preserving Inter-Site MRI Harmonization via Disentangled Representation Learning

Mixed-Integer Projections for Automated Data Correction of EMRs Improve Predictions of Sepsis among Hospitalized Patients

PiRL: Participant-Invariant Representation Learning for Healthcare

Predict and Interpret Health Risk using EHR through Typical Patients

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)

FairCare: Adversarial training of a heterogeneous graph neural network with attention mechanism to learn fair representations of electronic health records

Projective Resampling Imputation Mean Estimation Method for Missing Covariates Problem

Fair Patient Model: Mitigating Bias in the Patient Representation Learned from the Electronic Health Records

Imputation of missing values for electronic health record laboratory data

StratMed: Relevance stratification between biomedical entities for sparsity on medication recommendation

Grasp: Generic Framework For Health Status Representation Learning Based On Incorporating Knowledge From Similar Patients

Cohort Design and Natural Language Processing to Reduce Bias in Electronic Health Records Research: The Community Care Cohort Project