Abstract

Background: Multi-site studies facilitate the study of rare outcomes or exposures by integrating patient information from several distinct care sites. Because of patient privacy concerns, sharing patient-level information among collaborating sites is often prohibited, which creates a need for privacy-preserving data analysis methods. Several such methods exist, but they have been shown to sometimes yield biased estimates or to require extensive communication among sites.

Objective: We present a communication-efficient, privacy-preserving method for performing distributed regression on Electronic Health Records (EHR) data across multiple sites for zero-inflated count outcomes. Our approach is motivated by two real-world data problems: examining risk factors associated with pediatric avoidable hospitalization and modeling the frequency of serious adverse events in colorectal cancer patients.

Methods: We use hurdle regression, a two-part (logistic-Poisson) regression model, to characterize the effects of risk factors on zero-inflated count outcomes. We develop a one-shot algorithm for performing hurdle regression (ODAH) across multiple sites, using individual patient data at one site and aggregated data from all other sites to approximate the complete-data log likelihood. We evaluate ODAH through extensive simulations and an application to EHR data from the Children's Hospital of Philadelphia (CHOP) and the OneFlorida Clinical Research Consortium. We compare ODAH estimates to those from meta-analysis and from pooled analysis (all patient data pooled together, the gold standard).

Results: In simulations, ODAH estimates exhibited bias relative to the gold standard of less than 0.1% across several settings. In contrast, meta-analysis estimates exhibited relative bias of up to 12.7%, largely dependent on event rate. When applying ODAH to CHOP data, relative biases for estimates in both components of the hurdle model were less than 5.1%, while meta-analysis estimates exhibited relative bias as high as 63.6%. When analyzing OneFlorida data, ODAH relative biases were less than 10% for eight of the ten estimated coefficients, while meta-analysis estimates again showed substantially greater bias.

Conclusions: Our simulations and real-world applications suggest that ODAH is a promising method for performing privacy-preserving distributed learning on EHR data when modeling zero-inflated count outcomes.
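To make the two-part (logistic-Poisson) structure concrete, here is a minimal sketch of a textbook hurdle log-likelihood: a logistic model for whether the count is positive, and a zero-truncated Poisson model for the size of positive counts. It is not the authors' ODAH implementation and does not reproduce the one-shot distributed approximation. The function names `hurdle_loglik` and `relative_bias`, the shared design matrix `X`, and the definition of relative bias as |estimate - gold| / |gold| are assumptions made for illustration.

```python
import numpy as np
from math import lgamma


def hurdle_loglik(beta, gamma, X, y):
    """Log-likelihood of a logistic / zero-truncated Poisson hurdle model.

    beta  : coefficients of the logistic part, logit P(Y > 0) = X @ beta
    gamma : coefficients of the count part, log(lambda) = X @ gamma
    (Using the same X for both parts is an illustrative assumption.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)

    # Logistic part: numerically stable log P(Y > 0) and log P(Y = 0).
    eta = X @ beta
    log_p_pos = -np.logaddexp(0.0, -eta)
    log_p_zero = -np.logaddexp(0.0, eta)

    pos = y > 0
    ll = log_p_zero[~pos].sum() + log_p_pos[pos].sum()

    # Count part: Poisson pmf renormalized to exclude zero,
    # P(Y = k | Y > 0) = exp(-lam) * lam**k / (k! * (1 - exp(-lam))).
    log_lam = X[pos] @ gamma
    lam = np.exp(log_lam)
    y_pos = y[pos]
    log_fact = np.array([lgamma(k + 1.0) for k in y_pos])
    ll += (y_pos * log_lam - lam - log_fact - np.log1p(-np.exp(-lam))).sum()
    return float(ll)


def relative_bias(estimate, gold):
    """Absolute bias of an estimate relative to a gold-standard value (assumed definition)."""
    estimate = np.asarray(estimate, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return np.abs((estimate - gold) / gold)
```

In a pooled analysis this function would be maximized over `beta` and `gamma` using all patients at once. Because the log-likelihood is a sum of a logistic term and a truncated-Poisson term with separate parameters, the two parts can be fitted separately.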


Distributed Learning from Multi-Site Observational Health Data for Zero-Inflated Count Outcomes
