Abstract:

Background: Multi-site studies facilitate the study of rare outcomes or exposures by integrating patient information from several distinct care sites. Because of patient privacy concerns, sharing patient-level information among collaborating sites is often prohibited, which creates a need for privacy-preserving data analysis methods. Several such methods exist, but they have been shown to sometimes produce biased estimates or to require extensive communication among sites.

Objective: We present a communication-efficient, privacy-preserving method for performing distributed regression on electronic health record (EHR) data across multiple sites for zero-inflated count outcomes. Our approach is motivated by two real-world data problems: examining risk factors associated with pediatric avoidable hospitalization, and modeling the frequency of serious adverse events in colorectal cancer patients.

Methods: We use hurdle regression, a two-part (logistic-Poisson) regression model, to characterize the effects of risk factors on zero-inflated count outcomes. We develop a one-shot distributed algorithm for hurdle regression (ODAH) that uses individual patient data at one site and aggregated data from all other sites to approximate the complete-data log-likelihood. We evaluate ODAH through extensive simulations and an application to EHR data from the Children's Hospital of Philadelphia (CHOP) and the OneFlorida Clinical Research Consortium. We compare ODAH estimates to those from meta-analysis and from pooled analysis (all patient data pooled together, the gold standard).

Results: In simulations, ODAH estimates exhibited bias relative to the gold standard of less than 0.1% across several settings. In contrast, meta-analysis estimates exhibited relative bias of up to 12.7%, largely dependent on event rate. When applying ODAH to CHOP data, relative biases for estimates in both components of the hurdle model were less than 5.1%, while meta-analysis estimates exhibited relative bias as high as 63.6%. When analyzing OneFlorida data, ODAH relative biases were less than 10% for eight of the ten estimated coefficients, while meta-analysis estimates again showed substantially greater bias.

Conclusions: Our simulations and real-world applications suggest that ODAH is a promising method for privacy-preserving distributed learning on EHR data when modeling zero-inflated count outcomes.
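To make the two-part (logistic-Poisson) hurdle model in the Methods concrete, the following is a minimal sketch of its log-likelihood in plain Python. It is not the authors' ODAH implementation and does not include the distributed approximation step. It assumes the common hurdle parameterization: a logistic model for whether the count is positive, and a zero-truncated Poisson with log link for the positive counts. The function and argument names (`hurdle_loglik`, `gamma`, `beta`, `relative_bias_pct`) are hypothetical, and the relative-bias formula is an assumed definition of "bias relative to the gold standard".

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def hurdle_loglik(gamma, beta, X, y):
    """Log-likelihood of a logistic / zero-truncated-Poisson hurdle model.

    gamma: coefficients of the logistic part, P(Y > 0) = sigmoid(x . gamma)
    beta:  coefficients of the count part, lambda = exp(x . beta)
    X:     list of covariate rows (include a leading 1.0 for an intercept)
    y:     list of non-negative integer counts
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta_zero = sum(g * x for g, x in zip(gamma, x_i))
        eta_count = sum(b * x for b, x in zip(beta, x_i))
        if y_i == 0:
            # log P(Y = 0) = log(1 - sigmoid(eta_zero))
            total += log_sigmoid(-eta_zero)
        else:
            lam = math.exp(eta_count)
            # log P(Y > 0) plus the zero-truncated Poisson log-pmf:
            # y*log(lam) - lam - log(y!) - log(1 - exp(-lam))
            total += (
                log_sigmoid(eta_zero)
                + y_i * eta_count
                - lam
                - math.lgamma(y_i + 1)
                - math.log(-math.expm1(-lam))
            )
    return total


def relative_bias_pct(estimate, gold):
    """Assumed definition: percent absolute difference from the pooled estimate."""
    return 100.0 * abs(estimate - gold) / abs(gold)


if __name__ == "__main__":
    # Toy data, made up purely for illustration: intercept plus one covariate.
    X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.5], [1.0, 1.5]]
    y = [0, 2, 0, 5]
    print(hurdle_loglik([0.2, 0.4], [0.1, 0.6], X, y))
    print(relative_bias_pct(1.05, 1.00))
```

Because the two parts share no parameters in this parameterization, the log-likelihood separates into a logistic term and a truncated-Poisson term, so each part can be maximized on its own.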

Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?

Adaptive sample size determination for the development of clinical prediction models

Calculating the sample size required for developing a clinical prediction model

Sample size for developing a prediction model with a binary outcome: targeting precise individual risk estimates to improve clinical decisions and fairness

Sample size for binary logistic prediction models: Beyond events per variable criteria

Minimum sample size for developing a multivariable prediction model using multinomial logistic regression

A practical solution to estimate the sample size required for clinical prediction models generated from observational research on data

Sample Size Guidelines for Logistic Regression from Observational Studies with Large Population: Emphasis on the Accuracy Between Statistics and Parameters Based on Real Life Clinical Data

Sample size considerations and predictive performance of multinomial logistic prediction models

Impact of sample size on the stability of risk scores from clinical prediction models: a case study in cardiovascular disease

Sample size requirements are not being considered in studies developing prediction models for binary outcomes: a systematic review

Minimum sample size for external validation of a clinical prediction model with a continuous outcome

Spectroscopic and kinetic aspects of Elephas maximus hemoglobin.

Distributed Learning from Multi-Site Observational Health Data for Zero-Inflated Count Outcomes

Minimum sample size for external validation of a clinical prediction model with a binary outcome

A comparison of approaches to improve worst-case predictive model performance over patient subpopulations

Dataset size versus homogeneity: A machine learning study on pooling intervention data in e-mental health dropout predictions

The path toward generalizable clinical prediction models

Minimum sample size calculations for external validation of a clinical prediction model with a time‐to‐event outcome

Comparison of deep learning and conventional methods for disease onset prediction

Sample Size in Natural Language Processing within Healthcare Research