Communication-Efficient Distributed Estimation and Inference for Cox's Model

Pierre Bayle, Jianqing Fan, Zhipeng Lou
2024-06-24
Abstract: Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.
Methodology, Statistics Theory, Applications, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to develop communication-efficient distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model when individual-level data cannot be shared across centers because of privacy and ownership concerns in multi-center biomedical research. Specifically, the authors propose iterative distributed algorithms for three tasks:

1. **Parameter Estimation**: Develop a communication-efficient iterative distributed algorithm for the high-dimensional sparse Cox proportional hazards model whose estimator attains the same convergence rate as the ideal full-sample estimator after relatively few iterations.
2. **Confidence Interval Construction**: Propose a novel debiasing method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals for linear combinations of high-dimensional hazard regression coefficients.
3. **Hypothesis Testing**: Provide a valid distributed hypothesis test for any individual coordinate of the coefficient vector, based on a decorrelated score test.

### Main Contributions

1. **Communication-Efficient Iterative Distributed Algorithm**: The authors develop an iterative distributed algorithm for parameter estimation in the high-dimensional sparse Cox proportional hazards model and prove that, under relatively mild conditions, the estimator attains the same convergence rate as the ideal full-sample estimator. Notably, the algorithm does not require the number of centers to be small.
2. **Confidence Interval Construction**: Using the debiasing method, the authors construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, establish central limit theorems, and provide consistent variance estimators that make the intervals asymptotically valid.
3. **Hypothesis Testing**: Based on the decorrelated score test, the authors propose a test for any individual coordinate of the coefficient vector and prove its asymptotic validity.

### Background and Motivation

As large-scale data sets become increasingly common, storing, computing on, and analyzing them becomes challenging. In medical science in particular, hospitals and laboratories that process clinical and genomic data often cannot share patient-level information because of privacy and ownership concerns. Communication-efficient distributed algorithms that support statistical analysis without sharing individual-level data are therefore especially important.

### Methods and Techniques

1. **Cox Proportional Hazards Model**: The Cox proportional hazards model is a widely used semi-parametric model for time-to-event outcomes. It assumes that the conditional hazard function has the form \[ \lambda(t|x(t))=\lambda_0(t)\exp\{x(t)^{\top}\beta^*\} \] where \(\lambda_0(t)\) is the baseline hazard function, \(x(t)\) is the (possibly time-dependent) covariate vector, and \(\beta^*\) is the population parameter vector.
2. **Distributed Environment**: The data are distributed across multiple centers. Each center processes its own portion of the data and communicates only local summary results to a central processor, which aggregates them.
3. **Iterative Algorithm**: A gradient-enhanced loss (GEL) function is minimized repeatedly, so that the accuracy of the estimator improves with each round of communication.
4. **Statistical Inference**: The debiasing method is used to construct confidence intervals, and the decorrelated score test is used for hypothesis testing.

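The estimation steps listed under Methods and Techniques can be illustrated with a small simulation. The following is a minimal sketch under several assumptions that are not taken from the paper: covariates are time-fixed, the global loss is the sample-size-weighted average of the per-center partial likelihoods, the gradient-enhanced loss takes the surrogate form commonly used in communication-efficient estimation (the local loss on one center plus a linear gradient correction and an L1 penalty), and the surrogate is minimized by proximal gradient descent. All function names, the simulated data, and the tuning values (`lam`, `step`, `n_steps`) are hypothetical.

```python
import numpy as np


def cox_loss_and_grad(beta, X, time, event):
    """Negative log partial likelihood (divided by n) and its gradient.

    Simplification: time-fixed covariates and no special handling of ties.
    """
    n = X.shape[0]
    eta = X @ beta
    shift = eta.max()                      # stabilise the exponentials
    w = np.exp(eta - shift)
    # at_risk[i, j] is True when subject j is still at risk at time[i].
    at_risk = time[None, :] >= time[:, None]
    denom = at_risk @ w                    # sum of weights over each risk set
    risk_mean = (at_risk * w) @ X / denom[:, None]
    loss = -np.sum(event * (eta - shift - np.log(denom))) / n
    grad = -(event[:, None] * (X - risk_mean)).sum(axis=0) / n
    return loss, grad


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def global_gradient(beta, centers):
    """Sample-size-weighted average of the local gradients.

    Only p-dimensional gradients leave each center, never patient rows.
    """
    total = sum(X.shape[0] for X, _, _ in centers)
    return sum(X.shape[0] * cox_loss_and_grad(beta, X, t, d)[1]
               for X, t, d in centers) / total


def gel_round(beta_bar, centers, lam, step=0.5, n_steps=200):
    """One communication round: minimise the surrogate loss on center 0.

    Assumed surrogate: L_0(b) - <grad L_0(beta_bar) - grad L_N(beta_bar), b>
    + lam * ||b||_1, minimised by proximal gradient descent (ISTA).
    """
    X0, t0, d0 = centers[0]
    correction = (cox_loss_and_grad(beta_bar, X0, t0, d0)[1]
                  - global_gradient(beta_bar, centers))
    beta = beta_bar.copy()
    for _ in range(n_steps):
        grad = cox_loss_and_grad(beta, X0, t0, d0)[1] - correction
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


def simulate_center(rng, n, beta_star):
    """Hypothetical data generator: exponential event and censoring times."""
    X = rng.normal(size=(n, beta_star.size))
    event_time = rng.exponential(scale=np.exp(-X @ beta_star))
    censor_time = rng.exponential(scale=2.0, size=n)
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(float)
    return X, time, event


rng = np.random.default_rng(0)
beta_star = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
centers = [simulate_center(rng, 200, beta_star) for _ in range(3)]

beta_hat = np.zeros(beta_star.size)
for _ in range(3):                         # a few communication rounds
    beta_hat = gel_round(beta_hat, centers, lam=0.05)
print(np.round(beta_hat, 2))
```

In each round the centers exchange only a gradient vector of length p with the central processor, which is what keeps the communication cost low in this sketch. The inference steps (debiasing and the decorrelated score test) are not covered here.
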
### Experimental Results

Numerical experiments on both simulated and real data evaluate the proposed distributed estimators, confidence intervals, and hypothesis tests, and indicate that they improve upon the alternative methods considered.

In summary, the paper contributes communication-efficient distributed methods for estimation and inference in the Cox proportional hazards model, addressing a key obstacle in multi-center biomedical research: analyzing data that cannot be pooled.