Abstract: We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0\gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least squares approximation to the partial likelihood (PL). This sequence of linearizations enables us to maximize the PL with only a small subset and perform penalized estimation via a fast approximation to the PL. The algorithm is applicable to the analysis of both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm substantially outperforms the full sample-based estimators and the existing DAC algorithm with respect to computational speed, while achieving statistical efficiency similar to that of the full sample-based estimators. The proposed algorithm was applied to an extraordinarily large time-independent survival dataset and an extraordinarily large time-dependent survival dataset for the prediction of heart failure-specific readmission within 30 days among Medicare heart failure patients.
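
The two approximations named in the abstract can be written out roughly as follows. This is a sketch in assumed notation, not formulas quoted from the paper: $U_k$ and $A_k$ denote the score and the negative Hessian of the log partial likelihood on subset $k$ of $K$, $\tilde\beta^{(0)}$ is the estimate from a single small subset, and $\lambda$, $\gamma$ are the tuning parameters of an adaptive LASSO penalty.

$$
\tilde\beta^{(m)} = \tilde\beta^{(m-1)} + \Big\{\sum_{k=1}^{K} A_k\big(\tilde\beta^{(m-1)}\big)\Big\}^{-1} \sum_{k=1}^{K} U_k\big(\tilde\beta^{(m-1)}\big), \qquad m = 1, \dots, M,
$$

$$
\hat\beta = \arg\min_{\beta}\; \frac{1}{2}\big(\beta - \tilde\beta^{(M)}\big)^{\top} \hat A \big(\beta - \tilde\beta^{(M)}\big) + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde\beta^{(M)}_j|^{\gamma}}, \qquad \hat A = \frac{1}{n_0}\sum_{k=1}^{K} A_k\big(\tilde\beta^{(M)}\big).
$$

Under this reading, the first display only needs per-subset sums of $p$-dimensional quantities, and the second replaces the partial likelihood with a $p$-dimensional quadratic, so the penalized step no longer touches the $n_0$ observations.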

What problem does this paper attempt to address?

This paper addresses the computational cost of fitting sparse Cox regression models to large-scale datasets. Specifically, when the sample size $n_0$ is very large and the covariate dimension $p$ is not small but $n_0\gg p$, fitting a Cox model with an adaptive LASSO penalty directly by traditional methods is computationally infeasible. To address this, the authors propose a new algorithm based on the divide-and-conquer (DAC) strategy, called the **DAC lin** algorithm. Through a sequence of linearization steps, the algorithm substantially reduces the computational burden while maintaining statistical efficiency similar to that of the full-sample estimator.

### Main contributions:

1. **Improved computational efficiency**: By combining the divide-and-conquer strategy with a one-step linear approximation and a least squares approximation to the partial likelihood, the DAC lin algorithm needs to maximize the partial likelihood on only a small subset of the data.
2. **Wide range of applicability**: The algorithm applies to both time-independent and time-dependent survival data.
3. **Statistical performance**: In simulations, the DAC lin algorithm is substantially faster than the full-sample estimators and the existing DAC algorithm, while achieving statistical efficiency similar to that of the full-sample estimators.

### Application examples:

The authors apply the proposed DAC lin algorithm to a large-scale dataset containing more than 9.5 million Medicare patients to predict the risk of heart failure-specific readmission or death within 30 days after discharge among heart failure patients. The results show that the algorithm identifies multiple covariates related to this risk, with much higher computational efficiency than the traditional full-sample method.

### Specific problems addressed:

- **High computational complexity**: Directly fitting the adaptive LASSO-penalized Cox model to large-scale datasets is computationally expensive and difficult to complete within a reasonable time.
- **Maintaining statistical efficiency**: The gain in computational efficiency should not come at the cost of statistical efficiency, which must stay close to that of the full-sample estimator to preserve the model's prediction accuracy.

With these properties, the DAC lin algorithm provides an efficient and reliable approach to large-scale survival data analysis.
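
The pipeline summarized above (initial fit on one subset, aggregated one-step updates, then a penalized least squares approximation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cox_score_info`, `dac_one_step` and `adaptive_lasso_lsa` are hypothetical, times are assumed untied, covariates are time-independent, and the tuning parameter is fixed rather than selected by a criterion such as BIC.

```python
import numpy as np

def cox_score_info(beta, X, time, event):
    """Score and observed information of the Cox log partial likelihood (assumes no ties)."""
    order = np.argsort(-time)                 # descending time: risk set of row i is rows 0..i
    X, event = X[order], event[order].astype(bool)
    w = np.exp(X @ beta)
    s0 = np.cumsum(w)
    s1 = np.cumsum(w[:, None] * X, axis=0)
    s2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
    xbar = s1[event] / s0[event][:, None]     # risk-set weighted covariate means at event times
    score = (X[event] - xbar).sum(axis=0)
    info = (s2[event] / s0[event][:, None, None]
            - xbar[:, :, None] * xbar[:, None, :]).sum(axis=0)
    return score, info

def dac_one_step(X, time, event, n_blocks=10, n_updates=3, init_iters=25):
    """Newton-Raphson on the first block only, then one-step updates pooled over all blocks."""
    blocks = np.array_split(np.arange(len(time)), n_blocks)  # assumes rows are in random order
    first = blocks[0]
    beta = np.zeros(X.shape[1])
    for _ in range(init_iters):
        u, a = cox_score_info(beta, X[first], time[first], event[first])
        beta = beta + np.linalg.solve(a, u)
    for _ in range(n_updates + 1):
        parts = [cox_score_info(beta, X[b], time[b], event[b]) for b in blocks]
        u, a = sum(p[0] for p in parts), sum(p[1] for p in parts)
        if _ < n_updates:                     # last pass only refreshes the information matrix
            beta = beta + np.linalg.solve(a, u)
    return beta, a

def adaptive_lasso_lsa(beta_tilde, A, lam, gamma=1.0, n_sweeps=200):
    """Coordinate descent for 0.5*(b-bt)' A (b-bt) + lam * sum_j |b_j| / |bt_j|**gamma."""
    pen = lam / np.abs(beta_tilde) ** gamma
    b = beta_tilde.copy()
    for _ in range(n_sweeps):
        for j in range(len(b)):
            r = A[j] @ (beta_tilde - b) + A[j, j] * b[j]   # target for coordinate j
            b[j] = np.sign(r) * max(abs(r) - pen[j], 0.0) / A[j, j]   # soft-threshold
    return b

# Simulated example (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n, true_beta = 5000, np.array([0.5, -0.5, 0.0, 0.0, 0.0])
X = rng.normal(size=(n, 5))
t_event = rng.exponential(1.0 / np.exp(X @ true_beta))     # hazard exp(x'beta)
t_cens = rng.exponential(2.0, size=n)                      # independent censoring
time, event = np.minimum(t_event, t_cens), t_event <= t_cens
beta_dac, A = dac_one_step(X, time, event)
beta_sparse = adaptive_lasso_lsa(beta_dac, A / n, lam=0.01)
```

The design point the sketch tries to convey is that only the first block is ever fitted iteratively; every later pass over the data computes per-block sums that could run in parallel, and the final penalized step works on a $p \times p$ matrix rather than on the data.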

A Fast Divide-and-Conquer Sparse Cox Regression
