A Fast Divide-and-Conquer Sparse Cox Regression

Yan Wang, Nathan Palmer, Qian Di, Joel Schwartz, Isaac Kohane, Tianxi Cai
DOI: https://doi.org/10.48550/arXiv.1804.00735
2018-04-03
Abstract: We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0\gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least square approximation to the partial likelihood (PL). These sequences of linearization enable us to maximize the PL with only a small subset and perform penalized estimation via a fast approximation to the PL. The algorithm is applicable for the analysis of both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm substantially outperforms the full sample-based estimators and the existing DAC algorithm with respect to the computational speed, while it achieves similar statistical efficiency as the full sample-based estimators. The proposed algorithm was applied to an extraordinarily large time-independent survival dataset and an extraordinarily large time-dependent survival dataset for the prediction of heart failure-specific readmission within 30 days among Medicare heart failure patients.
Computation, Applications
What problem does this paper attempt to address?
This paper addresses the computational cost of fitting sparse Cox regression models to large-scale datasets. Specifically, when the sample size \(n_0\) is very large and the covariate dimension \(p\) is not small but \(n_0\gg p\), fitting the Cox model with an adaptive LASSO penalty directly by traditional methods is computationally infeasible. To solve this problem, the authors propose a new algorithm based on the divide-and-conquer (DAC) strategy, called the **DAC lin** algorithm. Through a series of linearization steps, this algorithm significantly reduces the computational burden while maintaining statistical efficiency similar to that of the full-sample estimator.

### Main contributions:

1. **Improved computational efficiency**: By combining the divide-and-conquer strategy with linear approximations, the DAC lin algorithm significantly improves computational efficiency on large-scale datasets.
2. **Wide range of applicability**: The algorithm applies to both time-independent and time-dependent survival data.
3. **Excellent statistical performance**: Simulation results show that the DAC lin algorithm is substantially faster than the full-sample estimators and the existing DAC algorithm, while achieving statistical efficiency similar to that of the full-sample estimators.

### Application examples:

The authors apply the proposed DAC lin algorithm to a large-scale dataset containing more than 9.5 million Medicare patients to predict the risk of readmission or death due to heart failure within 30 days after discharge for heart failure patients. The results show that the algorithm can effectively identify multiple covariates related to the risk of readmission or death within 30 days, with much higher computational efficiency than the traditional full-sample method.
### Specific problem-solving:

- **High computational complexity**: Directly fitting the adaptive LASSO-penalized Cox model on large-scale datasets is computationally expensive and difficult to complete within a reasonable time.
- **Maintaining statistical efficiency**: The algorithm improves computational efficiency while maintaining statistical efficiency similar to that of the full-sample estimator, which preserves the prediction accuracy of the model.

Through these improvements, the DAC lin algorithm provides an efficient and reliable solution for large-scale survival data analysis.
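
To make the three stages named in the abstract concrete (an initial fit on a small subset, one-step linear updates aggregated over all subsets, and a least-squares approximation to the partial likelihood with an adaptive LASSO penalty), the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default penalty `lam = n ** (-2/3)`, the number of updates, the coordinate-descent solver, and the assumption of time-independent covariates with no tied event times are all choices made here for the example.

```python
import numpy as np

def cox_score_info(beta, X, time, event):
    """Score and observed information of the Cox log partial likelihood.
    Assumes no tied event times (illustrative simplification)."""
    order = np.argsort(-time)            # latest first: prefix sums equal risk-set sums
    X, d = X[order], event[order].astype(bool)
    w = np.exp(X @ beta)
    S0 = np.cumsum(w)
    S1 = np.cumsum(w[:, None] * X, axis=0)
    S2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
    xbar = S1 / S0[:, None]              # risk-set weighted covariate means
    score = (X[d] - xbar[d]).sum(axis=0)
    info = (S2[d] / S0[d, None, None]).sum(axis=0) - xbar[d].T @ xbar[d]
    return score, info

def fit_cox_newton(X, time, event, n_iter=25):
    """Plain Newton-Raphson maximization of the partial likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = cox_score_info(beta, X, time, event)
        beta = beta + np.linalg.solve(info, score)
    return beta

def adaptive_lasso_lsa(beta_tilde, A, lam, n_sweeps=200):
    """Coordinate descent for
    0.5 * (b - beta_tilde)' A (b - beta_tilde) + lam * sum_j |b_j| / |beta_tilde_j|."""
    w = 1.0 / np.abs(beta_tilde)         # adaptive weights
    b = beta_tilde.copy()
    for _ in range(n_sweeps):
        for j in range(len(b)):
            # contribution of the other coordinates to the gradient in b_j
            r = A[j] @ (b - beta_tilde) - A[j, j] * (b[j] - beta_tilde[j])
            z = A[j, j] * beta_tilde[j] - r
            b[j] = np.sign(z) * max(abs(z) - lam * w[j], 0.0) / A[j, j]
    return b

def dac_sparse_cox(X, time, event, n_splits=10, n_updates=2, lam=None, seed=0):
    """Hypothetical DAC-style sparse Cox fit; requires n_updates >= 1."""
    n = X.shape[0]
    blocks = np.array_split(np.random.default_rng(seed).permutation(n), n_splits)
    # Stage 1: full Newton fit on a single subset only.
    first = blocks[0]
    beta = fit_cox_newton(X[first], time[first], event[first])
    # Stage 2: one-step updates with score and information summed over subsets.
    for _ in range(n_updates):
        parts = [cox_score_info(beta, X[b], time[b], event[b]) for b in blocks]
        score = sum(s for s, _ in parts)
        info = sum(a for _, a in parts)
        beta = beta + np.linalg.solve(info, score)
    # Stage 3: penalized least-squares approximation, reusing the last
    # aggregated information matrix (evaluated just before the final update).
    lam = n ** (-2 / 3) if lam is None else lam
    return adaptive_lasso_lsa(beta, info / n, lam)

def simulate(n, beta, seed=1):
    """Exponential survival times with independent exponential censoring."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta)))
    t_event = rng.exponential(1.0 / np.exp(X @ beta))
    t_cens = rng.exponential(3.0, size=n)
    return X, np.minimum(t_event, t_cens), (t_event <= t_cens).astype(int)
```

A call such as `dac_sparse_cox(*simulate(20000, np.array([0.5, -0.5, 0.0, 0.0, 0.0])))` runs all three stages on simulated data. Only the first stage iterates Newton's method to convergence, and it does so on one subset. The later stages need just one pass over each subset per update plus a small \(p\)-dimensional penalized quadratic problem, which is where the computational savings described above would come from.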