Replica analysis of overfitting in regression models for time to event data: the impact of censoring

Emanuele Massa, Alexander Mozeika, Anthony Coolen
2023-12-06
Abstract: We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) replica symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and for the cumulative hazard.
Methodology, Disordered Systems and Neural Networks, Statistics Theory
What problem does this paper attempt to address?
This paper addresses overfitting in high-dimensional time-to-event data (such as survival data in medical research) in the presence of censoring. Specifically, it models the impact of censoring on overfitting using statistical mechanics methods (in particular the replica method) and proposes a new algorithm that corrects the bias of the maximum likelihood estimator (MLE), for both the association parameters and the cumulative hazard.

### Main Research Content:

1. **Overfitting Problem**: In the overfitting regime, maximum likelihood estimation yields biased parameter estimates. The bias appears already at small values of the ratio of the number of covariates to the number of samples, and it degrades the model's predictive ability.
2. **Impact of Censoring**: In real medical data, censoring is ubiquitous (e.g., patients may drop out of a study before it ends). Previous overfitting analyses left censoring out for mathematical convenience, so they cannot be applied directly to such data.
3. **Statistical Mechanics Methods**: The paper uses the replica method from statistical mechanics to model the impact of censoring on overfitting in Cox's proportional hazards model, and derives extended replica symmetric (RS) equations.
4. **Algorithm Development**: A new numerical algorithm self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. It therefore corrects the bias of the MLE without knowledge of the data-generating distribution.
5. **Empirical Validation**: Numerical simulations agree closely with the theoretical predictions, even at large censoring rates, and show that the corrected estimators are free of the bias of the MLE.

### Research Significance:

- **Theoretical Contribution**: Extends existing overfitting analyses to data with censoring, which makes the theory applicable to real-world medical data.
- **Practical Application**: Provides a tool for correcting MLE bias in high-dimensional time-to-event data without requiring knowledge of the data-generating distribution.
Through this research, the paper provides a new theoretical foundation and practical tools for addressing the overfitting problem in high-dimensional time-to-event data, which is particularly valuable in medical research.
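To make the setting concrete, the following is a minimal sketch of how censored time-to-event data arise under a Cox proportional hazards model and how the Cox log partial likelihood is evaluated on them. It is not the authors' algorithm and does not solve the RS equations. The unit baseline hazard, the independent exponential censoring times, the Gaussian covariates, and the function names `simulate_censored_cox` and `log_partial_likelihood` are all assumptions chosen for illustration.

```python
import math
import random


def simulate_censored_cox(n, beta, censor_rate, rng):
    """Draw n samples from a Cox model with unit baseline hazard (assumed).

    Event times have constant hazard exp(beta . z). Censoring times are
    independent exponentials with rate censor_rate (an illustrative choice).
    Each sample is (observed_time, is_event, covariates).
    """
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in beta]
        risk = math.exp(sum(b * x for b, x in zip(beta, z)))
        event_time = rng.expovariate(risk)
        censor_time = rng.expovariate(censor_rate)
        observed = min(event_time, censor_time)
        is_event = event_time <= censor_time
        data.append((observed, is_event, z))
    return data


def log_partial_likelihood(data, beta):
    """Cox log partial likelihood, assuming no tied observation times."""
    # Visit samples by decreasing time, so the running sum always covers
    # the risk set {j : t_j >= t_i} of the sample being visited.
    ordered = sorted(data, key=lambda row: row[0], reverse=True)
    running = 0.0
    total = 0.0
    for _, is_event, z in ordered:
        score = sum(b * x for b, x in zip(beta, z))
        running += math.exp(score)
        if is_event:  # censored samples only contribute to risk sets
            total += score - math.log(running)
    return total


rng = random.Random(0)
true_beta = [0.5, -0.5, 0.25]
data = simulate_censored_cox(500, true_beta, censor_rate=1.0, rng=rng)
censored_fraction = sum(1 for _, is_event, _ in data if not is_event) / len(data)
print(f"censored fraction:   {censored_fraction:.2f}")
print(f"log PL at true beta: {log_partial_likelihood(data, true_beta):.2f}")
print(f"log PL at beta = 0:  {log_partial_likelihood(data, [0.0] * 3):.2f}")
```

Censored samples never add their own term to the likelihood, but they remain in the risk sets of earlier events. Raising `censor_rate` therefore reduces the number of informative terms available for the same number of covariates, which is one intuitive reason why censoring matters for overfitting.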