Conditioning on Time is All You Need for Synthetic Survival Data Generation

Mohd Ashhad, Ricardo Henao
2024-05-28
Abstract: Synthetic data generation holds considerable promise, offering avenues to enhance privacy, fairness, and data accessibility. Despite the availability of various methods for generating synthetic tabular data, challenges persist, particularly in specialized applications such as survival analysis. One significant obstacle in survival data generation is censoring, which manifests as not knowing the precise timing of observed (target) events for certain instances. Existing methods face difficulties in accurately reproducing the real distribution of event times for both observed (uncensored) events and censored events, i.e., the generated event-time distributions do not accurately match the underlying distributions of the real data. So motivated, we propose a simple paradigm to produce synthetic survival data by generating covariates conditioned on event times (and censoring indicators), thus allowing one to reuse existing conditional generative models for tabular data without significant computational overhead, and without making assumptions about the (usually unknown) generation mechanism underlying censoring. We evaluate this method via extensive experiments on real-world datasets. Our methodology outperforms multiple competitive baselines at generating survival data, while improving the performance of downstream survival models trained on it and tested on real data.
Machine Learning
### What problem does this paper attempt to address?

This paper addresses key challenges in generating high-quality synthetic survival data. Specifically, it focuses on the following core issues:

1. **Reproducing the true event-time distribution**: Existing synthetic data generation methods have difficulty accurately reproducing the distribution of event times in real survival data, for both uncensored and censored events. As a result, the generated data may not faithfully reflect the temporal characteristics of the original data.
2. **Handling censored data**: A central issue in survival analysis is censoring, that is, the exact event times of some individuals are not observed. Existing methods struggle with such data, which leads to a mismatch between the generated event-time distribution and that of the real data.
3. **Improving the performance of downstream models**: The synthetic data should be usable for training survival analysis models that are then tested on real data, ideally performing no worse than models trained directly on real data. The quality of the synthetic data therefore directly affects downstream performance.

To address these problems, the authors propose a method based on conditional generation. Rather than generating event times, the method generates covariates conditioned on event times and censoring indicators, so that the generated data better match the distribution of the real data. The specific steps are as follows:

- **Sampling event times and censoring indicators**: Sample an event time \(\tilde{t}\) and a censoring indicator \(\tilde{e}\) from the empirical distribution.
- **Generating covariates**: Use a conditional generative model to generate covariates \(\tilde{x}\) given the sampled event time and censoring indicator.

This approach simplifies the generation process and does not depend on a specific generator architecture, so existing conditional generative models for tabular data can be reused. Because event times and censoring indicators are drawn from the empirical distribution, their synthetic distributions are consistent with the real data, and no assumptions are needed about the mechanism underlying censoring. Experimental results show that the method outperforms multiple competitive baselines on real-world datasets, both in the quality of the generated data and in the performance of downstream models.

### Formula summary

- **Sampling of censoring indicators, event times, and covariates**:
\[ \tilde{e}\sim p(e),\quad\tilde{t}\sim p(t|\tilde{e}),\quad u\sim p(u),\quad\tilde{x}\sim p_{\theta}(x|\tilde{t},\tilde{e},u) \]
- **Survival function**:
\[ S(t|x)=\int_{t}^{\infty}p(t'|x)\,dt' \]
- **Expected lifespan**:
\[ \mu(x)=\int_{0}^{\infty}t'\,p(t'|x)\,dt'=\int_{0}^{\infty}S(t|x)\,dt \]

With these definitions and the sampling procedure above, the paper addresses the key problem of generating high-quality synthetic survival data.
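The two-step procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the paper reuses existing conditional generative models for tabular data, whereas the sketch substitutes a toy linear-Gaussian model for \(p_{\theta}(x|t,e,u)\). The function names `fit_linear_gaussian` and `generate`, the `log1p` transform of time, the Gaussian choice for \(u\), and the toy dataset are all hypothetical. Resampling observed \((t, e)\) pairs with replacement is equivalent to drawing \(\tilde{e}\sim p(e)\) and then \(\tilde{t}\sim p(t|\tilde{e})\) under the empirical distribution.

```python
import numpy as np


def fit_linear_gaussian(x, t, e):
    """Toy stand-in for a conditional generator p_theta(x | t, e, u).

    Assumption: the covariate mean is linear in (log1p(t), e, 1), and the
    residuals are Gaussian. A real conditional tabular model would go here.
    """
    cond = np.column_stack([np.log1p(t), e, np.ones_like(t)])
    weights, *_ = np.linalg.lstsq(cond, x, rcond=None)
    resid = x - cond @ weights
    cov = np.atleast_2d(np.cov(resid, rowvar=False))
    # A small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-8 * np.eye(x.shape[1]))
    return weights, chol


def generate(x, t, e, n_samples, seed=0):
    """Sample (t, e) from the empirical distribution, then x given (t, e, u)."""
    rng = np.random.default_rng(seed)
    weights, chol = fit_linear_gaussian(x, t, e)

    # Step 1: resample observed (t, e) pairs with replacement.
    idx = rng.integers(0, len(t), size=n_samples)
    t_syn, e_syn = t[idx], e[idx]

    # Step 2: draw noise u ~ N(0, I) and map (t, e, u) to covariates.
    u = rng.standard_normal((n_samples, x.shape[1]))
    cond = np.column_stack([np.log1p(t_syn), e_syn, np.ones_like(t_syn)])
    x_syn = cond @ weights + u @ chol.T
    return x_syn, t_syn, e_syn


# Hypothetical toy survival dataset with random right-censoring.
rng = np.random.default_rng(42)
n = 500
x_real = rng.standard_normal((n, 3))
event_time = rng.exponential(np.exp(0.5 * x_real[:, 0]))
censor_time = rng.exponential(2.0, size=n)
t_real = np.minimum(event_time, censor_time)
e_real = (event_time <= censor_time).astype(int)

x_syn, t_syn, e_syn = generate(x_real, t_real, e_real, n_samples=1000)
```

Because every synthetic \((\tilde{t},\tilde{e})\) pair is copied from the real data, the synthetic event-time and censoring distributions match the empirical ones by construction. Only the covariate model has to be learned, which is the design choice the summary highlights.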
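The equality between the two expressions for the expected lifespan in the formula summary is stated without derivation. A short worked derivation, assuming non-negative event times so that the order of integration can be exchanged, is:

\[
\begin{aligned}
\mu(x) &= \int_{0}^{\infty} t'\,p(t'|x)\,dt'
        = \int_{0}^{\infty}\left(\int_{0}^{t'} dt\right)p(t'|x)\,dt' \\
       &= \int_{0}^{\infty}\int_{t}^{\infty} p(t'|x)\,dt'\,dt
        = \int_{0}^{\infty} S(t|x)\,dt
\end{aligned}
\]

As an illustrative check (the exponential model and the rate \(\lambda(x)\) are assumptions made here, not taken from the paper), let \(p(t|x)=\lambda(x)e^{-\lambda(x)t}\). Then \(S(t|x)=e^{-\lambda(x)t}\), and both expressions give \(\mu(x)=1/\lambda(x)\).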