Abstract: Synthetic data generation holds considerable promise, offering avenues to enhance privacy, fairness, and data accessibility. Despite the availability of various methods for generating synthetic tabular data, challenges persist, particularly in specialized applications such as survival analysis. One significant obstacle in survival data generation is censoring, which manifests as not knowing the precise timing of observed (target) events for certain instances. Existing methods face difficulties in accurately reproducing the real distribution of event times for both observed (uncensored) events and censored events, i.e., the generated event-time distributions do not accurately match the underlying distributions of the real data. So motivated, we propose a simple paradigm to produce synthetic survival data by generating covariates conditioned on event times (and censoring indicators), thus allowing one to reuse existing conditional generative models for tabular data without significant computational overhead, and without making assumptions about the (usually unknown) generation mechanism underlying censoring. We evaluate this method via extensive experiments on real-world datasets. Our methodology outperforms multiple competitive baselines at generating survival data, while improving the performance of downstream survival models trained on it and tested on real data.

### What problem does this paper attempt to address?

This paper addresses the problem of generating high-quality synthetic survival data. It focuses on three core issues:

1. **Reproducing the real event-time distribution**: Existing synthetic data generators struggle to reproduce the distribution of event times in real survival data, for both uncensored and censored events. As a result, the generated data may not faithfully reflect the temporal characteristics of the original data.
2. **Handling censored data**: Survival data contain censored observations, for which the precise event time is not known. Existing methods handle censoring poorly, which leads to a mismatch between the generated and the real event-time distributions.
3. **Improving the performance of downstream models**: Synthetic data should be usable for training survival models that are then tested on real data. Ideally, such models should perform at least as well as models trained directly on real data.

To address these issues, the authors propose a method based on conditional generation: covariates are generated conditioned on event times and censoring indicators, so that the generated data better match the distribution of the real data. The steps are as follows:

- **Sampling event times and censoring indicators**: Sample a censoring indicator \(\tilde{e}\) and an event time \(\tilde{t}\) from the empirical distribution.
- **Generating covariates**: Use a conditional generative model to generate covariates \(\tilde{x}\) given the sampled event time and censoring indicator.

This approach simplifies generation and does not depend on a specific generator architecture: existing conditional generative models for tabular data can be reused. Because event times and censoring indicators are drawn from the empirical distribution, the generated event-time and censoring distributions are consistent with the real data. Experiments on multiple real-world datasets show that the method outperforms competitive baselines, both in the quality of the generated data and in the performance of downstream models.

### Formula summary

- **Sampling procedure** (here \(u\) is the generator's random noise input and \(\theta\) denotes its parameters):
  \[ \tilde{e}\sim p(e),\quad\tilde{t}\sim p(t|\tilde{e}),\quad u\sim p(u),\quad\tilde{x}\sim p_{\theta}(x|\tilde{t},\tilde{e},u) \]
- **Survival function**:
  \[ S(t|x)=\int_{t}^{\infty}p(t'|x)\,dt' \]
- **Expected lifetime**:
  \[ \mu(x)=\int_{0}^{\infty}t'p(t'|x)\,dt'=\int_{0}^{\infty}S(t|x)\,dt \]

Together, these components define the paper's approach to generating high-quality synthetic survival data.
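The two sampling steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `sample_event_and_time` and `LinearGaussianGenerator` are hypothetical names, the toy "real" dataset is simulated, and the linear-Gaussian model is only a stand-in for whichever conditional tabular generator one would actually reuse.

```python
import numpy as np


def sample_event_and_time(t_real, e_real, n, rng):
    """Draw e~ from the empirical p(e), then t~ from the empirical p(t | e~)."""
    e_syn = (rng.random(n) < e_real.mean()).astype(int)
    t_syn = np.empty(n)
    for value in (0, 1):
        mask = e_syn == value
        if mask.any():
            # Resample observed times that share the same censoring indicator.
            t_syn[mask] = rng.choice(t_real[e_real == value], size=mask.sum())
    return t_syn, e_syn


class LinearGaussianGenerator:
    """Hypothetical stand-in for a conditional generator p_theta(x | t, e, u)."""

    @staticmethod
    def _features(t, e):
        # Conditioning features: intercept, log-time, censoring indicator.
        return np.column_stack([np.ones_like(t, dtype=float), np.log1p(t), e])

    def fit(self, x, t, e):
        feats = self._features(t, e)
        self.W, *_ = np.linalg.lstsq(feats, x, rcond=None)
        resid = x - feats @ self.W
        cov = np.atleast_2d(np.cov(resid, rowvar=False))
        # Small ridge term keeps the Cholesky factorization stable.
        self.L = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))
        return self

    def sample(self, t, e, rng):
        u = rng.standard_normal((len(t), self.L.shape[0]))  # u ~ p(u)
        return self._features(t, e) @ self.W + u @ self.L.T


# Toy "real" data (simulated purely for illustration).
rng = np.random.default_rng(0)
n_real = 500
x_real = rng.standard_normal((n_real, 3))
event_time = rng.exponential(np.exp(0.5 * x_real[:, 0]))
censor_time = rng.exponential(2.0, size=n_real)
t_real = np.minimum(event_time, censor_time)
e_real = (event_time <= censor_time).astype(int)

# Step 1: sample (t~, e~); step 2: generate covariates conditioned on them.
generator = LinearGaussianGenerator().fit(x_real, t_real, e_real)
t_syn, e_syn = sample_event_and_time(t_real, e_real, 1000, rng)
x_syn = generator.sample(t_syn, e_syn, rng)
```

In this sketch every synthetic time is resampled from the observed times with the same indicator, so the synthetic event-time and censoring distributions match the empirical ones by construction. Only the covariate model has to be learned.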
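The second equality in the expected-lifetime formula is stated without proof. A short derivation, assuming nonnegative event times so that the order of integration can be exchanged, writes \(t'\) as \(\int_{0}^{t'}dt\) and swaps the two integrals:

```latex
\begin{aligned}
\mu(x) &= \int_{0}^{\infty} t'\,p(t'|x)\,dt'
        = \int_{0}^{\infty}\left(\int_{0}^{t'} dt\right) p(t'|x)\,dt' \\
       &= \int_{0}^{\infty}\int_{t}^{\infty} p(t'|x)\,dt'\,dt
        = \int_{0}^{\infty} S(t|x)\,dt .
\end{aligned}
```

The inner integral in the second line is exactly the survival function \(S(t|x)\) defined earlier.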

Conditioning on Time is All You Need for Synthetic Survival Data Generation
