Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence

Jan-Niklas Eckardt, Waldemar Hahn, Christoph Röllig, Sebastian Stasik, Uwe Platzbecker, Carsten Müller-Tidow, Hubert Serve, Claudia D. Baldus, Christoph Schliemann, Kerstin Schäfer-Eckart, Maher Hanoun, Martin Kaufmann, Andreas Burchert, Christian Thiede, Johannes Schetelig, Martin Sedlmayr, Martin Bornhäuser, Markus Wolfien, Jan Moritz Middeke
DOI: https://doi.org/10.1038/s41746-024-01076-x
IF: 15.2
2024-03-22
npj Digital Medicine
Abstract: Clinical research relies on high-quality patient data, however, obtaining big data sets is costly and access to existing data is often hindered by privacy and regulatory concerns. Synthetic data generation holds the promise of effectively bypassing these boundaries allowing for simplified data accessibility and the prospect of synthetic control cohorts. We employed two different methodologies of generative artificial intelligence – CTAB-GAN+ and normalizing flows (NFlow) – to synthesize patient data derived from 1606 patients with acute myeloid leukemia, a heterogeneous hematological malignancy, that were treated within four multicenter clinical trials. Both generative models accurately captured distributions of demographic, laboratory, molecular and cytogenetic variables, as well as patient outcomes yielding high performance scores regarding fidelity and usability of both synthetic cohorts (n = 1606 each). Survival analysis demonstrated close resemblance of survival curves between original and synthetic cohorts. Inter-variable relationships were preserved in univariable outcome analysis enabling explorative analysis in our synthetic data. Additionally, training sample privacy is safeguarded mitigating possible patient re-identification, which we quantified using Hamming distances. We provide not only a proof-of-concept for synthetic data generation in multimodal clinical data for rare diseases, but also full public access to synthetic data sets to foster further research.
health care sciences & services,medical informatics
What problem does this paper attempt to address?
The paper addresses a bottleneck in clinical research: high-quality patient data is costly to acquire, and access to existing data is restricted by privacy and regulatory constraints. To address this, the authors use generative artificial intelligence to produce synthetic cohorts of acute myeloid leukemia (AML) patients. Specifically, the authors' goals are:

1. **Overcome data acquisition barriers**: Generating synthetic data bypasses the time and financial cost of collecting real data and reduces the limitations imposed by privacy and regulatory requirements.
2. **Verify the quality of synthetic data**: Ensure that the synthetic data closely matches the real data in statistical characteristics, variable distributions, and patient outcomes, so that it can be used to mimic clinical trials and for exploratory analysis.
3. **Protect patient privacy**: Ensure that the synthetic data does not disclose personal information of the original patients and that the risk of re-identification is mitigated.
4. **Provide publicly accessible data resources**: Release the synthetic datasets publicly to promote research on rare diseases such as AML.

To this end, the authors trained two generative models, CTAB-GAN+ and normalizing flows (NFlow), on real data from 1,606 AML patients treated in four multicenter clinical trials, and generated two synthetic patient cohorts of the same size. Both models showed high fidelity in baseline characteristics and patient outcomes, produced survival curves that closely resemble those of the original cohort, and safeguarded the privacy of the training samples.

### Formula summary

- **Hamming distance** is used to evaluate the privacy protection effect:
  \[ \text{Hamming Distance}=\sum_{i = 1}^{n}I(x_{i}\neq y_{i}) \]
  where \(I\) is the indicator function, \(n\) is the number of features, and \(x_{i}\) and \(y_{i}\) are the values of the two data points on the \(i\)-th feature.
- **Kaplan-Meier Divergence** is used to evaluate the difference in survival curves:
  \[ \text{Kaplan-Meier Divergence}=1-\frac{\sum_{t}|S_{\text{synthetic}}(t)-S_{\text{real}}(t)|}{\sum_{t}S_{\text{real}}(t)} \]
  where \(S_{\text{synthetic}}(t)\) and \(S_{\text{real}}(t)\) are the survival probabilities of the synthetic data and the real data at time \(t\).
- **Privacy Leakage Coefficient**:
  \[ \text{Privacy Leakage Coefficient}=\frac{\text{Average Hamming Distance (Synthetic to Test)}}{\text{Average Hamming Distance (Synthetic to Training)}} \]

With these metrics, the authors show that the synthetic data protects patient privacy while retaining the clinical and biological characteristics of the real data, and they provide publicly accessible synthetic datasets to promote further research.
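
The three formulas in this summary can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, records are assumed to be equal-length tuples of categorical or discretized feature values, the "average Hamming distance" is assumed to mean the average distance from each synthetic record to its nearest record in the reference set, and survival probabilities are assumed to be already evaluated on a shared time grid.

```python
from typing import Hashable, Sequence

Record = Sequence[Hashable]


def hamming_distance(x: Record, y: Record) -> int:
    """Count the features on which two records differ."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of features")
    return sum(a != b for a, b in zip(x, y))


def mean_nearest_distance(synthetic: Sequence[Record],
                          reference: Sequence[Record]) -> float:
    """Average distance from each synthetic record to its closest reference record.

    ASSUMPTION: the summary does not say how the average is formed;
    nearest-record averaging is one plausible reading.
    """
    if not synthetic or not reference:
        raise ValueError("both record sets must be non-empty")
    nearest = [min(hamming_distance(s, r) for r in reference) for s in synthetic]
    return sum(nearest) / len(nearest)


def privacy_leakage_coefficient(synthetic: Sequence[Record],
                                train: Sequence[Record],
                                test: Sequence[Record]) -> float:
    """Ratio of the synthetic-to-test distance to the synthetic-to-training distance."""
    to_train = mean_nearest_distance(synthetic, train)
    if to_train == 0:
        # Every synthetic record is an exact copy of some training record.
        raise ValueError("synthetic-to-training distance is zero")
    return mean_nearest_distance(synthetic, test) / to_train


def km_curve_score(s_synthetic: Sequence[float], s_real: Sequence[float]) -> float:
    """Evaluate the Kaplan-Meier formula exactly as written in the summary."""
    if len(s_synthetic) != len(s_real):
        raise ValueError("curves must be sampled on the same time grid")
    total_real = sum(s_real)
    if total_real == 0:
        raise ValueError("real survival curve sums to zero")
    gap = sum(abs(a - b) for a, b in zip(s_synthetic, s_real))
    return 1 - gap / total_real


if __name__ == "__main__":
    # Toy records with three binary features each (illustrative values only).
    train = [(1, 0, 1), (0, 0, 0)]
    test = [(1, 1, 1), (0, 1, 0)]
    synth = [(1, 0, 0), (0, 0, 1)]
    print(privacy_leakage_coefficient(synth, train, test))
    print(km_curve_score([1.0, 0.7, 0.5], [1.0, 0.8, 0.5]))
```

Two properties follow directly from the formulas as written. The Kaplan-Meier expression equals 1 when the two curves coincide, so despite its name it behaves like a similarity score. Under the nearest-record reading assumed here, a coefficient above 1 means synthetic records sit closer on average to training records than to held-out test records.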