Abstract: Each year, hundreds of clinical trials are conducted to evaluate new medical interventions, but sharing patient records from these trials with other institutions can be challenging due to privacy concerns and federal regulations. To help mitigate privacy concerns, researchers have proposed methods for generating synthetic patient data. However, existing approaches for generating synthetic clinical trial data disregard the usage requirements of these data, including maintaining specific properties of clinical outcomes, and only use post hoc assessments that are not coupled with the data generation process. In this paper, we propose SynRL, which leverages reinforcement learning to improve the performance of patient data generators by customizing the generated data to meet the user-specified requirements for synthetic data outcomes and endpoints. Our method includes a data value critic function to evaluate the quality of the generated data and uses reinforcement learning to align the data generator with the users' needs based on the critic's feedback. We performed experiments on four clinical trial datasets and demonstrated the advantages of SynRL in improving the quality of the generated synthetic data while keeping the privacy risks low. We also show that SynRL can be utilized as a general framework that can customize data generation of multiple types of synthetic data generators. Our code is available at https://anonymous.4open.science/r/SynRL-DB0F/.

What problem does this paper attempt to address?

The main problem this paper addresses is how to generate synthetic clinical trial data that meets specific clinical endpoints and user requirements while maintaining high quality and low privacy risk. Existing synthetic data generation methods usually overlook how the data will actually be used, such as the need to preserve specific properties of clinical outcomes, and rely only on post hoc evaluations that are not coupled with the data generation process.

The authors propose SynRL (Synthetic Reinforcement Learning), a method that uses reinforcement learning to improve patient data generators so that the generated data meets user-specified requirements for synthetic data outcomes and endpoints. SynRL introduces a data value critic function, which plays the role of a reward function, to evaluate the quality of the generated data. It then uses reinforcement learning to adjust the data generator according to the critic's feedback, so that the generator better matches user needs.

### Summary of problems solved:

1. **Data sharing under privacy protection**: Hundreds of clinical trials are carried out every year, but sharing patient records from these trials faces privacy concerns and regulatory restrictions. SynRL aims to generate synthetic patient data that supports in-depth data analysis while mitigating those privacy concerns.
2. **Improving the quality of synthetic data**: Existing methods for generating synthetic clinical trial data do not fully consider the requirements of downstream tasks, such as predicting mortality or the frequency of adverse events. SynRL optimizes the generated data through reinforcement learning to improve performance on such downstream tasks.
3. **Customized synthetic data generation**: Users may be interested in specific clinical endpoints or outcomes, such as mortality prediction or the frequency of adverse events. SynRL can generate customized synthetic datasets according to users' preferences, retaining these key properties to support downstream tasks.

In experiments on four clinical trial datasets, SynRL improved the quality of the generated data while keeping privacy risk low. In addition, SynRL works as a general framework that can customize data generation for multiple types of synthetic data generators.
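The critic-and-feedback loop described above can be illustrated with a deliberately small sketch. It is not the authors' implementation: the generator here is a toy Bernoulli model that emits one binary outcome per synthetic patient (for example, whether an adverse event occurred), the critic is a hypothetical function that rewards batches whose event rate is close to a user-specified target, and the update rule is a plain REINFORCE policy gradient, chosen only for illustration because the summary does not name a specific reinforcement learning algorithm. The names `critic`, `train_generator`, `target_rate`, and all hyperparameters are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def critic(batch, target_rate):
    """Hypothetical data value critic: reward is highest (0.0) when the
    batch's event rate equals the user-specified target rate."""
    rate = sum(batch) / len(batch)
    return -abs(rate - target_rate)


def train_generator(target_rate, steps=2000, batch_size=64, lr=5.0, seed=0):
    """Toy generator alignment loop (assumed setup, not the paper's method).

    The generator is a single logit; each synthetic patient's binary
    outcome is drawn from Bernoulli(sigmoid(logit)).
    """
    rng = random.Random(seed)
    logit = 0.0      # generator parameter; starts at event probability 0.5
    baseline = 0.0   # running mean of rewards, used to reduce variance
    for _ in range(steps):
        p = sigmoid(logit)
        # 1. Generate a batch of synthetic outcomes.
        batch = [1 if rng.random() < p else 0 for _ in range(batch_size)]
        # 2. Score the batch with the critic.
        reward = critic(batch, target_rate)
        # 3. REINFORCE: for a Bernoulli with logit parameter,
        #    d/d(logit) log P(batch) = sum(x_i - p).
        grad_log_prob = sum(x - p for x in batch) / batch_size
        logit += lr * (reward - baseline) * grad_log_prob
        # 4. Update the baseline after it has been used.
        baseline = 0.9 * baseline + 0.1 * reward
    return sigmoid(logit)


if __name__ == "__main__":
    learned = train_generator(target_rate=0.2)
    print(f"learned event probability: {learned:.3f}")
```

Under these assumptions, calling `train_generator(target_rate=0.2)` should move the generator's event probability from its initial 0.5 toward the requested 0.2. A real patient data generator would have many parameters and a richer critic, but the loop has the same shape: generate, score, and update the generator from the critic's feedback.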

SynRL: Aligning Synthetic Clinical Trial Data with Human-preferred Clinical Endpoints Using Reinforcement Learning

Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Generating Synthetic Mixed-Type Longitudinal Electronic Health Records for Artificial Intelligent Applications

Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials

TrialSynth: Generation of Synthetic Sequential Clinical Trial Data

Generating high-fidelity synthetic patient data for assessing machine learning healthcare software

Synthetic Data Distillation Enables the Extraction of Clinical Information at Scale

Accelerating Clinical Evidence Synthesis with Large Language Models

Fairness-Optimized Synthetic EHR Generation for Arbitrary Downstream Predictive Tasks

Keeping synthetic patients on track: feedback mechanisms to mitigate performance drift in longitudinal health data simulation

Synthetic data for privacy-preserving clinical risk prediction

Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records

Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence

Synthetic Sample Selection via Reinforcement Learning

TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets

Controllable Synthetic Clinical Note Generation with Privacy Guarantees

Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling

DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime

Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results

Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking

Synthetic Data Generator for Adaptive Interventions in Global Health