Abstract: Recent advancements in generative artificial intelligence have shown promise in producing realistic images from complex data distributions. We developed a denoising diffusion probabilistic model trained on the CheXchoNet dataset, encoding the joint distribution of demographic data and echocardiogram measurements. We generated a synthetic dataset skewed towards younger patients with a higher prevalence of structural left ventricle disease. A diagnostic deep learning model trained on the synthetic dataset performed comparably to one trained on real data, producing an AUROC of 0.75 (95% CI 0.72-0.77), with similar performance on an internal dataset. Combining real data with positive samples from the synthetic data improved diagnostic accuracy, producing an AUROC of 0.80 (95% CI 0.78-0.82). Subgroup analysis showed the largest performance improvement among younger patients. These results suggest diffusion models can increase diagnostic accuracy and fine-tune models for specific populations.
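For readers unfamiliar with the model class named in the abstract, the standard denoising diffusion probabilistic model formulation is sketched below. This is the generic textbook form, given here as background under the assumption that the paper follows it; the abstract does not state the noise schedule, the network, or how demographic data and echocardiogram measurements enter the model, so none of that is shown.

```latex
% Forward (noising) step with variance schedule \beta_t, t = 1, ..., T
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Closed form for any t, with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right)

% Simplified training objective: the network \epsilon_\theta predicts the added noise
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon \sim \mathcal{N}(0, I)}
  \left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\rVert^2\right]
```

Sampling runs the learned reverse process from pure Gaussian noise back to a data sample, which is how a synthetic dataset of the kind described above would be drawn.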

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to improve diagnostic performance in detecting structural heart diseases, such as Severe Left Ventricular Hypertrophy (SLVH) and Dilated Left Ventricle (DLV), by generating synthetic datasets with a Denoising Diffusion Probabilistic Model (DDPM). Specifically, the paper attempts to solve the following key problems:

1. **Data insufficiency and bias problems**:
   - **Data scarcity**: Medical data are often scarce and difficult to obtain, which limits the training and generalization ability of machine learning models.
   - **Data bias**: Existing datasets may contain biases, for example underrepresenting certain age groups or genders, resulting in poor model performance in specific populations.
2. **Improving diagnostic accuracy**:
   - **Generating high-quality synthetic data**: Synthetic datasets, in particular ones that increase the proportion of young patients and of specific diseases, are used to improve the diagnostic accuracy of the model.
   - **Combining real and synthetic data**: The research found that a model trained on a combination of real and synthetic data achieves better diagnostic performance, especially in the target population.
3. **Model generalization ability**:
   - **Improving model generalization ability**: Diverse synthetic data are generated to reduce overfitting to the training data and to improve generalization to new datasets.

### Main methods and results

1. **Data generation**:
   - Train a DDPM and use it to generate a synthetic dataset that is skewed towards young patients and has a higher proportion of structural left ventricular disease.
   - The age distribution and disease prevalence of the synthetic dataset differ from those of the original dataset, which increases the representation of specific age groups and diseases.
2. **Model training and evaluation**:
   - Train a deep learning model on each of the following: the real dataset, the synthetic dataset, the real and synthetic datasets combined, and the real dataset combined with only the positive samples from the synthetic dataset.
   - Evaluate each model on different datasets. The main metrics are AUROC (Area Under the Receiver Operating Characteristic Curve) and AUPRC (Area Under the Precision-Recall Curve).
3. **Result analysis**:
   - **Performance of the synthetic dataset**: The model trained only on the synthetic dataset performs comparably to the model trained only on the real dataset, suggesting that synthetic data can stand in for real data in this setting.
   - **Performance of the combined datasets**: The model trained on a combination of real and synthetic data improves on almost all metrics, especially in the young patient group, where AUROC increases by 3.7%.
   - **Internal-cohort validation**: The model was also validated on an internally collected dataset. The results show that the model performs better on this dataset, with an AUROC of 0.84 on the composite label.

### Conclusion

This research improved diagnostic performance in detecting structural heart diseases by generating synthetic datasets, especially in the young patient group. In addition, the use of synthetic data increases data diversity, reduces overfitting, and improves generalization. These results indicate that diffusion models have great potential for generating synthetic data and can be applied to research in other medical fields.
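The evaluation described above (AUROC reported with a 95% confidence interval, and a training set built from real data plus synthetic positives) can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function names, the `label` key, and the use of a percentile bootstrap for the interval are all assumptions, since the summary does not say how the intervals were computed.

```python
import random


def auroc(labels, scores):
    """Rank-based AUROC: chance that a random positive outscores a random
    negative, with ties counted as one half."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed method, see lead-in)."""
    auroc(labels, scores)  # fail early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def add_synthetic_positives(real_rows, synthetic_rows, label_key="label"):
    """Return the real rows plus only the positive synthetic rows.
    Rows are plain dicts; `label_key` is a hypothetical field name."""
    positives = [row for row in synthetic_rows if row[label_key] == 1]
    return list(real_rows) + positives


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]
    print("AUROC:", auroc(y_true, y_score))
    print("95% CI:", bootstrap_auroc_ci(y_true, y_score))
```

Keeping only the positive synthetic samples raises disease prevalence in the training set without adding synthetic negatives, which matches the fourth training configuration listed above.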

Denoising diffusion model for increased performance of detecting structural heart disease

Counterfactual MRI Generation with Denoising Diffusion Models for Interpretable Alzheimer's Disease Effect Detection

MedDiff: Generating Electronic Health Records using Accelerated Denoising Diffusion Model

Efficient Semantic Diffusion Architectures for Model Training on Synthetic Echocardiograms

Debiasing Cardiac Imaging with Controlled Latent Diffusion Models

Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research

Deep Learning Discovery of Demographic Biomarkers in Echocardiography

Optimizing Object Detection Algorithms for Congenital Heart Diseases in Echocardiography: Exploring Bounding Box Sizes and Data Augmentation Techniques

High-resolution MRI synthesis using a data-driven framework with denoising diffusion probabilistic modeling

Evaluating Synthetic Diffusion MRI Maps created with Diffusion Denoising Probabilistic Models

Echo from noise: synthetic ultrasound image generation using diffusion models for real image segmentation

DiffECG: A Versatile Probabilistic Diffusion Model for ECG Signals Synthesis

Denoising diffusion probabilistic models for 3D medical image generation

Boosting Cardiac Color Doppler Frame Rates with Deep Learning

Generative Deep Learning and Signal Processing for Data Augmentation of Cardiac Auscultation Signals: Improving Model Robustness Using Synthetic Audio

2D medical image synthesis using transformer-based denoising diffusion probabilistic model

GH-DDM: the generalized hybrid denoising diffusion model for medical image generation

A Domain Translation Framework with an Adversarial Denoising Diffusion Model to Generate Synthetic Datasets of Echocardiography Images

Automated chest screening based on a hybrid model of transfer learning and convolutional sparse denoising autoencoder

Multi-modality deep learning model for prediction of chronic obstructive coronary artery disease