Abstract: The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that are not guaranteed to preserve the original task. Approximating the distribution of the data with generative models is a way of reducing that problem and of obtaining new samples that resemble the original data. Denoising Diffusion models are a promising Deep Learning technique that can learn good approximations of different kinds of data, such as images, time series, or tabular data. Automatic colonoscopy analysis, and specifically polyp localization in colonoscopy videos, is a task that can assist clinical diagnosis and treatment. Annotating video frames for training a deep learning model is time-consuming, and usually only small datasets can be obtained. Fine-tuning application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can jointly generate colonoscopy images with localization annotations, using a combination of existing open datasets. The generated data is used in various transfer learning experiments on the task of polyp localization with a model based on YOLO v9 in the low data regime.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is that, in the medical field and especially in colon polyp localization, the performance of deep-learning models is limited by scarce data. Specifically, the authors explore how to use diffusion models to generate synthetic data that augments the dataset, and investigate whether this synthetic data can effectively improve model performance when little real data is available.

### Main Problem Decomposition

1. **Data Scarcity Problem**:
   - Data acquisition and annotation in the medical field are costly and time-consuming, so the amount of available data is usually small.
   - An insufficient data volume leads to under-trained deep-learning models, which hurts their generalization ability and prediction accuracy.
2. **Effectiveness of Data Augmentation Methods**:
   - Traditional data augmentation techniques (such as rotation and flipping) can increase data diversity, but cannot ensure that the new samples remain relevant to the original task.
   - Using generative models (such as diffusion models) to generate new samples similar to real data can alleviate the data scarcity problem to a certain extent.
3. **Quality Evaluation of Synthetic Data**:
   - Whether the generated synthetic data is realistic enough to improve the performance of downstream tasks (such as polyp localization).
   - How to measure the quality of the generated data and its impact on model performance.

### Solutions

The paper solves the problem through the following steps:

1. **Select and Process Multiple Public Datasets**:
   - Use four public datasets as the basis: LDPolyp, SUN, PolypGEN, and BKAI-IGH NeoPolyp-Small. These datasets contain colonoscopy video frames and annotations with different resolutions and qualities.
   - Deduplicate some of the datasets to reduce the impact of duplicate samples on the training of the diffusion model.
2. **Train the Diffusion Model to Generate Synthetic Data**:
   - Use a conditional latent diffusion model (LDM) combined with a pre-trained VAE (variational autoencoder) that converts images into latent-space representations.
   - Apply the diffusion process in the latent space to generate new images and the corresponding annotations (such as bounding boxes or segmentation masks).
3. **Design Multiple Experimental Schemes**:
   - VAE Upscaling: Crop and scale all images to the native resolution of LDPolyp (480×480), and then use the VAE to upsample to the target resolution (640×640).
   - Fine-tuning: Downsample some datasets to 640×640, keep the native resolution of LDPolyp, first train the LDM and then fine-tune.
   - Alternate Batch/Epoch: Alternate between data of different resolutions during training.
   - Mixed Generated and Real: First train the LDM on LDPolyp, generate a large amount of synthetic data, and then train jointly with real data.
4. **Evaluate the Quality of the Generated Data**:
   - Use metrics such as FID (Fréchet Inception Distance) and IS (Inception Score) to evaluate the quality of the generated data.
   - The results show that the quality of the generated data varies across training strategies, but relatively realistic synthetic data can generally be generated.
5. **Transfer Learning Experiments**:
   - Use the YOLOv9 model to conduct transfer learning experiments on the polyp localization task, comparing model performance when training on a small amount of real data alone against training on real data combined with synthetic data.
   - The experimental results show that adding synthetic data can significantly improve model performance, especially when little real data is available.

### Conclusions

The research in this paper shows that synthetic data generated by a diffusion model can effectively improve performance on the polyp localization task in the low-data regime. Although the generated data is not perfect in quality, its diversity helps the model learn better features, thereby improving generalization. Future work can further optimize the training strategy of the generative model to bring the generated data closer to the real distribution and further reduce the need for data collection.
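
The "crop and scale to 480×480" preprocessing listed among the experimental schemes also has to be applied to the localization annotations, otherwise the boxes no longer match the resized frames. The sketch below illustrates one way this could be done. It is not taken from the paper: it assumes a center crop to a square followed by a uniform resize, and boxes stored as pixel coordinates `(x_min, y_min, x_max, y_max)`. The function name `center_crop_and_scale_box` and the `min_side` threshold are hypothetical.

```python
from typing import Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def center_crop_and_scale_box(
    box: Box,
    image_size: Tuple[int, int],
    target: int = 480,
    min_side: float = 1.0,
) -> Optional[Box]:
    """Map a box from the original frame to a center-cropped, resized square.

    image_size is (width, height). The frame is assumed to be center-cropped
    to a square of side min(width, height) and then resized to target x target.
    Returns None when the clipped box is smaller than min_side pixels.
    """
    width, height = image_size
    side = min(width, height)
    # Offset of the square crop inside the original frame.
    off_x = (width - side) / 2.0
    off_y = (height - side) / 2.0
    scale = target / side

    x0, y0, x1, y1 = box
    # Shift into crop coordinates, clip to the crop, then rescale.
    x0 = min(max(x0 - off_x, 0.0), side) * scale
    y0 = min(max(y0 - off_y, 0.0), side) * scale
    x1 = min(max(x1 - off_x, 0.0), side) * scale
    y1 = min(max(y1 - off_y, 0.0), side) * scale

    # Drop boxes that the crop removed (almost) entirely.
    if x1 - x0 < min_side or y1 - y0 < min_side:
        return None
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    # Hypothetical 1240x1080 frame with a box near the centre.
    print(center_crop_and_scale_box((400, 300, 700, 600), (1240, 1080)))
```

Returning `None` for boxes that fall outside the crop lets the caller decide whether to discard the frame or keep it as a negative example. Which of those the paper does is not stated in the summary.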

Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?
