Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?

Adrian Tormos, Blanca Llauradó, Fernando Núñez, Axel Romero, Dario Garcia-Gasulla, Javier Béjar
2024-11-28
Abstract: The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that do not guarantee to preserve the original tasks. To approximate the distribution of the data using generative models is a way of reducing that problem and also to obtain new samples that resemble the original data. Denoising Diffusion models is a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data. Automatic colonoscopy analysis and specifically Polyp localization in colonoscopy videos is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time consuming task and usually only small datasets can be obtained. The fine tuning of application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can generate jointly colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used on various transfer learning experiments in the task of polyp localization with a model based on YOLO v9 on the low data regime.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is that, in the medical field and especially in colon polyp localization, the performance of deep-learning models is limited by scarce data. Specifically, the authors explore how to use diffusion models to generate synthetic data that augments the dataset, and investigate whether this synthetic data improves model performance in the low-data regime.

### Main Problem Decomposition

1. **Data Scarcity Problem**:
   - Data acquisition and annotation in the medical field are costly and time-consuming, so the amount of available data is usually small.
   - With too little data, deep-learning models are undertrained, which hurts their generalization ability and prediction accuracy.
2. **Effectiveness of Data Augmentation Methods**:
   - Traditional data augmentation techniques (such as rotation and flipping) increase data diversity, but cannot ensure that the new samples remain valid for the original task.
   - Using generative models (such as diffusion models) to generate new samples that resemble real data can alleviate the data scarcity problem to some extent.
3. **Quality Evaluation of Synthetic Data**:
   - Whether the generated synthetic data is realistic enough to improve the performance of downstream tasks (such as polyp localization).
   - How to measure the quality of the generated data and its impact on model performance.

### Solutions

The paper approaches the problem through the following steps:

1. **Select and Process Multiple Public Datasets**:
   - Use four public datasets as the basis: LDPolyp, SUN, PolypGEN, and BKAI-IGH NeoPolyp-Small. These datasets contain colonoscopy video frames and annotations at different resolutions and qualities.
   - Deduplicate some of the datasets to reduce the impact of duplicate samples on the training of the diffusion model.
2. **Train the Diffusion Model to Generate Synthetic Data**:
   - Use a conditional latent diffusion model (LDM) together with a pre-trained variational autoencoder (VAE), which converts images into latent-space representations.
   - Apply the diffusion process in the latent space to generate new images together with their annotations (such as bounding boxes or segmentation masks).
3. **Design Multiple Experimental Schemes**:
   - VAE Upscaling: Crop and scale all images to the native resolution of LDPolyp (480×480), then use the VAE to upsample to the target resolution (640×640).
   - Fine-tuning: Downsample some datasets to 640×640 and keep LDPolyp at its native resolution; first train the LDM, then fine-tune it.
   - Alternate Batch/Epoch: Alternate between data of different resolutions during training.
   - Mixed Generated and Real: First train the LDM on LDPolyp, generate a large amount of synthetic data, and then train jointly on synthetic and real data.
4. **Evaluate the Quality of the Generated Data**:
   - Use metrics such as FID (Fréchet Inception Distance) and IS (Inception Score) to evaluate the quality of the generated data.
   - The results show that quality varies across training strategies, but that relatively realistic synthetic data can generally be generated.
5. **Transfer Learning Experiments**:
   - Use the YOLOv9 model in transfer learning experiments on the polyp localization task, comparing performance when training on a small amount of real data alone against training on real data combined with synthetic data.
   - The experimental results show that combining synthetic data with real data improves model performance, with the largest gains when little real data is available.

### Conclusions

The research in this paper shows that synthetic data generated by diffusion models can improve performance on the polyp localization task in the low-data regime. Although the generated data is not perfect in quality, its diversity helps the model learn better features and thereby generalize better. Future work can further optimize the training strategy of the generative model so that the generated data is closer to the real distribution, further reducing the need for data collection.
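
To make the low-data comparison from the transfer learning experiments more concrete, the following is a minimal sketch of how such training splits might be assembled: a random fraction of the real frames is kept and all generated frames are added. The function name `build_low_data_split`, the file names, and the fractions used are illustrative assumptions, not the authors' code or settings.

```python
import random


def build_low_data_split(real_items, synthetic_items, real_fraction, seed=0):
    """Sketch (assumed setup): keep a random fraction of the real items
    and combine them with all synthetic items into one training list.

    Each returned entry is a (item, source) pair, where source is
    "real" or "synthetic".
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")

    rng = random.Random(seed)  # local RNG so the split is reproducible
    n_real = round(len(real_items) * real_fraction)

    # Sample without replacement to simulate having only a little real data.
    real_subset = rng.sample(list(real_items), n_real)

    train = [(item, "real") for item in real_subset]
    train += [(item, "synthetic") for item in synthetic_items]
    rng.shuffle(train)
    return train


if __name__ == "__main__":
    # Hypothetical file names standing in for annotated frames.
    real = [f"real_{i}.jpg" for i in range(100)]
    generated = [f"gen_{i}.png" for i in range(50)]

    for fraction in (0.05, 0.1, 0.5):
        split = build_low_data_split(real, generated, fraction, seed=1)
        n_real = sum(1 for _, source in split if source == "real")
        print(fraction, n_real, len(split) - n_real)
```

Repeating the same detector training for several values of `real_fraction`, once with and once without the synthetic items, is one way to probe the question in the title of how much real data is enough.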