GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data

Daniel Platnick, Sourena Khanzadeh, Alireza Sadeghian, Richard Anthony Valenzano
2024-05-01
Abstract: Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Fréchet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a problem in microplastics research: the available data sets are small and imbalanced, which makes it difficult to apply machine learning methods effectively. Ingestion or inhalation of microplastic particles by humans is a growing concern, but research into the potential harms is hindered by a lack of available data. Deep learning techniques in particular struggle when only small or imbalanced data sets are available. To address this, the paper proposes the GANsemble framework, which uses conditional generative adversarial networks (cGANs) to generate synthetic data for small and imbalanced microplastics data sets, with the aim of improving model performance.

### Main contributions of the paper:

1. **GANsemble framework**: A two-module framework that combines data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data.
2. **Data chooser module**: Automates augmentation strategy selection by searching for the best data augmentation strategy.
3. **cGAN module**: Trains a cGAN using the selected augmentation strategy to generate synthetic microplastics data.
4. **SYMP-Filter algorithm**: Post-processes the generated synthetic microplastics (SYMP) data to further improve its quality.
5. **Experimental verification**: Experiments on a small and imbalanced microplastics data set establish quality baselines for synthetic microplastics data in terms of Fréchet Inception Distance (FID) and Inception Score (IS).

### Specific problems solved:

- **Insufficient data**: Generating synthetic data increases the size of the data set.
- **Class imbalance**: Oversampling the minority classes balances the data set, so that the model performs more evenly across classes.
- **Data augmentation strategy selection**: The best data augmentation strategy is selected automatically, which reduces manual intervention and is intended to improve the model's ability to generalize.

### Experimental results:

- **Data augmentation strategy selection**: The data chooser module identified the best data augmentation strategy (Aug∗), which improved the performance of the model.
- **cGAN performance**: The cGAN trained with data augmented by Aug∗ generated synthetic microplastics images that compared favorably with the other methods in feature variability, visual quality, and similarity to real data.
- **Quality assessment**: The quality of the generated synthetic data was evaluated with FID and IS, and the data generated with Aug∗ achieved the best balance between the two metrics.

In conclusion, the paper addresses the problem of small and imbalanced data sets in microplastics research by proposing the GANsemble framework, and it provides a baseline for subsequent research on synthetic microplastics data.
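
The data chooser step described above (searching for the best augmentation strategy) could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes an exhaustive search over small combinations of augmentations and a user-supplied scoring function, neither of which is specified in this summary. The names `choose_augmentation`, `toy_score`, and `TOY_GAIN` are hypothetical, and the toy scores are invented purely so that the example runs.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

Strategy = Tuple[str, ...]


def choose_augmentation(
    augmentations: Sequence[str],
    score: Callable[[Strategy], float],
    max_size: int = 2,
) -> Tuple[Strategy, float]:
    """Return the highest-scoring combination of augmentation names.

    Assumption: every combination of up to `max_size` augmentations is
    tried, and higher scores are better.
    """
    best_strategy: Strategy = ()
    best_score = score(best_strategy)  # baseline: no augmentation at all
    for size in range(1, max_size + 1):
        for candidate in combinations(augmentations, size):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_strategy, best_score = candidate, candidate_score
    return best_strategy, best_score


# Hypothetical stand-in for "train a model on the augmented data and
# measure validation performance". The numbers are made up.
TOY_GAIN = {"flip": 0.04, "rotate": 0.03, "noise": -0.02}


def toy_score(strategy: Strategy) -> float:
    return 0.70 + sum(TOY_GAIN[name] for name in strategy)


best, value = choose_augmentation(list(TOY_GAIN), toy_score)
print(best, round(value, 2))
```

In a real pipeline, `toy_score` would be replaced by whatever evaluation the practitioner trusts, and the selected strategy would then be passed on to the cGAN training step.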