Abstract: Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand the potential harms of microplastics are obstructed by a lack of available data. Deep learning techniques in particular are challenged by domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Fréchet Inception Distance (FID) and Inception Score (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
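For reference, the two quality metrics named in the abstract have standard definitions, restated here in their usual form. The notation is an assumption of this note and is not taken from the paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception-network features for real and generated images, $p(y \mid x)$ is the Inception class posterior for a generated image $x$, and $p(y)$ is its marginal over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}\,
  D_{\mathrm{KL}}\!\left( p(y \mid x) \,\|\, p(y) \right) \right)
```

A lower FID indicates generated images whose feature statistics are closer to those of the real data, while a higher IS indicates samples that are both confidently classified and diverse across classes.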

What problem does this paper attempt to address?

The paper addresses a data problem in microplastics research: the available data sets are small and imbalanced, which makes machine learning methods difficult to apply effectively. Ingestion or inhalation of microplastic particles is a growing concern for human health, but current research methods are hindered by a lack of publicly available data. Deep learning techniques in particular struggle when only small or imbalanced data sets are available. To address this, the paper proposes the GANsemble framework, which enhances small and imbalanced microplastics data sets by generating synthetic data with conditional generative adversarial networks (cGANs), with the aim of improving model performance.

### Main contributions of the paper:

1. **GANsemble framework**: A two-module framework that combines data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data.
2. **Data chooser module**: Automates the choice of augmentation strategy by searching over candidate strategies for the best-performing one.
3. **cGAN module**: Trains a cGAN using the selected augmentation strategy to generate higher-quality synthetic microplastics data.
4. **SYMP-Filter algorithm**: Post-processes the generated synthetic data to further improve its quality.
5. **Experimental verification**: Runs experiments on a small and imbalanced microplastics data set and establishes quality baselines for synthetic microplastics data in terms of Fréchet Inception Distance (FID) and Inception Score (IS).

### Specific problems solved:

- **Insufficient data**: Generating synthetic data increases the size of the data set.
- **Class imbalance**: Oversampling the minority classes with augmentation balances the data set, improving model performance across classes.
- **Data augmentation strategy selection**: The best augmentation strategy is selected automatically, which reduces manual intervention and is intended to improve the model's ability to generalize.

### Experimental results:

- **Data augmentation strategy selection**: The data chooser module identified a best augmentation strategy (Aug∗), which is reported to significantly improve model performance.
- **cGAN performance**: The cGAN trained on data augmented with Aug∗ performs well at generating synthetic microplastics data, and its generated images are reported to be superior to those of other methods in feature variability, visual quality, and similarity to real data.
- **Quality assessment**: The quality of the generated synthetic data was evaluated with FID and IS, and the data generated with Aug∗ achieved the best balance between the two metrics.

In conclusion, the paper addresses the problem of small and imbalanced data sets in microplastics research by proposing the GANsemble framework, and it provides a baseline for subsequent research.
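The strategy search and oversampling steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the exhaustive search over combinations, the function names, the toy `flip` and `jitter` augmentations, and the `diversity_score` placeholder are all assumptions made for illustration. In practice the score would more plausibly be something like the validation performance of a classifier trained on the augmented data.

```python
import itertools
import random
from collections import Counter, defaultdict


def oversample_with_augmentation(samples, labels, augment, rng):
    """Balance classes by adding augmented copies of minority-class samples."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())  # size of largest class
    out_x, out_y = list(samples), list(labels)
    for y, xs in by_class.items():
        for _ in range(target - len(xs)):
            out_x.append(augment(rng.choice(xs), rng))
            out_y.append(y)
    return out_x, out_y


def compose(fns):
    """Chain several augmentation functions into one."""
    def apply(x, rng):
        for fn in fns:
            x = fn(x, rng)
        return x
    return apply


def choose_strategy(samples, labels, augmentations, score, max_combo=2, seed=0):
    """Score every combination of up to max_combo augmentations; keep the best.

    Ties keep the earlier (smaller) combination because the comparison is strict.
    """
    best = None
    for k in range(1, max_combo + 1):
        for combo in itertools.combinations(sorted(augmentations), k):
            rng = random.Random(seed)  # same seed so strategies are comparable
            aug = compose([augmentations[name] for name in combo])
            xs, ys = oversample_with_augmentation(samples, labels, aug, rng)
            s = score(xs, ys)
            if best is None or s > best[1]:
                best = (combo, s)
    return best


# Hypothetical toy augmentations on flat "images" (lists of floats).
def flip(x, rng):
    return x[::-1]


def jitter(x, rng):
    return [v + rng.uniform(-0.1, 0.1) for v in x]


def diversity_score(xs, ys):
    """Placeholder score (assumption): fraction of distinct samples."""
    return len({tuple(x) for x in xs}) / len(xs)


# Toy imbalanced data set: three majority samples, one minority sample.
samples = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [1.0, 0.0, 1.0], [3.0, 3.0, 0.0]]
labels = ["majority", "majority", "majority", "minority"]

balanced_x, balanced_y = oversample_with_augmentation(
    samples, labels, flip, random.Random(0)
)
print(Counter(balanced_y))  # class counts after oversampling

best_combo, best_score = choose_strategy(
    samples, labels, {"flip": flip, "jitter": jitter}, diversity_score
)
print(best_combo, best_score)
```

The score is passed in as a callable so that the search loop stays independent of how a strategy is judged. Swapping the placeholder for a real evaluation (for example, training and validating a classifier) would not change the rest of the sketch.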

GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data
