Abstract: Supervised machine learning methods require large-scale training datasets to perform well in practice. Synthetic data has shown great progress recently and has been used as a complement to real data. However, there remains a pressing need to assess the usability of synthetically generated data. To this end, we propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets, surpassing existing traditional metrics. Utilizing a UCB-based dynamic approach ensures continual enhancement of model learning. Unlike other approaches, our method effectively adapts to changes in the machine learning model's state and considers the evolving utility of training samples during the training process. We show that our metric is an effective way to rank synthetic images based on their usability. Furthermore, we propose a new attribute-aware bandit pipeline for generating synthetic data by integrating a Large Language Model with Stable Diffusion. Quantitative results show that our approach can boost the performance of a wide range of supervised classifiers. Notably, we observed an improvement of up to 10% in classification accuracy compared to traditional approaches, demonstrating the effectiveness of our approach. Our source code, datasets, and additional materials are publicly available at https://github.com/A-Kerim/Synthetic-Data-Usability-2024.

What problem does this paper attempt to address?

The problem this paper addresses is how to use synthetic data effectively when training machine learning models, so that the models perform well in practical tasks. Synthetic data generation techniques have made significant progress and can now produce data at large scale, but a domain gap remains between synthetic and real data. As a result, even models trained on large-scale synthetic datasets often perform poorly in practical applications. The paper therefore proposes an approach that generates high-quality synthetic images, evaluates their usability, and trains on synthetic and real data jointly.

### Main contributions:

1. **Propose a dynamic usability metric**: It is used to evaluate the usability of synthetic images. The metric combines low-level and high-level information, so it can assess the quality of synthetic data more comprehensively.
2. **Introduce a UCB (Upper Confidence Bound)-based dynamic selection method**: It dynamically selects the most appropriate training samples in each training cycle to optimize the model's learning process.
3. **Propose a new attribute-aware Multi-Armed Bandit data generation pipeline**: It generates diverse, high-quality synthetic datasets by integrating large language models (LLMs) with Stable Diffusion models.

### Method overview:

1. **Attribute extraction**: Use an LLM to extract the main attributes from a given domain context. For example, when generating a car accident dataset, extract the most common car colors and models.
2. **Prompt creation**: Randomly sample attributes from the extracted attribute pool and use them as input parameters for the Stable Diffusion prompt template to produce specific prompts.
3. **Data generation**: Use the Stable Diffusion model to generate the required synthetic images from those prompts.
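
The prompt-creation step in the method overview can be illustrated with a minimal sketch. The attribute pool, the template wording, and the function name `make_prompts` are all hypothetical and are not taken from the paper; in the described pipeline the pool would come from an LLM and the prompts would be passed to Stable Diffusion, neither of which is called here.

```python
import random

# Hypothetical attribute pool; in the described pipeline an LLM would produce it.
ATTRIBUTE_POOL = {
    "color": ["red", "white", "black", "silver"],
    "model": ["sedan", "SUV", "pickup truck"],
}

# Hypothetical prompt template for a text-to-image model.
PROMPT_TEMPLATE = "A photo of a {color} {model} involved in a car accident"


def make_prompts(pool, template, n, seed=0):
    """Sample one value per attribute, n times, and fill the template."""
    rng = random.Random(seed)  # seeded so the sampled prompts are reproducible
    prompts = []
    for _ in range(n):
        choice = {name: rng.choice(values) for name, values in pool.items()}
        prompts.append(template.format(**choice))
    return prompts


prompts = make_prompts(ATTRIBUTE_POOL, PROMPT_TEMPLATE, n=3)
for p in prompts:
    print(p)
```

Each resulting string would then serve as the text input to the image generator. Sampling attributes independently is the simplest choice; a pipeline could just as well weight attributes by how useful the resulting images turn out to be.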

### Usability metrics:

- **Diversity and Photorealism Score (DPS)**: Evaluates the visual quality and diversity of images.
- **Feature Cohesion Score (FCS)**: Evaluates the consistency between synthetic features and real features.

### Dynamic selection method:

- **UCB-based dynamic selection**: In each training cycle, the method computes the upper confidence bound (UCB) of each synthetic sample and selects the most valuable samples for training. This focuses training on the samples that are most beneficial to the current model state.

### Experimental results:

- Quantitative experiments show the effectiveness of the method on multiple supervised classifiers. In particular, it improves classification accuracy by up to 10% compared with traditional methods.

In conclusion, the paper proposes a method that dynamically evaluates and selects synthetic data to narrow the domain gap between synthetic and real data, thereby improving model performance in practical tasks.
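
The UCB-based selection described above can be sketched with the standard UCB1 rule, which scores each arm as its mean observed reward plus an exploration bonus, $\sqrt{c \ln t / n_i}$. This is a generic illustration, not the paper's exact procedure: the summary does not give the paper's reward definition or its UCB formula, so the noisy `true_usability` signal, the constant `c`, and all function names below are assumptions.

```python
import math
import random


def ucb_scores(mean_reward, pulls, t, c=2.0):
    """UCB1 score per arm: mean reward plus an exploration bonus."""
    scores = []
    for mu, n in zip(mean_reward, pulls):
        if n == 0:
            scores.append(math.inf)  # arms never tried are selected first
        else:
            scores.append(mu + math.sqrt(c * math.log(t) / n))
    return scores


def select_top_k(scores, k):
    """Indices of the k highest-scoring arms."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def run(true_usability, rounds=200, k=2, seed=0):
    """Simulate selection; each arm stands for a synthetic sample or group."""
    rng = random.Random(seed)
    mean_reward = [0.0] * len(true_usability)
    pulls = [0] * len(true_usability)
    for t in range(1, rounds + 1):
        for i in select_top_k(ucb_scores(mean_reward, pulls, t), k):
            # Assumed reward: a noisy usability signal. In a real training
            # loop this might be a usability score or a validation gain.
            reward = true_usability[i] + rng.gauss(0.0, 0.05)
            pulls[i] += 1
            mean_reward[i] += (reward - mean_reward[i]) / pulls[i]  # running mean
    return mean_reward, pulls


mean_reward, pulls = run([0.2, 0.5, 0.8, 0.3])
print(pulls)
```

Arms with higher estimated usability are chosen more often, while the bonus term keeps rarely chosen arms from being dropped for good. That is one way to track a usefulness signal that changes as the model trains.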

Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data

Synthetica: Large Scale Synthetic Data for Robot Perception

Synthetic data augmentation for robotic mobility aids to support blind and low vision people

SynthDa: Exploiting Existing Real-World Data for Usable and Accessible Synthetic Data Generation

How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound

Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

Synthetic Image Data for Deep Learning

Improving the Effectiveness of Deep Generative Data

On the Equivalency, Substitutability, and Flexibility of Synthetic Data

Feedback-guided Data Synthesis for Imbalanced Classification

Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets

Synthetic Data for Object Classification in Industrial Applications

Optimising Individual-Treatment-Effect Using Bandits

SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models

Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Learning to Generate Synthetic Data via Compositing

Synthetic Data for Model Selection

Robust Disaster Assessment from Aerial Imagery Using Text-to-Image Synthetic Data

Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization

Efficacy of Synthetic Data as a Benchmark

Exploring the Potential of Synthetic Data to Replace Real Data