Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano
2024-09-10
Abstract: The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5 percent in worst group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the **imbalanced classification problem** in machine learning. Specifically, the authors study how synthetic data can improve classification performance on long-tailed and group-imbalanced datasets. Standard practice is to train on static datasets of real images, which often follow long-tailed distributions: some classes have far fewer samples than others, which degrades the classifier's performance on those classes.

To address this, the authors propose a framework that uses a pre-trained generative model (such as a diffusion model) to generate useful synthetic data, with one-shot feedback from the classifier guiding the generation process. For the framework to be effective, the generated samples must be close to the support of the real data and be sufficiently diverse.

### Main contributions

1. **Introducing the feedback mechanism**: The authors propose a feedback-guided data-generation strategy, which uses feedback from a pre-trained classifier to generate synthetic samples that are useful for the classification task.
2. **Emphasizing the diversity and authenticity of samples**: The study finds that, for the classifier feedback to be effective, the synthetic data must be close to the support of the real-data distribution and be sufficiently diverse.
3. **Experimental results**: On datasets such as ImageNet-LT, Places-LT and NICO++, the method achieves state-of-the-art results, with the largest gains on minority classes. For example, on ImageNet-LT, performance on minority classes improves by more than 4% while using 50% less synthetic data; on NICO++, worst-group accuracy improves by more than 5%.

### Method overview

- **Initial training**: First, train a classifier \(f_{\phi}\) on an imbalanced real dataset \(D_{\text{real}}\).
- **Generating synthetic data**: Use a pre-trained diffusion model \(G_{\theta}\), conditioned on text prompts and randomly selected real images, to generate synthetic data \(D_{\text{syn}}\). During sampling, the feedback signal \(C(f_{\phi})\) provided by the classifier \(f_{\phi}\) steers generation towards samples that are useful for the classification task.
- **Retraining the classifier**: Finally, retrain the classifier \(f_{\phi}\) on the union of real and synthetic data, \(D_{\text{real}}\cup D_{\text{syn}}\).

### Feedback criteria

The authors explore three feedback criteria to promote the generation of samples that are useful for the classifier:

1. **Classifier loss**: Use the loss of the classifier on the generated samples as the feedback signal.
   \[ C(x, y, f_{\phi}) = L(f_{\phi}(x), y) \]
2. **Prediction entropy**: Use the entropy of the classifier's prediction on the generated samples as the feedback signal, which encourages samples that the classifier is uncertain about.
   \[ C(x, y, f_{\phi}) = H(f_{\phi}(x)) \]
3. **Hardness score**: Use the hardness score (HS) proposed by Sehwag et al., which quantifies how challenging a generated sample is for the classifier.
   \[ HS(x, y, f_{\phi}) = \frac{1}{2}\left[(f_{\phi}(x)-\mu_{y})^{T}\Sigma_{y}^{-1}(f_{\phi}(x)-\mu_{y}) + \ln(\det(\Sigma_{y})) + k\ln(2\pi)\right] \]
   This is the negative log-likelihood of \(f_{\phi}(x)\) under a Gaussian with class-wise mean \(\mu_{y}\) and covariance \(\Sigma_{y}\), where \(k\) is the dimension of \(f_{\phi}(x)\).

With these feedback criteria, the authors show how to generate synthetic data that is useful for the classification task, significantly improving classifier performance on imbalanced datasets.
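
The three criteria above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it only evaluates each criterion on a single sample and does not show how the score is fed back into diffusion sampling. It assumes that \(L\) is the cross-entropy loss, that \(f_{\phi}(x)\) is a vector of logits for the loss and entropy criteria, and that it is a \(k\)-dimensional feature vector for the hardness score. The function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def loss_criterion(logits, y):
    """C = L(f(x), y), with L assumed to be cross-entropy."""
    probs = softmax(logits)
    return -np.log(probs[y])


def entropy_criterion(logits):
    """C = H(f(x)), the Shannon entropy of the predicted distribution."""
    probs = softmax(logits)
    # Small constant guards against log(0) for near-zero probabilities.
    return -np.sum(probs * np.log(probs + 1e-12))


def hardness_score(features, mu_y, sigma_y):
    """HS = 0.5 * [d^T Sigma^-1 d + ln det(Sigma) + k ln(2 pi)], d = f(x) - mu_y."""
    k = features.shape[0]
    diff = features - mu_y
    # Solve a linear system instead of forming the explicit inverse.
    mahalanobis = diff @ np.linalg.solve(sigma_y, diff)
    # slogdet returns (sign, ln|det|) and avoids overflow in the determinant.
    _, logdet = np.linalg.slogdet(sigma_y)
    return 0.5 * (mahalanobis + logdet + k * np.log(2.0 * np.pi))


if __name__ == "__main__":
    logits = np.array([2.0, 0.5, -1.0])  # toy classifier output, 3 classes
    print("loss    :", loss_criterion(logits, y=2))
    print("entropy :", entropy_criterion(logits))

    mu = np.zeros(2)      # toy class mean
    sigma = np.eye(2)     # toy class covariance
    print("hardness:", hardness_score(np.array([3.0, 0.0]), mu, sigma))
```

All three scores grow as a sample becomes harder for the classifier: the loss rises when the target class receives low probability, the entropy rises as the prediction approaches uniform, and the hardness score rises as the features move away from the class mean.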