SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems

Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
2024-04-25
Abstract: Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous scale of modern neural networks, this is prohibitively labor-intensive and thus impractical. Inspired by recent developments in generative models, this paper explores the potential of synthetic data to address the data imbalance problem. To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data. Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples achieves impressive performance on diverse tasks with different data imbalance issues, surpassing existing task-specific methods for the same purpose.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of imbalanced training data. Imbalance can bias a trained model's predictions, which in turn can cause ethical and social issues. Carefully curating the training data mitigates the problem, but doing so is labor-intensive and impractical at the scale of modern neural networks. Inspired by recent advances in generative models, the paper instead explores synthetic data as a remedy. Its method, SYNAuG, generates synthetic data to equalize the distribution of the training data. Experiments show that, despite a domain gap between real and synthetic data, training with SYNAuG and then fine-tuning on a small number of real samples performs well across tasks with different data imbalance issues and surpasses existing task-specific methods. In summary, SYNAuG uses synthetic data to alleviate data imbalance, with the aim of improving model performance, fairness, and robustness.
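The equalization idea can be illustrated with a small counting sketch. This is not the authors' implementation: it assumes the imbalance is over class labels and that "equalize" means topping up every class to the size of the largest one, neither of which is specified here. The function name `synthetic_counts_to_balance` and the example labels are hypothetical.

```python
from collections import Counter


def synthetic_counts_to_balance(labels):
    """Return, per class, how many synthetic samples would be needed
    to bring that class up to the size of the largest class.

    Assumption: the balancing target is the largest class count.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical long-tailed label set: 100 cats, 20 dogs, 5 foxes.
labels = ["cat"] * 100 + ["dog"] * 20 + ["fox"] * 5
needed = synthetic_counts_to_balance(labels)
print(needed)  # {'cat': 0, 'dog': 80, 'fox': 95}
```

In a pipeline of this kind, the returned counts would tell a generative model how many samples to produce per class before training, with the fine-tuning on real samples happening afterwards.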