Synthetic Data Approach for Classification and Regression

Yang Yue, Ying Li, Kexin Yi, Zhonghai Wu
DOI: https://doi.org/10.1109/asap.2018.8445094
2018-01-01
Abstract: The goal of this paper is to automatically generate synthetic data so that data analyzers can cope with the problem of insufficient data. Taking the most typical machine learning tasks, classification and regression, as examples, limited samples lead to models that generalize poorly and cannot provide reasonable predictions. Data may be insufficient either because samples are rare or because access is restricted for privacy or confidentiality reasons. To overcome this, we present a Synthetic Data Approach for Classification and Regression, which uses probability distributions and a k-nearest neighbor model to generate synthetic data. We first estimate the probability distribution of each feature and construct a k-nearest neighbor model over all original data samples. We then generate random samples from these probability distributions, validate them with the k-nearest neighbor model, and output the synthetic samples. We use the proposed approach to generate synthetic data for five publicly available datasets for classification and regression, respectively, and assess how closely the synthetic data resemble the original data by evaluating the performance of machine learning models trained on them. The experimental results show that the synthetic data can resemble the original data, which indicates that the approach is an effective way for data analyzers to overcome the problem of insufficient data.
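The pipeline described in the abstract (estimate per-feature distributions, build a k-nearest neighbor model, sample, validate, output) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distribution family, the value of k, or the validation rule. The sketch assumes independent Gaussian features, a brute-force nearest neighbor search, an acceptance threshold equal to the largest mean k-NN distance found among the original samples, and majority-vote labels for classification. The names `k_nearest` and `synthesize` and all parameters are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, stdev


def k_nearest(point, data, k):
    """Return (distance, index) pairs for the k rows of `data` closest to `point`."""
    return sorted((math.dist(point, row), i) for i, row in enumerate(data))[:k]


def synthesize(X, y, n_samples, k=3, seed=0, max_tries=100_000):
    """Sketch of distribution-based sampling with k-NN validation (assumptions noted)."""
    rng = random.Random(seed)

    # Step 1: estimate each feature's distribution.
    # ASSUMPTION: every feature is modelled as an independent Gaussian.
    params = [(mean(col), stdev(col)) for col in zip(*X)]

    # Step 2: the "k-NN model" here is a brute-force search over the originals.
    # ASSUMPTION: a candidate is valid if its mean distance to its k nearest
    # originals does not exceed the largest such value among the originals
    # themselves (each original compared against all the others).
    threshold = max(
        mean(d for d, _ in k_nearest(row, X[:i] + X[i + 1:], k))
        for i, row in enumerate(X)
    )

    synthetic = []
    for _ in range(max_tries):
        if len(synthetic) == n_samples:
            break
        # Step 3: draw a random candidate from the per-feature distributions.
        candidate = [rng.gauss(mu, sigma) for mu, sigma in params]
        # Step 4: validate the candidate against the original samples.
        neighbours = k_nearest(candidate, X, k)
        if mean(d for d, _ in neighbours) > threshold:
            continue  # too far from any original data: reject
        # ASSUMPTION: the label is the majority vote of the k neighbours.
        label = Counter(y[i] for _, i in neighbours).most_common(1)[0][0]
        synthetic.append((candidate, label))
    return synthetic
```

For regression, the majority vote could be replaced by, for example, the mean target of the k neighbours; that choice is likewise an assumption rather than something the abstract specifies. Because rejected candidates are discarded, the function may return fewer than `n_samples` rows if `max_tries` is exhausted.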