Azizjon Azimi, Bonu Boboeva, Ilyas Varshavskiy, Shuhrat Khalilbekov, Akhlitdin Nizamitdinov, Najima Noyoftova, Sergey Shulgin
Abstract: The phenomenon of "black swans" has posed a fundamental challenge to the performance of classical machine learning models. The perceived rise in the frequency of outlier conditions, especially in the post-pandemic environment, has necessitated exploration of synthetic data as a complement to real data in model training. This article provides a general overview and experimental investigation of the zGAN model architecture, developed to generate synthetic tabular data with outlier characteristics. The model is put to the test in binary classification environments and shows promising results on realistic synthetic data generation, as well as uplift capabilities vis-à-vis model performance. A distinctive feature of zGAN is its enhanced correlation capability between features in the generated data, replicating the feature correlations of the real training data. Equally important is zGAN's ability to generate outliers based on the covariance of real data or on synthetically generated covariances. This approach to outlier generation enables modeling of complex economic events and augmentation of outliers for tasks such as training predictive models and detecting, processing, or removing outliers. Experiments and comparative analyses in this study were conducted on both private (credit risk in financial services) and public datasets.
What problem does this paper attempt to address?
The paper aims to improve the ability of Generative Adversarial Networks (GANs) to generate realistic synthetic data with outlier characteristics, particularly for "black swan" events (rare and unpredictable events). Specifically, the research aims to:
1. **Address data scarcity and privacy issues**: Although improved data collection has increased data availability, privacy and access restrictions have become more stringent, particularly for data involving personally identifiable information, trade secrets, or intellectual property. Synthetic data that supplements real data can alleviate the resulting scarcity.
2. **Enhance the model's ability to handle outliers**: Traditional machine learning models lose performance when they encounter outliers, especially in the post-pandemic environment, where abnormal conditions have become more frequent. Training on synthetic data that contains outliers is intended to help a model better identify, handle, or remove them.
3. **Improve the replication of feature correlations**: Existing GANs often fail to preserve the correlations between features in the original data. zGAN strengthens the replication of these correlations so that the generated data follows the real distribution more closely.
4. **Generate outliers with specific distribution tails**: Based on Extreme Value Theory (EVT), zGAN can generate outliers that follow different tail types (light-tailed, bounded-tailed, and heavy-tailed) to simulate complex economic events and other rare events.
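
The three tail types named in item 4 can be illustrated with the generalized Pareto distribution (GPD), which EVT uses to model exceedances over a high threshold. The following is a minimal sketch, not zGAN's actual generator: the source does not describe how zGAN parameterizes its tails, so the function names, the shape values in `TAILS`, and the unit scale are all assumptions chosen for illustration. In the GPD, a positive shape gives a heavy tail, a zero shape gives a light (exponential) tail, and a negative shape gives a bounded tail.

```python
import math
import random


def gpd_quantile(u, shape, scale=1.0):
    """Inverse CDF of the generalized Pareto distribution for u in [0, 1)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must be in [0, 1)")
    if shape == 0.0:
        # Limiting case: exponential (light) tail.
        return -scale * math.log(1.0 - u)
    # shape > 0: heavy tail; shape < 0: bounded tail with endpoint scale / -shape.
    return scale / shape * ((1.0 - u) ** (-shape) - 1.0)


def sample_exceedances(n, shape, scale=1.0, seed=0):
    """Draw n threshold exceedances by inverse-transform sampling."""
    rng = random.Random(seed)
    return [gpd_quantile(rng.random(), shape, scale) for _ in range(n)]


# Hypothetical shape values, one per tail type (assumption, not from the paper).
TAILS = {"heavy": 0.5, "light": 0.0, "bounded": -0.5}

if __name__ == "__main__":
    for name, shape in TAILS.items():
        draws = sample_exceedances(10_000, shape)
        print(f"{name:8s} shape={shape:+.1f} max={max(draws):.2f}")
```

With these assumed shapes and unit scale, the 0.99 quantile is about 18.0 for the heavy tail, about 4.6 for the light tail, and 1.8 for the bounded tail, whose values can never reach 2.0. Exceedances drawn this way would be added to a chosen threshold of a feature to produce outlier values.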
### Main contributions of zGAN
- **Generate synthetic tabular data with outlier characteristics**: zGAN focuses on generating synthetic data containing outliers to supplement the information value of historical training data.
- **Improve the predictability of rare events**: By generating outliers, zGAN supports the modeling and analysis of fundamentally new events, with the aim of improving a model's predictive ability for future rare events.
- **Enhance the stability of model training**: Augmenting existing datasets with generated outliers can make the trained model more stable and better able to detect, remove, or handle outliers.
- **Ensure data privacy**: By using a hash similarity filter, zGAN ensures that the generated synthetic data does not leak real customer data.
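
The idea behind the privacy filter in the last item can be sketched as follows. The source names a "hash similarity filter" but does not describe its mechanism, so this example is an assumption: it hashes each row after rounding numeric values and discards any synthetic row whose fingerprint matches a real row. The function names and the `decimals` parameter are hypothetical.

```python
import hashlib


def row_fingerprint(row, decimals=2):
    """Hash a row after rounding floats, so values that round alike collide."""
    parts = []
    for value in row:
        if isinstance(value, float):
            parts.append(f"{round(value, decimals):.{decimals}f}")
        else:
            parts.append(str(value))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def filter_synthetic(real_rows, synthetic_rows, decimals=2):
    """Keep only synthetic rows whose fingerprint matches no real row."""
    real_hashes = {row_fingerprint(r, decimals) for r in real_rows}
    return [
        s for s in synthetic_rows
        if row_fingerprint(s, decimals) not in real_hashes
    ]


if __name__ == "__main__":
    real = [(35, 1200.50, "A"), (42, 830.00, "B")]
    synthetic = [(35, 1200.501, "A"), (29, 640.25, "C")]
    # The first synthetic row rounds to a real row and is dropped.
    print(filter_synthetic(real, synthetic))
```

Rounding only catches near-duplicates that fall in the same rounding bucket; two values on opposite sides of a rounding boundary would not collide. A production filter would likely need a stronger notion of similarity than this sketch provides.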
### Experimental verification
The experiments evaluate zGAN's ability to generate synthetic data with outliers and report performance gains in binary classification tasks. According to the reported results, zGAN outperforms other GAN models in most cases, especially on data containing outliers.
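
One way to quantify such an uplift is to score the same held-out test set with a classifier trained on real data only and with one trained on real plus synthetic data, then compare a ranking metric. The sketch below assumes ROC AUC as that metric; the source does not state which metric the authors used, and the function names and example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of rows, which is fine
    for an illustration but not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def uplift(labels, baseline_scores, augmented_scores):
    """AUC of the augmented model minus AUC of the baseline model."""
    return roc_auc(labels, augmented_scores) - roc_auc(labels, baseline_scores)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    baseline = [0.1, 0.6, 0.4, 0.8]   # model trained on real data only
    augmented = [0.1, 0.3, 0.4, 0.8]  # model trained on real + synthetic data
    print(uplift(labels, baseline, augmented))
```

In this toy example the baseline ranks three of the four positive-negative pairs correctly (AUC 0.75) and the augmented model ranks all four correctly (AUC 1.0), an uplift of 0.25. A positive value indicates that adding synthetic data improved ranking quality on the test set.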
### Summary
zGAN addresses the shortcomings of traditional GANs in handling rare events and outliers by generating synthetic data with outlier characteristics. It aims to improve the predictive ability and stability of downstream models while protecting data privacy, offering new tools and methods for data analysis in fields such as finance and healthcare.