Generative Expansion of Small Datasets: An Expansive Graph Approach

Vahid Jebraeeli, Bo Jiang, Hamid Krim, Derya Cansever
2024-10-02
Abstract: Limited data availability in machine learning significantly impacts performance and generalization. Traditional augmentation methods enhance moderately sufficient datasets. GANs struggle with convergence when generating diverse samples. Diffusion models, while effective, have high computational costs. We introduce an Expansive Synthesis model generating large-scale, information-rich datasets from minimal samples. It uses expander graph mappings and feature interpolation to preserve data distribution and feature relationships. The model leverages neural networks' non-linear latent space, captured by a Koopman operator, to create a linear feature space for dataset expansion. An autoencoder with self-attention layers and optimal transport refines distributional consistency. We validate by comparing classifiers trained on generated data to those trained on original datasets. Results show comparable performance, demonstrating the model's potential to augment training data effectively. This work advances data generation, addressing scarcity in machine learning applications.
Machine Learning, Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
The problem this paper addresses is **data scarcity in machine learning**, in particular the poor performance and generalization of models trained on small-scale datasets. Traditional data augmentation methods have limited effectiveness on extremely small datasets. Generative Adversarial Networks (GANs) have convergence problems when generating diverse samples, and diffusion models, although effective, are computationally expensive. The authors therefore propose a new model, the **Expansive Synthesis model**, which aims to generate large-scale, information-rich, and distribution-consistent datasets from a small number of samples. The model relies on the following key techniques:

1. **Expander Graph Mapping**: Use the properties of expander graphs to generate diverse data points while maintaining the original data distribution.
2. **Feature Interpolation**: Ensure that the generated data points are reasonably distributed in the feature space.
3. **Koopman Operator Theory**: Use a Koopman operator to capture the nonlinear latent space of the neural network and map it to a linear feature space in which the dataset can be expanded.
4. **Self-Attention Layers**: Improve feature extraction so that the generated data points are more expressive and diverse.
5. **Optimal Transport**: Use techniques such as the Wasserstein distance to keep the distribution of the generated data consistent with the original data distribution.

Specifically, the Expansive Synthesis model works as follows:

- First, use a pre-trained autoencoder to encode the original dataset into a low-dimensional representation.
- Then, apply multi-head spatial self-attention to extract key features.
- Next, use the expander graph mapping to generate new data points, ensuring that they maintain the original data distribution in the feature space.
- Finally, convert the generated data points back to the high-dimensional image space through a decoder to form the augmented dataset.

The experimental results show that a classifier trained on the generated dataset performs comparably to, or even better than, a classifier trained on the original dataset, which supports the effectiveness of the model in addressing data scarcity.

In summary, the main contribution of this paper is an effective method for generating high-quality synthetic data from extremely small datasets, thereby improving the training and generalization of machine learning models.
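
The graph-mapping, interpolation, and distribution-check steps described above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes hard-coded 2D latent codes in place of encoder output, a sparse random graph as a stand-in for an expander (not a verified expander construction), plain linear interpolation along edges, and a per-coordinate 1D Wasserstein-1 distance as the consistency check. The names `sparse_random_graph`, `expand`, and `wasserstein_1d` are hypothetical, and the self-attention, Koopman, and decoder steps are omitted.

```python
import random


def sparse_random_graph(n, degree, rng):
    """Return undirected edges where each node picks `degree` random neighbors.

    Assumption: a sparse random graph stands in for an expander here; this is
    not a verified expander construction.
    """
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, degree):
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def expand(latents, degree=3, steps=4, seed=0):
    """Create new latent points by linear interpolation along graph edges."""
    rng = random.Random(seed)
    new_points = []
    for i, j in sparse_random_graph(len(latents), degree, rng):
        for k in range(1, steps + 1):
            t = k / (steps + 1)  # interior points only, endpoints excluded
            new_points.append(
                [(1 - t) * a + t * b for a, b in zip(latents[i], latents[j])]
            )
    return new_points


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1D empirical distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


# Hypothetical 2D latent codes standing in for encoder output.
latents = [[0.0, 0.0], [1.0, 0.2], [0.3, 1.0], [0.8, 0.9], [0.1, 0.6], [0.6, 0.1]]
synthetic = expand(latents)

# Per-coordinate distance between original and synthetic latent codes.
for dim in range(2):
    original = [z[dim] for z in latents]
    generated = [z[dim] for z in synthetic]
    print(f"dim {dim}: W1 = {wasserstein_1d(original, generated):.3f}")
```

Because every synthetic point is a convex combination of two original codes, it stays inside the range spanned by the originals. A small Wasserstein distance per coordinate is only a rough proxy for distributional consistency; a full pipeline would likely compare the joint distributions and decode the new points back to image space.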