TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting

Andrei Margeloiu, Adrián Bazaga, Nikola Simidjievski, Pietro Liò, Mateja Jamnik
2024-07-29
Abstract: Tabular data is prevalent in many critical domains, yet it is often challenging to acquire in large quantities. This scarcity usually results in poor performance of machine learning models on such data. Data augmentation, a common strategy for performance improvement in vision and language tasks, typically underperforms for tabular data due to the lack of explicit symmetries in the input space. To overcome this challenge, we introduce TabMDA, a novel method for manifold data augmentation on tabular data. This method utilises a pre-trained in-context model, such as TabPFN, to map the data into an embedding space. TabMDA performs label-invariant transformations by encoding the data multiple times with varied contexts. This process explores the learned embedding space of the underlying in-context models, thereby enlarging the training dataset. TabMDA is a training-free method, making it applicable to any classifier. We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various tabular datasets. Our results demonstrate that TabMDA provides an effective way to leverage information from pre-trained in-context models to enhance the performance of downstream classifiers. Code is available at https://github.com/AdrianBZG/TabMDA.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to improve the performance of machine-learning models on small tabular datasets. Tabular data is common in fields such as medicine, physics, and chemistry, but collecting it in large quantities is often very expensive or impossible, which makes it hard to train effective models. Data augmentation (DA) is widely used in vision and language tasks to improve model performance. Tabular data, however, is heterogeneous and lacks explicit symmetries, so existing DA methods usually perform poorly on it and can even degrade model performance.

To address this challenge, the paper proposes TabMDA (Tabular Manifold Data Augmentation), a new manifold data augmentation method for tabular data. TabMDA uses a pre-trained in-context model (e.g., TabPFN) to map data into an embedding space and performs label-invariant transformations by encoding the data multiple times with different contexts. This enlarges the training dataset and indirectly passes pre-trained knowledge to the downstream classifier without any additional training. Experimental results show that TabMDA significantly improves the performance of multiple standard classifiers across different tabular datasets while reducing the performance gap between classifiers.

### Main Contributions

1. **TabMDA**: A new training-free data augmentation method that jointly embeds and augments data using a pre-trained in-context model and is applicable to any classifier.
2. **In-context Subsetting (ICS)**: A novel technique that generates label-invariant transformations by encoding the data with multiple contexts, exploring the manifold learned by the pre-trained model.

### Method Overview

The core idea of TabMDA is to use a pre-trained tabular in-context model (such as TabPFN) to embed real data into a manifold space and to generate diverse embeddings through in-context subsetting. The steps are as follows:

- **Embed Data**: Use the encoder part of the pre-trained TabPFN model to map an input sample \(x\) into its latent space.
- **In-context Subsetting (ICS)**: Generate multiple contexts by stratified sampling of the training data, each context containing \(N_{ctx}\) points. Encode each data point \(x\) to be augmented once per context; the \(k\)-th context yields the augmented embedding \(x^{(k)}_{aug}\), which keeps the label of \(x\).
- **Train Downstream Classifier**: Train the downstream classifier on the augmented dataset, which lives in the embedding space and is \(K\) times the size of the original dataset, where \(K\) is the number of contexts.

In this way, TabMDA improves classifier performance and also improves generalization and stability, especially on small tabular datasets.
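
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: `toy_context_encoder` is a hypothetical stand-in for the pre-trained in-context encoder (it is not TabPFN and uses no TabPFN API), and the function names, the `n_ctx` and `n_contexts` parameters, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def stratified_context(y, n_ctx, rng):
    """Sample roughly n_ctx training indices while preserving class proportions."""
    classes, counts = np.unique(y, return_counts=True)
    picked = []
    for c, count in zip(classes, counts):
        # Keep at least one point per class so every class appears in the context.
        n_c = max(1, int(round(n_ctx * count / len(y))))
        members = np.flatnonzero(y == c)
        picked.append(rng.choice(members, size=min(n_c, len(members)), replace=False))
    return np.concatenate(picked)


def toy_context_encoder(X, X_ctx, y_ctx):
    """HYPOTHETICAL placeholder for a pre-trained in-context encoder (not TabPFN).

    Embeds each row of X as its Euclidean distance to every class centroid
    of the context, so the embedding changes when the context changes.
    """
    classes = np.unique(y_ctx)
    centroids = np.stack([X_ctx[y_ctx == c].mean(axis=0) for c in classes])
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)


def ics_augment(X, y, n_ctx, n_contexts, encoder=toy_context_encoder, seed=0):
    """Encode every point once per sampled context; labels are left unchanged."""
    rng = np.random.default_rng(seed)
    embeddings, labels = [], []
    for _ in range(n_contexts):
        ctx = stratified_context(y, n_ctx, rng)
        embeddings.append(encoder(X, X[ctx], y[ctx]))
        labels.append(y)  # label-invariant: each embedding keeps its original label
    return np.concatenate(embeddings), np.concatenate(labels)


# Synthetic two-class data (an assumption for illustration only).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, (20, 4)), data_rng.normal(2.0, 1.0, (10, 4))])
y = np.array([0] * 20 + [1] * 10)

Z_aug, y_aug = ics_augment(X, y, n_ctx=12, n_contexts=5)
print(Z_aug.shape, y_aug.shape)  # (150, 2) (150,)
```

With 30 original points and 5 contexts, the augmented set holds 150 embeddings, matching the \(K\)-fold enlargement described above. Any classifier that accepts a feature matrix can then be fitted on `Z_aug` and `y_aug`; at test time, points would need to be embedded with the same encoder before prediction.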